Tim Bray

7 months ago

So, @mitsuhiko is working on using LLMs to process XML. Except the models can’t write legal XML. So he’s using the model to generate a sloppy-XML parser: lucumr.pocoo.org/2025/6/21/my-…

OK, my mind is now made up about vibe coding. I’m saying take the ship up and nuke the site from orbit. It’s the only way to be sure.

#genAI

My First Open Source AI Generated Library

In a first for me, I published some agentic programmed AI slop to PyPI.

Armin Ronacher's Thoughts and Writings

Adam Kent

in reply to Tim Bray • 7 months ago

Interesting contrast to read this shortly after another, somewhat different post about parsers: blog.trailofbits.com/2025/06/1…

Unexpected security footguns in Go's parsers

File parsers in Go contain unexpected behaviors that can lead to serious security vulnerabilities. This post examines how JSON, XML, and YAML parsers in Go handle edge cases in ways that have repeatedly resulted in high-impact security issues in prod…

The Trail of Bits Blog

Tony Fisk

in reply to Tim Bray • 7 months ago

.. do it before the ship controls are updated with sloppy vibe xml

Elias Mårtenson

in reply to Tim Bray • 7 months ago

I looked at the code. I'm not a Python person, but I'm sure the code style itself is perfectly standard and accessible. That's the least I'd expect.

What I'm really concerned about is the fact that the entire thing is based on regexes, which is a bad idea because regular expressions aren't powerful enough to parse XML.

This is the programming version of the eating glue suggestions. Perfectly formed sentences, arguments that make sense taken in isolation, etc. But the output as a whole isn't what you need.

Armin Ronacher

in reply to Elias Mårtenson • 7 months ago

@loke there is nothing wrong with using regular expressions for parsing.

Elias Mårtenson

in reply to Armin Ronacher • 7 months ago

As part of a parser, of course it's fine. Many parsers do that. But for matching entire XML constructs, yes, there is something wrong with it. I.e., you can match a tag, but not an entire CDATA section, for example.
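To illustrate the "regex as part of a parser" point, here is a minimal Python sketch. It is not taken from the library under discussion. The `TOKEN` pattern and the `is_balanced` helper are hypothetical names, and the pattern handles only bare tags without attributes. The regex recognizes individual tokens, while an explicit stack does the nesting check that a single regular expression cannot express.

```python
import re

# Hypothetical token pattern: closing tag, self-closing tag, opening tag, or text.
# Attributes, comments, and CDATA are deliberately not handled in this sketch.
TOKEN = re.compile(
    r"</(?P<close>\w+)>|<(?P<empty>\w+)\s*/>|<(?P<open>\w+)>|(?P<text>[^<]+)"
)


def is_balanced(doc: str) -> bool:
    """Return True if every opening tag is closed in the right order."""
    stack = []
    pos = 0
    while pos < len(doc):
        m = TOKEN.match(doc, pos)
        if m is None:
            return False  # input the tokenizer does not understand
        pos = m.end()
        if m.group("open"):
            stack.append(m.group("open"))
        elif m.group("close"):
            # The stack, not the regex, enforces correct nesting.
            if not stack or stack.pop() != m.group("close"):
                return False
    return not stack


print(is_balanced("<a><b>hi</b></a>"))  # True
print(is_balanced("<a><b></a></b>"))    # False: tags closed in the wrong order
```

The regex stays small because it only has to describe one token at a time. Everything that depends on context lives in ordinary code around it.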

Armin Ronacher

in reply to Elias Mårtenson • 7 months ago

@loke the library implements a parser.

Tim Bray

in reply to Armin Ronacher • 7 months ago

@loke I was partly joking, you stepped in a pool of history. In the early days of XML there was massive controversy over whether the parsers should be “Draconian” like JSON or “Tolerant” like HTML. There are probably people out there who are still mad. And in fact it’s hard to think of an application where a “sloppy” parser would be acceptable.
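For readers who have not run into the "Draconian" behaviour, a small Python sketch of what it looks like in practice, using only the standard library. The helper names `accepts_xml` and `accepts_json` are made up for this example. A strict parser rejects the whole document on the first well-formedness error instead of guessing what was meant.

```python
import json
import xml.etree.ElementTree as ET


def accepts_xml(text: str) -> bool:
    """True if the standard-library XML parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def accepts_json(text: str) -> bool:
    """True if the standard-library JSON parser accepts the document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(accepts_xml("<a><b>ok</b></a>"))  # True: well-formed
print(accepts_xml("<a><b>oops</a>"))    # False: mismatched closing tag
print(accepts_json('{"a": 1}'))         # True
print(accepts_json('{"a": 1,}'))        # False: trailing comma
```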

Elias Mårtenson

in reply to Tim Bray • 7 months ago

Right. XML is very strict for a reason.

I have created a "quasi-sloppy" XML parser once, some 20 years ago. I needed to parse the content of an XML file which had been truncated due to the application writing it crashing in the middle. This was a very clear error condition though, and the solution was as simple as overriding some methods in the SAX parser to preserve the current state if an EOF was detected.
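A rough idea of how that kind of recovery can look, sketched in Python with the standard `xml.sax` module. This is not the original code. The `RecoveringHandler` class, the `parse_truncated` function, and the `<log>`/`<entry>` document shape are all assumptions made for illustration. The handler keeps its state as events arrive, so whatever was parsed before the error is still available afterwards.

```python
import xml.sax


class RecoveringHandler(xml.sax.ContentHandler):
    """Collects completed <entry> elements and tracks which elements are open."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # elements started but not yet closed
        self.completed = []      # text of every fully parsed <entry>
        self._text = []

    def startElement(self, name, attrs):
        self.open_elements.append(name)
        self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        self.open_elements.pop()
        if name == "entry":
            self.completed.append("".join(self._text))
        self._text = []


def parse_truncated(data: bytes):
    """Parse as far as possible; return the handler and whether parsing failed."""
    handler = RecoveringHandler()
    incomplete = False
    try:
        xml.sax.parseString(data, handler)
    except xml.sax.SAXParseException:
        # Any well-formedness error lands here, not only a premature EOF.
        # A real tool would inspect the exception before trusting the data.
        incomplete = True
    return handler, incomplete


data = b"<log><entry>a</entry><entry>b</entry><entry>c</ent"
handler, incomplete = parse_truncated(data)
print(handler.completed)      # entries that were fully written before the crash
print(handler.open_elements)  # elements still open at the point of truncation
print(incomplete)
```

The parser itself stays strict here. The leniency is limited to keeping the results gathered before the error was reported.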

I don't really want to know what kind of broken infrastructure would be behind the need for parsing invalid XML.

The "be lenient in what you accept, be precise in what you send" principle has been deemed false for a very long time now.

Todd Knarr

in reply to Tim Bray • 7 months ago

Forget nuke the site, we're well into "clear the system and dump a few nova bombs into the primary" territory.

Nick Sloan

in reply to Tim Bray • 7 months ago

It’s really disappointing to see really smart people throwing in the towel on quality to do work that isn’t all that interesting, while giving oxygen to the idea that a jumble of stolen IP can effectively replace human workers.

felix (grayscale) 🐺

in reply to Tim Bray • 7 months ago

my impression of Claude's output is that it's "answer-shaped code". it has the form of ok code, but details are wrong, choices are inconsistent, organization is weird. basically, pretty much like anything else produced by LLMs/diffusion.

I'm now even less inclined to trust LLM coding. e.g., there's a unit test that's subtly wrong: it parses "&amp;" and checks the result _contains_ "&", which will pass even if the entity isn't expanded.
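To make the problem concrete, a small Python sketch, assuming the test input was the escaped entity `&amp;`. This is not the library's actual test. `broken_parse` and `correct_parse` are hypothetical stand-ins for a parser that does and does not expand entities. The "contains" check cannot tell them apart, because the unexpanded string `&amp;` already contains `&`.

```python
from xml.sax.saxutils import unescape


def broken_parse(text: str) -> str:
    """Hypothetical buggy parser: returns the text without expanding entities."""
    return text


def correct_parse(text: str) -> str:
    """Hypothetical correct parser: expands &amp;, &lt; and &gt;."""
    return unescape(text)


source = "a &amp; b"

# Weak check: passes for BOTH parsers, so it proves nothing.
assert "&" in broken_parse(source)
assert "&" in correct_parse(source)

# Strong check: compares the exact expected text and catches the bug.
assert correct_parse(source) == "a & b"
assert broken_parse(source) != "a & b"
```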

uncanny, hard-to-spot errors.
