TL;DR: My new prior is that top-of-the-line LLMs working on easy tasks generate code that is maybe 10 % more complicated than necessary. I also think we accept this complexity too easily, because it comes from code that is right here, right now, solving an immediate problem. This may have consequences for maintenance in the long term.
(The text of the LessWrong version of this article is lightly adjusted to fit a more general audience than my usual readership of software product developers.)
The background to this discovery was that I needed to do some software plumbing in a work project. It was a simple change that mostly mirrored existing functionality. This is a perfect fit for LLMs, in my experience, so I used a frontier model to generate the code for it. The change ended up being a total of just over 200 lines, mostly additions.
The part of the generated code we’ll talk about is a 24-line function that converts an arbitrary (user-supplied) string to a safe HTTP header value.[1]
    toHeaderValue :: Text -> Text
    toHeaderValue raw =
      let
        attrChars = "!#$&+-.^_`|~"
        padHex t = if Text.length t < 2 then "0" <> t else t
        percentEncode c =
          if (isAscii c && isAlphaNum c) || elem c attrChars then
            Text.singleton c
          else
            Text.concat
              [ "%" <> padHex (Text.toUpper (Text.pack (showHex b "")))
              | b <- ByteString.unpack (encodeUtf8 (Text.singleton c))
              ]
        rfc5987Encode = Text.concatMap percentEncode
        isPrintable c = c >= ' ' && c /= '\DEL'
        replacePathSeparator c =
          if c == '/' || c == '\\' then
            '_'
          else
            c
        cleaned =
          Text.map replacePathSeparator (Text.filter isPrintable raw)
      in
        rfc5987Encode cleaned
When looking at this function in isolation, it obviously seems a bit too complicated, but remember that this was just 24 lines in a 200-line change. I confirmed that the underlying idea was correct, and that the generated tests covered all the edge cases I would want to see covered. It’s not pretty code, but it is proven correct by tests.
More importantly, it is highly local. If anything about this code needs replacing, it can be replaced without touching anything else. Apprentice-level programmers worry equally about code quality everywhere; I’ve long wanted to write an article called “Don’t worry, it’s local” where I tell these programmers that bad code quality is fine, as long as it’s self-contained in a small location.
I accepted this code. I needed the implementation to work, and this code obviously worked. It was right there, right now. It would have been silly to not accept it! Accepting it was the easy choice, and certainly not a bad decision.
However, in a pleasant twist of fate, the automated code verification pipeline for this project has a mandatory statement test coverage check, and that check failed for this code.
The check failed due to the padHex function, which takes a hexadecimal value in the range 0x0–0xff and zero-pads it if it is less than 0x10. The data passed into padHex has already gone through the isPrintable filter, which removes all characters lower than 0x20. The remaining characters encode to UTF-8 bytes that are either the character itself (0x20 or above) or part of a multi-byte sequence (0x80 or above). Thus no value passed to padHex is ever below 0x10, and it never ends up padding anything! It is always a no-op. The statement coverage check warns on the padding branch of padHex, because it is exercised by no automated test. It is in fact impossible to exercise it in a test.
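The claim that the padding branch is unreachable can be checked by brute force. The following is a minimal sketch and not part of the original change: `hexBytes` and `needsPadding` are names made up for this illustration, and only `isPrintable` is copied from the generated code. It assumes the `text` and `bytestring` packages are available.

```haskell
import qualified Data.ByteString as ByteString
import qualified Data.Text as Text
import Data.Text.Encoding (encodeUtf8)
import Numeric (showHex)

-- Same predicate as isPrintable in the generated code.
isPrintable :: Char -> Bool
isPrintable c = c >= ' ' && c /= '\DEL'

-- Unpadded hex string for each UTF-8 byte of a character (hypothetical helper).
hexBytes :: Char -> [String]
hexBytes c =
  [ showHex b ""
  | b <- ByteString.unpack (encodeUtf8 (Text.singleton c))
  ]

-- Every character that survives the filter and still has a byte whose
-- hex form is a single digit. Text replaces surrogate code points with
-- U+FFFD, which does not affect the result.
needsPadding :: [Char]
needsPadding =
  [ c
  | c <- ['\0' .. maxBound]
  , isPrintable c
  , any ((< 2) . length) (hexBytes c)
  ]

main :: IO ()
main = do
  -- A newline encodes to the single hex digit "a", so it would need padding...
  print (hexBytes '\n')      -- ["a"]
  -- ...but it never gets past the filter, and neither does anything else.
  print (null needsPadding)  -- True
```

The second line printing `True` is the whole point: after the filter, no input exists that could drive a test into the padding branch.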
This was annoying:
On the one hand, we shouldn’t assume percentEncode is always called with characters greater than 0x1f, even if that happens to be true at the moment. Such an assumption relies on spooky action at a distance, which, even if it is local to this function, we want to avoid.
On the other hand, the coverage report is right too: there is something awkward about this whole construction.
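One way to dissolve this tension, sketched here as an assumption rather than as what ended up shipping, is to format each byte so that it always comes out as two hex digits, leaving no conditional to cover. `hexByte` is a hypothetical helper that uses only `base`.

```haskell
import Data.Char (intToDigit, toUpper)
import Data.Word (Word8)

-- Always two uppercase hex digits: the high nibble, then the low nibble.
-- There is no padding branch, so there is nothing for coverage to miss
-- and no dependence on what an earlier filter let through.
hexByte :: Word8 -> String
hexByte b = map (toUpper . intToDigit . fromIntegral) [b `div` 16, b `mod` 16]

main :: IO ()
main = mapM_ (putStrLn . hexByte) [0x0a, 0x20, 0xc3]
-- prints 0A, 20 and C3 on separate lines
```

A function like this is correct for every byte, so it stays correct if the filter in front of it is ever changed or removed.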
So I stepped in and wrote my own implementation. The implementation that ended up shipping was closer to this:
    toHeaderValue :: Text -> Text
    toHeaderValue =
      let
        retainPrintable = Text.filter (\c -> c >= ' ' && c /= '\DEL')
        replacePathSeparators = Text.replace "/" "_" . Text.replace "\\" "_"
        -- URL encoding is also legal RFC5987 encoding.
        rfc5987Encode = decodeUtf8 . urlEncode True . encodeUtf8
      in
        rfc5987Encode . replacePathSeparators . retainPrintable
The function went from 24 lines to 9, so this is 15 lines of complexity shorter. That’s around 8 % of the 200-line change.
The LLM did not generate bad code.[2] It just generated code that was at least 8 % more complex than it needed to be. That’s not a disaster today, and when there’s pressure to ship, it is easy to accept it because it is right there, right now, and it solves the problem. I accepted and was about to ship code that was 8 % too complex. It was only by chance I looked into it more deeply and realised the problems with it.
This experience leaves me with a bunch of questions I don’t have answers to.
What about all the other changes that are also unnecessarily complex, but which I accept anyway?
What if this was an easy case, and when we sic an LLM on a more complicated task, it generates code that is more than 8 % too complex, like 20 %, or 40 %, or even 3× more complex than it needs to be?
Will we put our foot down when we get code that is so unnecessarily complex? Or will we accept it, because it’s not a disaster today, and it is right there, right now?
What happens in a year or two, when we continue shipping code that’s consistently more complex than it needs to be?
On the one hand, this worries me. On the other hand, the obvious counter-argument is that code-generating robots improve fast enough that in two years’ time when this becomes a problem, they will know how to deal with it.
Maybe. I’m not convinced.
I see extra complexity every time I try to get coding help.
For background, I’m not much of a programmer. The biggest project I’ve made is a ~1000 line script that almost runs in a straight line from start to finish. It’s enough to turn a week of constant work into five minutes of careful checking, or a few minutes of skilled labor into a single button press, but nothing compared to True Software(TM).
Whenever I would look up how to do something (open a file, create a folder, play a sound), I’d find a 20-line monstrosity given as a minimal example on Stack Overflow, or something worse on Copilot (I’ve since graduated to Claude). After picking apart which sections actually matter, I’d add the two important lines to my code and go on to the next problem.
If LLM code were only 10% more complex than necessary, LLMs would be obviously superhuman.
I had a similar experience trying to find something useful about Java. Some things are used together so often that I suspect most people don’t even know how to use them separately. You ask about one thing, and their shortest example is 500 lines of project setup, dependency injection, and whatever else, where 5 lines would perfectly address your question.
It’s like asking “how can I change a light bulb”, and getting answers like “here is how to renovate your entire house (includes replacing all the light bulbs)”.
Makes me suspect that most people do not understand their craft as deeply as I would prefer. IT development used to be a job for people who were obsessed with technical details; now there are many people who are happy just to put things together and collect a salary.
That said, in my experience AIs are a huge improvement over Stack Overflow, because you can ask them additional questions and often get the answer you want (where SO would be like “just accept the correct answer, this is not a place for discussion”).
My guess is that there are automated ways that will help with e.g. 90% (or even all) of cases like this:
Just asking the model “see this patch, try making it cleaner and more concise if possible, while keeping all the important logic” would likely help in your case.
You could also have some “critic” role. The insight “this looks too long and complicated on the first skim” is something an LLM could also say here, and then you could ask the model to improve that part.
Generally LLMs are really good at refactoring, but it feels like people don’t use them for that purpose enough because that costs time and tokens. But I don’t see a good reason why it would stay that way forever.
So, in other words, I would predict that with the current LLMs you could have “high quality code” scaffold that produces high quality code, just at a cost.
In the traditional software lifecycle, this does not look sustainable.
It seems that the reason agentic coding sort of works today is that LLMs make “refactoring by rewriting/regenerating from scratch” affordable. When people regenerate from scratch, they avoid one traditional source of problems (accumulations of defects on top of defects on top of defects on top of …), so the LLM complexity overhead stays bounded instead of growing without limit and sinking the project.
Of course, in a year or two people expect to benefit from better coding agents, with better taste and less propensity for unnecessary complexity (and so they do expect that eventually this practice of “refactoring by regenerating from scratch” will wash the unnecessary complexity away as agents become better).
I think that we need to start rethinking some of our best practices. For instance, does DRY matter? If I have a server that serves a bunch of endpoints, and each one reimplements some fetching function, is that bad? As you said, if all of the code is local, and we need to fix one of the endpoints, we can just fix that endpoint, and not break anything else in the process. If we need to make changes across the endpoints, we can tell the LLM to do so.
I’m not sure about this though. Right now, I don’t think that’s the right call—mostly because humans still have to read and understand the code—but I expect that in the very near future we won’t be reading much of the code anyway.
It’s a weird time we’re living in right now as software engineers.