Today, I needed to work through a substantial project with a lot of drudgery (checking through an entire 1M+ LOC codebase for an HTTP API for patterns which could cause state leakage between requests if we made a specific change to the request handling infrastructure). This involved a mix of things which are easy to do programmatically and things which require intelligent judgement, and had a fairly objective desired artifact (a list of all the places where state could leak, and a failing functional test demonstrating that leakage for each one).
I decided to do the John Henry thing—I set up Claude Code (in a container with --dangerously-skip-permissions) in one worktree with a detailed description of the project and the task, and then in a separate worktree I set off with my favorite text editor and without the help of any AI tooling more advanced than Copilot.
I finished about 4 hours later, despite fairly frequent interruptions to provide clarification and further instructions to Claude. Claude is now reaching the 7 hour / 100M token mark and has still not finished, though it has multiple times now declared that it has succeeded at the task and that the codebase is safe for this migration (it’s not).
I’m honestly pretty shocked, because this task seemed like a pretty much perfect fit for a coding agent, and is one that doesn’t require all that much codebase-specific context. I went into this expecting to lose—I was trying to quantify how much coding agents can help with obnoxious repetitive maintenance tasks, thus allowing maintenance work which might otherwise have been deferred to happen at lower cost. But I guess that’s not the post I’m writing today (which is a bummer, I had a whole outline planned out and everything).
Likely this situation will change by next year, but for now I suppose the machine cannot yet replace even the more repetitive parts of my job. Perhaps things are different in ML land but I kind of doubt it.
I applaud these very specific AI capability tests by individuals, and wish more people would post them, especially with official benchmarks being so unreliable nowadays. (Like, here is my concrete project with this very specific task I actually needed done, this is how long I estimate it would take me, and this is how the AI spectacularly succeeded / spectacularly failed.)
I never raced the AI like this in real time, maybe I should try sometime. (My impression so far has been that it can either do a task, and then it’s much faster than me, or it cannot, no matter how much time it’s given.)
In the spirit of posting more on-the-ground impressions of capability: in my fairly simple front-end coding job, I’ve gone in the past year from writing maybe 50% of my code with AI to maybe 90%.
My job the past couple of months has been this: attending meetings to work out project requirements; breaking those requirements into a more specific sequence of tasks for the AI (often just three or four prompts with a couple of paragraphs of explanation each); running through those in Cursor, reviewing the changes and making usually pretty minor edits; testing, which almost never reveals errors introduced by the AI itself in recent weeks; and finally pushing the code out to the repos.
Most of the edits I make have to do with the models’ reluctance to delete code. So, for example, if a block of code in function A needs to be moved into its own function so that functions B and C can call it, the AI will often just repeat the code block in B and C so that it doesn’t have to delete anything in A. It also sometimes comes up with strange excuses to avoid deleting code that’s become superfluous.
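Roughly, the pattern looks like this (a minimal sketch with invented function names, not code from my actual project):

```php
<?php
// What I want after the refactor: the shared block extracted into one helper
// that A, B, and C all call.
function normalizeRow(array $row): array
{
    // ...the block that used to live inline in functionA...
    return array_map('trim', $row);
}

function functionA(array $row): array { return normalizeRow($row); }
function functionB(array $row): array { return normalizeRow($row); }
function functionC(array $row): array { return normalizeRow($row); }

// What the model tends to produce instead: functionA keeps its original inline
// copy untouched, and functionB / functionC each get a verbatim copy of the
// same block, so nothing ever has to be deleted.
```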
The models also occasionally have an issue where they’ll add fallbacks to prevent functions from returning an error even when they really should return an error, such as when a critical API call returns bad data.
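As a hedged sketch of that failure mode (invented endpoint and function name, not from my codebase):

```php
<?php
// Sketch of the fallback anti-pattern described above (hypothetical example).

function fetchExchangeRate(string $currency): float
{
    $raw = file_get_contents("https://api.example.com/rates/{$currency}");
    $response = json_decode($raw, true);

    // The model's tendency: paper over bad data with a default so the function
    // can never error out, e.g.
    //   return (float) ($response['rate'] ?? 1.0);

    // What I actually want: bad data from a critical API call is an error.
    if (!is_array($response) || !isset($response['rate']) || !is_numeric($response['rate'])) {
        throw new RuntimeException("Bad rate data for {$currency}");
    }

    return (float) $response['rate'];
}
```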
So, in a way, the main bottleneck to the AI doing everything one-shot at this point seems to be alignment rather than capability: the models were trained to avoid errors and avoid deleting code, and they care more about those than about producing good codebases. Though, that said, these issues almost never actually produce bugs, and dealing with them is arguably more stylistic than functional.
In my department, I think all of the other developers are using AI in the same way- judging by how the style of the code they’ve been deploying has changed recently- but nobody talks about it. It’s treated almost like an embarrassing open secret, like people watching YouTube videos while on the clock, and I think everyone’s afraid that if the project managers ever get a clear picture of how much the developers are acting like PMs for AI, the business will start cutting jobs.
the AI will often just repeat the code block in B and C so that it doesn’t have to delete anything in A
Some human devs do this too. In the short term it reduces the likelihood of breaking things because something you weren’t aware of relied on the old version. In the long term it makes changes harder, because now if you want to change the logic instead of changing it in one place you have to change it in n similar but usually not identical places, and if those places are different in ways that affect the implementation, now you have to try to make an informed guess about whether they’re different on purpose and if so why. Down that path lies madness.
I’ve gone in the past year from writing maybe 50% of my code with AI to maybe 90%.
I’m curious what fraction of your non-boilerplate, non-test code that ends up in production is AI-generated. Do you review it manually?
At this point probably >95% of the code I cause to be written is AI-generated. Most of the AI-generated code is exploratory[1] or rote[2], though. About 75% of the code I merge is AI-generated, but most of that is either boilerplate or tests[3], and only 20% or so of the non-boilerplate, non-test code that makes it onto prod.
In any case, I should probably make a top level shortform to this effect, since this one got a lot more engagement than I was expecting—it was intended to be “I tried to get a measurement of how much AI can help me with maintenance work, and the attempt failed in an entertaining way” with a side of “I don’t think I’m going to be substantially replaced by clopus just yet”, but I have a bad feeling people are over-updating to “LLMs don’t help with programming”, which is not my experience at all.
e.g. mocks of what a flow could look like, comparing lots of different alerting thresholds against historical data to see how I want to configure alarms
DI wiring, includes, docblocks, that sort of thing. Basically the stuff I had keyboard shortcuts to fill in for me in 5 keystrokes in the days before AI.
“Write tests” is, by far, the area where I get the most value out of current LLM coding agents. I rarely have the time or energy to write the test suite I really want by hand, but I do have enough time to rattle off a few dozen “when x happens and then y happens, z should be true” style things, and LLM coding agents can usually come up with more. Test code is also very tolerant of copy/paste/modify (so I’m willing to say “look at this example, and copy shamelessly from it to fit your needs”), and is also much more tolerant of bad code than user-facing code is (since rewrites are low-risk and will generally break in obvious ways if they break). Between these factors, I am usually quite happy to ship LLM-written tests.
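To make that concrete, the seed tests I rattle off are typically about this small (hypothetical class and method names, PHPUnit style), and the agent is then asked to expand on the pattern:

```php
<?php
// Hypothetical "when X happens, Z should be true" seed test; Cart and Coupon
// are invented classes used only to illustrate the shape of the prompt.

use PHPUnit\Framework\TestCase;

final class CartDiscountTest extends TestCase
{
    public function testExpiredCouponDoesNotReduceTheTotal(): void
    {
        $cart = new Cart();
        $cart->addItem('widget', 100.00);
        $cart->applyCoupon(new Coupon('SAVE10', new DateTimeImmutable('-1 day')));

        $this->assertSame(100.00, $cart->total(), 'an expired coupon should not change the total');
    }
}
```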
I never raced the AI like this in real time, maybe I should try sometime.
I strongly recommend you do. I expect you will have fun doing it, and that you will grow as a developer by doing so whether or not the AI beats you or even succeeds at the task. Even if the AI fails, it will likely use different tools than the ones you would have used, so you’ll likely pick up new tricks. Having an AI to race against is also pretty great for staying focused and not getting sucked into rabbit holes—and it also is great for helping you determine after the fact whether a given rabbit hole was necessary to go down (if the AI didn’t go down the rabbit hole and successfully completed the task, the rabbit hole was not necessary).
Maybe this is partially because the AI is biased towards finishing its task quickly? During training, it would eventually be cut off if it took too long, so it’s motivated to stop early and assert that it’s finished even when that isn’t true.
Was it definitely necessary to give it a lot of hints about what to do, or do you think it could’ve succeeded if you just repeatedly said “you missed something, try again”? If hints were really needed, what kinds of things did it need hints for?
I wonder if it would help to break down the task into multiple pieces. You could try:
Spin up a separate CC instance for each subdirectory in the codebase
Ask one agent to come up with a list of things to change, then have a separate instance implement each change
Just repeatedly run a prompt to “find 1-3 things to fix, then fix them,” resetting each time (this is similar to the “Ralph Wiggum method”)
For more context, the project in question was a fairly standard Laravel (PHP) project which makes heavy use of dependency injection, and which I’m looking at serving with Swoole rather than Apache. This involves moving from a shared-nothing architecture, where the framework is spun up in a fresh state for every request, to one where some things can persist from request to request. What I asked for was an exhaustive inventory of all places where this could lead to data leakage from one request to the next, with a failing functional test demonstrating the leakage in as many places as viable, and a repeatable way to generate that inventory. I provided an example of a place with such a leakage (a container-bound singleton with persistent state), instructions for how to run the test, an example of how to run arbitrary code within the framework context, and some examples of previously-built linting scripts which used reflection and AST parsing to identify problematic patterns programmatically.
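For readers who don’t spend much time in Laravel land, here is a minimal sketch of the kind of leak in the example I gave (invented service name, not the actual one from the codebase):

```php
<?php
// Minimal sketch of a container-bound singleton with persistent state (hypothetical
// service name). Under Apache/PHP-FPM the container is rebuilt for every request,
// so this is harmless; under Swoole the same worker (and therefore the same
// singleton instance) serves many requests, so $cache survives between them.

class CurrentUserCache
{
    private ?array $cache = null;

    public function get(): ?array
    {
        return $this->cache;
    }

    public function set(array $user): void
    {
        $this->cache = $user;
    }
}

// Registered as a singleton in a service provider:
//     $this->app->singleton(CurrentUserCache::class);
//
// Request 1 (user A) calls set(); request 2 (user B) lands on the same worker
// and get() still returns user A's data: state has leaked between requests.
```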
Claude correctly identified the concrete places where one might find such state persisting across requests (global variables, static class variables, singleton services bound to the container, dependencies of said singletons, stateful database connections).
Initially Claude tried to identify potentially problematic patterns it could grep for, which turned up some examples. It then wrote two functional tests (which I later looked at and noticed were basically equivalent to assertFalse(true)), ran them to verify that they were failing, produced a report describing two “CRITICAL” issues with static class variables, declared that the application did not contain any instances of container-bound singletons with persistent state or state that could persist on a database connection between requests, and declared the task completed.
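For contrast, the shape of test I was actually asking for is roughly this (a sketch only, with an invented route and service name, using Laravel’s functional test helpers; the long-lived application instance in the test process stands in for a persistent Swoole worker):

```php
<?php
// Sketch of a functional test that genuinely demonstrates the leak (hypothetical
// route and service). The app instance persists for the whole test, which is
// exactly the situation a long-lived Swoole worker creates.

use Tests\TestCase;

final class CurrentUserCacheLeakTest extends TestCase
{
    public function testSingletonStateIsNotRetainedAfterARequest(): void
    {
        // A request that causes the (hypothetical) singleton to cache per-request data.
        $this->get('/profile?user=alice');

        // If the setup were truly shared-nothing, nothing would survive the request.
        // Under a persistent worker the same singleton instance still holds the data
        // from that request, so the next request would see it — and this fails.
        $leaked = app(\App\Services\CurrentUserCache::class)->get();
        $this->assertNull($leaked, 'per-request state survived past the end of the request');
    }
}
```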
I told it that the task was not finished, that I wanted an exhaustive list, and that this would necessarily involve writing and executing code within the framework context, and I again pointed it to examples of how to do this. I also flagged that its search for container-bound singletons with persistent state should have, at a minimum, caught the example of state leakage that I had given in the initial prompt.
Claude then proceeded to write a plausible-looking script to find examples of container-bound singletons, ran it, and saved the results to a CSV; wrote a different plausible-looking script to find examples of services with non-DI instance variables and saved those results to a second CSV; then used the csvjoin tool (which was cool, I didn’t realize that tool existed) to merge the two results into a third CSV. That third CSV was empty, and Claude again declared that there were in fact no instances of that pattern and that it had successfully completed the task.
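For the curious, the kind of script this calls for (combining container introspection with reflection) can be sketched in a few dozen lines. This is illustrative only, neither Claude’s code nor mine, and the heuristics are simplified:

```php
<?php
// Rough sketch: list container-bound singletons that carry instance state which
// was not injected through the constructor. Intended to run inside the framework
// context (e.g. from a throwaway artisan command); heuristics are illustrative.

$rows = [];

foreach (app()->getBindings() as $abstract => $binding) {
    if (empty($binding['shared'])) {
        continue; // only singletons can carry state across requests via the container
    }

    try {
        $instance = app()->make($abstract); // runtime info: what actually gets built
    } catch (\Throwable $e) {
        continue; // some bindings need request context to resolve; skip them here
    }

    if (!is_object($instance)) {
        continue;
    }

    $ref  = new ReflectionObject($instance);
    $ctor = $ref->getConstructor();
    $injected = $ctor ? array_map(fn ($p) => $p->getName(), $ctor->getParameters()) : [];

    foreach ($ref->getProperties() as $prop) {
        // Heuristic: a property with no matching constructor parameter is a
        // candidate for mutable, request-scoped state.
        if (!in_array($prop->getName(), $injected, true)) {
            $rows[] = [$abstract, get_class($instance), $prop->getName()];
        }
    }
}

foreach ($rows as $row) {
    fputcsv(STDOUT, $row); // one candidate per line: abstract, concrete class, property
}
```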
I mentioned, again, that it should at a minimum be turning up the example from the initial prompt. Claude went and spun its wheels for quite a while, and came back with several new plausible scripts (by this time it had dropped over 30 different scripts / md files / CSVs in the repository root, with no organizational schema to speak of). After quite a few more rounds of this (Claude would run the tools but not sanity-check the outputs, so I had to sanity-check them myself and point out which tool wasn’t working right, at which point Claude would write an entirely new version of that tool in the repository root which wrote to a different file), Claude finally had a script which produced a complete list of the few hundred places which could have this pattern, and which would need to be checked individually to see what was happening and whether there were any trends that would allow for pruning that list down further in a programmatic way. This was the point at which I decided to call it a day yesterday.
I might resume on Monday, but given how the beginning of the task went, I don’t have a lot of hope that Claude Code can finish the task at all.
To me the bottleneck mostly seemed to be that clopus trusted the tools it wrote way too much. When I write a tool like this, I tend to start with an output I know the tool will produce if it’s working correctly and a vague sense for how much output the tool will produce, write the first version of the tool, run it, and check the output to make sure it contains the thing I thought it should contain. Claude, on the other hand, seems to follow the algorithm “write tool. run tool. accept results of tool”.
Additionally, the product I want at the end of this is an exhaustive list of places to check, and a repeatable way to generate that list. Claude’s strength seems to be that it is unreasonably good at one-shotting things—once it has to consider the current outputs of a tool it wrote, the outputs it wants from that tool, and the way it needs to change that tool to get those new outputs, things seem to fall apart.
It looks like Ralph Wiggum is intended for a different use case (“Claude code is successfully completing tasks, but stopping before it finishes with all of them”).
I asked Claude Code to make a Slither Link puzzle game, providing it an extensive design doc about stuff like difficulty curve and UI but nothing about the basic game rules, and it failed to get puzzle generation to work (IIRC initially it wouldn’t even generate closed loops or something) and then got stuck trying to fix it, continually giving me new versions that never worked nor even looked like they got any closer to the goal. To be clear, that doesn’t mean that Claude Code can’t complete this particular task, it just means that I couldn’t get it to Just Work™.
I would strongly recommend asking Claude to set up a deterministic system to do this. Depending on precisely how flexible you need to be, Semgrep might be the correct tool for the job, and Claude should be able to configure that. It may be able to suggest other deterministic approaches using tools I don’t know about.
In my experience, Semgrep does not play well with trying to find cross-class behavior in dynamically typed codebases with lots of dependency injection, which is why I was trying to make Claude write some code which combined static analysis (in the form of reflection or AST parsing) with runtime logic for gathering information which is hard to determine statically but easy to determine at runtime.
For reference, the code I ended up writing for this part was about 40 lines; it wasn’t very complicated. Trying to do it in full generality purely by static analysis would be insanely complex (because PHP has terrible constructs like $$foo = "bar" and $foo->$bar = "baz", which this codebase doesn’t use and can be trivially verified not to use, but which would be a nightmare to handle if they were used), but fortunately full generality wasn’t what I needed.
But yeah, I also expected Claude to be able to do this trivially. It is able to trivially do most tasks which feel, to me, to be about this difficult or even a bit more difficult. This task felt like it should have been easier, since it’s one where there’s a lot of available signal to self-correct if you make a mistake, much more so than for many of the “build and test a feature” style tasks that Claude regularly does with no drama. Which is why I thought it would be a good example for a post along the lines of “many people use LLMs to quickly add sloppy features to their codebase, increasing technical debt, but it’s also possible to use them to resolve technical debt much faster than doing it by hand”. And then I tried it.