Follow-up to this post, wherein I was shocked to find that Claude Code failed to do a low-context task which took me 4 hours and involved some skills at which I expected it to have significant advantages[1].
I kept going to see if Claude Code could eventually succeed. What happened instead was that it built a very impressive-looking 4000 LOC system to extract type and dependency injection information for my entire codebase and dump it into a sqlite database.
To my shock, the tool Claude built[2] actually worked. I ended up playing with the system for two days, uncovering and ticketing all sorts of bugs in the codebase I hadn’t been aware of. And then I realized that the bugs I was uncovering weren’t the type I was actually looking for in the task I was immediately trying to do, and that if I wanted bugs to fix, we already have a backlog that we’re not going to get through any time soon no matter how much AI help we have.
So anyway, Claude was able to do a reasonable job of figuring out what endpoint sequences could cause an issue. It struggled to figure out how to invoke the framework to make a mock HTTP request[3], but once it had a template to work off of, it was able to make good progress.
On what I expected to be the hardest part of the task, Claude actually did quite well once it had a template to work off of. It was able (with some re-prompting necessary when it declared the task finished early) to write successfully failing tests for 5 of the 7 cases I had successfully written a test for, as well as one of the four “I am pretty sure this is an issue but I can’t figure out how to expose it” cases. I also learned a few tricks of my own in seeing how Claude tackled a couple of the cases.
All that said, I definitely came away from this experiment with a strong intuition for exactly how it could take 20% longer to do things when you have LLM coding agents assisting you.
[1] The specific task was programmatically checking an entire 1M+ LOC codebase for an HTTP API for patterns which could cause state leakage between requests if we made a specific change to the request handling infrastructure. The codebase in question makes heavy use of dependency injection and nonzero use of singletons, so the main information-gathering steps of the task were:
1. Identify all singletons.
2. Identify all dependencies injected into those singletons, recursively.
3. Identify which of those dependencies have mutable state associated with them, where that mutable state is not currently persisted from request to request.
4. Figure out which real endpoints exercise the code paths that write and read that mutable state in the singleton or one of its dependencies.
5. Figure out a sequence of endpoint calls which would behave differently if the singletons were torn down and rebuilt between calls versus if they weren’t.
6. Write a proof-of-concept test which runs that sequence with and without isolation, and demonstrates that either the user sees different information, or different information is persisted in the database.
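As a rough illustration of the first three steps, here is a minimal Python sketch of the graph walk involved. It is not the tool Claude built and not the real codebase: the `Service` record, the `has_mutable_state` flag, and the example service names are all hypothetical, and in practice those facts would have to come out of static analysis rather than a hand-written table.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical record for one container-managed class."""
    name: str
    singleton: bool = False
    has_mutable_state: bool = False  # assumed to be known from static analysis
    depends_on: list[str] = field(default_factory=list)


def transitive_dependencies(services: dict[str, Service], root: str) -> set[str]:
    """Return every service reachable from `root` through injected dependencies."""
    seen: set[str] = set()
    stack = list(services[root].depends_on)
    while stack:
        name = stack.pop()
        if name in seen:
            continue  # already visited; also guards against dependency cycles
        seen.add(name)
        stack.extend(services[name].depends_on)
    return seen


def leak_candidates(services: dict[str, Service]) -> dict[str, set[str]]:
    """Map each singleton to the stateful services it keeps alive (possibly itself)."""
    result: dict[str, set[str]] = {}
    for name, svc in services.items():
        if not svc.singleton:
            continue
        reachable = transitive_dependencies(services, name) | {name}
        stateful = {n for n in reachable if services[n].has_mutable_state}
        if stateful:
            result[name] = stateful
    return result


# Made-up example graph: Router is a singleton that indirectly holds a stateful cache.
SERVICES = {
    s.name: s
    for s in [
        Service("Router", singleton=True, depends_on=["AuthContext", "Logger"]),
        Service("AuthContext", depends_on=["UserCache"]),
        Service("UserCache", has_mutable_state=True),
        Service("Logger", singleton=True),
    ]
}

print(leak_candidates(SERVICES))
```

In this toy graph only `Router` is flagged, because it transitively holds `UserCache`. The hard part of the real task is steps 4 to 6, which this sketch does not attempt.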
[2] It turns out “Claude built” is somewhat of an overstatement: the core of the system was a thin wrapper over PHPStan, an existing static analysis tool. Basically, Claude wrote a custom rule to gather type information at every node, and a custom collector to write the gathered type information to its database. To Claude’s credit, I didn’t even know that was a supported use case for that tool, and it genuinely is good to know.
[3] Somewhat surprisingly, since there are lots of examples of each of the two main pieces of that puzzle online (though I don’t think anyone has consolidated them into one coherent place), and the framework code that does that is fairly clearly written with minimal magic.
Why is it so surprising? Although it has many issues, the METR 80% time horizon for Claude Opus 4.5 is 27 mins, with a 95% CI from 7 mins to 86 mins.
Couple reasons:
1. The METR time horizon is for fully autonomous execution of tasks. I’d expect giving the model hints when it gets stuck to help substantially with that, and for other tasks I do observe that this approach seems to work. But the one time I tried to actually measure and quantify it, this happened.
2. The actual part Claude got stuck on was the part which looked like a LeetCode medium problem with a slight twist, not the part that requires actually understanding the application-specific logic. If it had gotten stuck on “write regression tests” (as in fact it did once the initial hurdle was cleared), that would not have been surprising.
Like, it does make sense that “a 50% success rate at 4 hour tasks” looks like “approximately 100% success rate at most constituent 30 minute subtasks combined with occasional ~0% success rate at rare subtasks that usually don’t come up in a 4 hour task” rather than “a uniform 92% success rate at each 30 minute subtask” but it still feels a little jarring to experience.
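To make the arithmetic behind those two pictures concrete, here is a small Python sketch. It assumes a 4 hour task is exactly eight independent 30 minute subtasks; the `overall_success` helper and its parameters are illustrative names, not anything taken from METR's methodology.

```python
SUBTASKS = 8  # a 4-hour task split into 30-minute subtasks (assumption)

# Model A: every subtask succeeds independently with the same probability p,
# so overall success is p ** SUBTASKS. Solve p ** 8 == 0.5 for p.
uniform_rate = 0.5 ** (1 / SUBTASKS)


def overall_success(p_normal: float, p_task_has_blocker: float, n: int = SUBTASKS) -> float:
    """Model B: ordinary subtasks succeed with probability p_normal, but a task
    that contains a "blocker" subtask (one the model essentially cannot do) fails."""
    return (1 - p_task_has_blocker) * p_normal ** n


print(f"uniform per-subtask rate for 50% overall: {uniform_rate:.3f}")  # 0.917
print(f"blocker model, perfect otherwise: {overall_success(1.0, 0.5):.2f}")  # 0.50
```

Both models give the same 50% headline number, but they feel very different in practice: under the second one, a given task is either smooth sailing or stuck on one specific step.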
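“All that said, I definitely came away from this experiment with a strong intuition for exactly how it could take 20% longer to do things when you have LLM coding agents assisting you.”
Could you elaborate, or does it boil down to “Helping Claude would have taken 2 days, and doing it on your own would have been faster”? I would be keen for patterns that help me distinguish between: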
I am making good progress with Claude, and would be slower alone
Claude is slowing me down right now and I should pivot to doing the task myself
Neither of those. It’s “Claude generated an extremely shiny toy in the process of attempting to solve my problem. Playing with that toy felt like productive work, and so I spent a substantial amount of time playing with that toy and LARPing at being productive rather than doing what I was originally trying to do.”
Problem exists between keyboard and chair, as the saying goes.
I’ve had about 3000 sessions across Claude Code and Codex, and wanted to write about 8 of the more interesting stories from that experience, but sadly I’m probably not going to prioritize that anytime in the near future.