Are We In A Coding Overhang?
Andrej Karpathy posted 12 hours ago (emphasis mine):
I’ve never felt this much behind as a programmer. The profession is being dramatically refactored as the bits contributed by the programmer are increasingly sparse and between. I have a sense that I could be 10X more powerful if I just properly string together what has become available over the last ~year and a failure to claim the boost feels decidedly like skill issue. There’s a new programmable layer of abstraction to master (in addition to the usual layers below) involving agents, subagents, their prompts, contexts, memory, modes, permissions, tools, plugins, skills, hooks, MCP, LSP, slash commands, workflows, IDE integrations, and a need to build an all-encompassing mental model for strengths and pitfalls of fundamentally stochastic, fallible, unintelligible and changing entities suddenly intermingled with what used to be good old fashioned engineering. Clearly some powerful alien tool was handed around except it comes with no manual and everyone has to figure out how to hold it and operate it, while the resulting magnitude 9 earthquake is rocking the profession. Roll up your sleeves to not fall behind.
This seems to be a big update since his Dwarkesh episode published on Oct 17 (though I know these things can take a while to get edited, so the gap could be even bigger), where he said:
Overall, the models are not there. I feel like the industry is making too big of a jump and is trying to pretend like this is amazing, and it’s not. It’s slop. They’re not coming to terms with it, and maybe they’re trying to fundraise or something like that. I’m not sure what’s going on, but we’re at this intermediate stage. The models are amazing. They still need a lot of work. For now, autocomplete is my sweet spot. But sometimes, for some types of code, I will go to an LLM agent.
This is just me guessing, but Claude Opus 4.5 released just one month ago, and Opus 4.5 + Claude Code seems like the big shift for a lot of people.
In fact, Boris Cherny, creator of Claude Code, commented on Karpathy’s post saying (emphasis mine):
I feel this way most weeks tbh. Sometimes I start approaching a problem manually, and have to remind myself “claude can probably do this”. Recently we were debugging a memory leak in Claude Code, and I started approaching it the old fashioned way: connecting a profiler, using the app, pausing the profiler, manually looking through heap allocations. My coworker was looking at the same issue, and just asked Claude to make a heap dump, then read the dump to look for retained objects that probably shouldn’t be there; Claude 1-shotted it and put up a PR. The same thing happens most weeks. In a way, newer coworkers and even new grads that don’t make all sorts of assumptions about what the model can and can’t do — legacy memories formed when using old models — are able to use the model most effectively. It takes significant mental work to re-adjust to what the model can do every month or two, as models continue to become better and better at coding and engineering. The last month was my first month as an engineer that I didn’t open an IDE at all. Opus 4.5 wrote around 200 PRs, every single line. Software engineering is radically changing, and the hardest part even for early adopters and practitioners like us is to continue to re-adjust our expectations. And this is still just the beginning.
To be clear, a lot of these PRs might be “quite small, a few lines and bug fixes” (cf. this comment by another Anthropic employee). Boris had just asked users for feedback, then closed 19 PRs the next morning. Still, 200 PRs in a month without opening an IDE is something [1].
Update: Boris just made another post giving more details (emphasis mine):
[...]
In the last thirty days, I landed 259 PRs -- 497 commits, 40k lines added, 38k lines removed. Every single line was written by Claude Code + Opus 4.5. Claude consistently runs for minutes, hours, and days at a time (using Stop hooks). Software engineering is changing, and we are entering a new period in coding history. And we’re still just getting started.
AI Accelerating AI
It seems like we might be entering something like a self-improving feedback loop for the system “humans + AI”: employees at the labs are developing AI coding agents using these same AI coding agents, with the horizon length of these models increasing on a faster exponential than we thought (cf. Opus 4.5)[2], and potentially not even an exponential.
This isn’t AI autonomously improving itself, but the feedback loop between training better AI models and having these models accelerate the automation of AI R&D seems to be tightening [3] [4].
The “Coding Overhang”
In July 2020, after GPT-3, Andy Jones asked if we were in an AI Overhang, because (at the time) it felt like companies could just be scaling models like GPT-3 to many more orders of magnitude and get much more “intelligence”.
With coding agents and reasoning / test-time compute, it seems to me that what Karpathy (& Boris) are describing is some sort of “Coding Overhang”: people at the cutting edge, and especially members of technical staff, are trying to catch up with ~10x improvements that are purely a matter of user skill.
“I’ve actually been enjoying the last days of software development. There’s a touch of sadness, but there’s also something about knowing we’re near the end that makes it novel and sweet again.”—Moxie Marlinspike (creator of Signal)
In what worlds do we not get Superhuman Coders by the end of 2026?
- ^
Note: As the creator of Claude Code, Boris is obviously incentivized to promote it.
- ^
Update: after seeing Daniel Kokotajlo’s answer and thinking about it more, I should clarify that if things stay roughly on trend, even with a 5 months doubling time, we’d only get to 1-month tasks with 80%-reliability by Apr 2029. See full comment here.
- ^
See this older 2021 post I wrote about self-improving {humans + AI} systems, or this video explaining Tom Davidson’s full takeoff model for more intuitions.
- ^
Boris just wrote on X: “[...] In the last thirty days, 100% of my contributions to Claude Code were written by Claude Code”
A major plausible class of worlds in which we don’t get superhuman coders by end of 2026 is worlds where the METR trend continues at roughly the same or only slightly greater slope than the slope it had in 2025. Right?
Yes, but 2025 saw two trends, Claude 3.5 Sonnet → o3 and o3 → GPT-5.1-Codex-Max, with different doubling times. IIRC the earlier trend would cause superhuman coders to appear by 2028, while the later trend (which was arguably invalidated by Claude 4.5 Opus and its ~5h time horizon; see, however, two comments pointing out that the METR benchmark is no longer as trustworthy as it once was, and my potential explanation of the abnormally high 50%/80% time horizon ratio) had superhuman coders arrive in 2030, or outright hit a wall[1] before becoming superhuman.
As for the OP’s idea that coding agents being used to improve coding agents gets us to the SC: this could be unlikely, because they don’t improve the underlying LLM. I remember the now-obsolete benchmarks-and-gaps model, which required the SCs not just to saturate RE-bench but to actually learn to do long tasks and handle complex codebases, which in turn requires either a big attention span in the LLM itself or careful summarisation of each method’s specification, of formatting, of other methods’ names, etc.
P.S. The latter scenario would be particularly difficult to predict, as it might involve the time horizon in the METR sense behaving like $e^{ct}/(e^{ct_\infty}-e^{ct})$. In this case the horizon would grow ~exponentially until the very last couple of doublings.
Or become neuralese with consequences as disastrous as the lack of Safer-1 to test alignment.
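If the time-horizon curve in the P.S. above is read as $h(t) = e^{ct}/(e^{ct_\infty}-e^{ct})$ (my reconstruction; $c$ and $t_\infty$ are free parameters I set arbitrarily), a quick numeric check shows the claimed behaviour: near-constant doubling ratios early on, then a blow-up in the final stretch before $t_\infty$:

```python
import math

def horizon(t, c=1.0, t_inf=10.0):
    """Hypothetical time-horizon curve: ~e^{ct} early, diverging as t -> t_inf."""
    return math.exp(c * t) / (math.exp(c * t_inf) - math.exp(c * t))

# Early on, each unit step multiplies the horizon by ~e (exponential regime):
ratios = [horizon(t + 1) / horizon(t) for t in range(5)]

# Near t_inf, a smaller step (0.9 units) multiplies it far more:
late_ratio = horizon(9.9) / horizon(9.0)
```

For the early steps the ratio stays within a fraction of a percent of $e \approx 2.718$, while the 0.9-unit step from $t=9$ to $t=9.9$ multiplies the horizon by roughly 16x, which is the "looks exponential until the very last couple of doublings" shape the comment describes.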
When looking for trend breaks in time series, it’s unwise to rely on eyeballing when the Quandt likelihood ratio test, aka sup-Wald test, has existed for 65 years (google it or ask an LLM to explain it in layman’s terms).
I pulled the METR data and asked Gemini 3 Flash to vibecode the test, and there is a statistically significant break (peak F-statistic = 7.79, corresponding to a p-value of about 0.03) at Claude 3.5 Sonnet, from an ~8-month to a ~5-month doubling time, but not after it.
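For readers who want the mechanics: the sup-Wald (QLR) procedure fits the regression with and without a break at every interior candidate date and takes the maximum Chow F-statistic. A minimal sketch on synthetic data (not METR’s actual data; the break location and slopes below are made up to mimic a doubling-time change):

```python
import numpy as np

def chow_f(y, x, k):
    """Chow F-statistic for a break before index k in the regression y ~ x."""
    def ssr(xs, ys):
        X = np.column_stack([np.ones_like(xs), xs])
        beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
        resid = ys - X @ beta
        return resid @ resid

    ssr_pooled = ssr(x, y)
    ssr_split = ssr(x[:k], y[:k]) + ssr(x[k:], y[k:])
    p = 2  # parameters per regime: intercept and slope
    n = len(y)
    return ((ssr_pooled - ssr_split) / p) / (ssr_split / (n - 2 * p))

def sup_wald(y, x, trim=0.15):
    """QLR statistic: max Chow F over interior candidate break points."""
    n = len(y)
    lo, hi = int(n * trim), int(n * (1 - trim))
    return max((chow_f(y, x, k), k) for k in range(lo, hi))

# Synthetic log2(time horizon) series: doubling time shortens at index 30
# (slope 1/8 per month -> 1/5 per month), plus observation noise.
rng = np.random.default_rng(0)
x = np.arange(60.0)
y = np.where(x < 30, x / 8, 30 / 8 + (x - 30) / 5) + rng.normal(0, 0.1, 60)

f_peak, k_hat = sup_wald(y, x)  # peak F and estimated break index
```

One caveat the comment’s p ≈ 0.03 implicitly handles: because the break date is searched over, the sup-Wald statistic does not follow a standard F distribution, so significance must come from Andrews-style critical values rather than an ordinary F table.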
Using @ryan_greenblatt’s updated 5-month doubling time: we reach the 1-month horizon from AI 2027 in ~5 doublings (Jan 2028) at 50% reliability, and ~8 doublings (Apr 2029) at 80% reliability. If I understand correctly, your model uses 80% reliability while also requiring 30x cheaper and faster than humans. It does seem like if the trend holds, by mid-2029 the models wouldn’t be much more expensive or slower. But I agree that if a lab tried to demonstrate “superhuman coder” on METR by the end of next year using expensive scaffolding / test-time compute (similar to o1 on ARC-AGI last year), it would probably exceed 30x human-cost, even if already 30x faster.
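The doubling arithmetic above is easy to check directly. Assuming (my numbers, roughly Opus 4.5-era, late 2025) a ~167-hour work-month, a ~5.2h current horizon at 50% reliability, and a ~40-minute current horizon at 80% reliability:

```python
import math

WORK_MONTH_HOURS = 167  # assumed: ~one month of human working hours
DOUBLING_MONTHS = 5     # assumed: ~5-month doubling time

def months_to_reach(target_h, current_h, doubling_months=DOUBLING_MONTHS):
    """Calendar months for the time horizon to grow from current_h to target_h."""
    doublings = math.log2(target_h / current_h)
    return doublings * doubling_months

m50 = months_to_reach(WORK_MONTH_HOURS, 5.2)    # ~5 doublings -> ~25 months
m80 = months_to_reach(WORK_MONTH_HOURS, 0.65)   # ~8 doublings -> ~40 months
```

From late 2025, ~25 months lands around Jan 2028 and ~40 months around Apr 2029, matching the dates in the comment; the conclusion is only as good as the assumed starting horizons and the 5-month doubling time.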
The thing METR is measuring seems slightly different than “superhuman coder”. My understanding is that they’re dropping an AI into an unfamiliar codebase and telling it to do something with no context or design help, so this is partially software architecture partially coding. On pure coding tasks, Claude Code is clearly superhuman already.
I spent a few hours over the last few days collaborating with Claude on design docs and some general instructions, then having it go through massive todo lists fully autonomously[1]. This is weeks of coding and it did it in a few hours (mostly slowed down by me getting around to giving it more work).
This is the first time I’ve had it do tasks of this scale so I’m not doing anything special, just having it propose a design, telling it which parts I want done differently, then having it make a todo list and execute it.
Example prompt:
The coding capabilities of LLMs are extremely impressive, but there are still 2 core capabilities missing for “superhuman” coders to arrive, which I talked about in this post.
In short, they are:
Continuous learning, or context windows long enough to trivially hold an entire codebase.
General computer use that allows LLMs to operate on the entire tech stack, or otherwise refactoring all stacks to use tools that LLMs are comfortable with.
As of now, humans are required to supply context and glue things together. That won’t change if 2026 produces nothing but iterative improvements on general coding ability.
If an OpenAI co-founder and the creator of Claude Code are resorting to Claude 4.5 Opus instead of coding themselves, it seems quite clear that we already have superhuman coders.
Using AI can help you get things done faster, even if it’s worse than you at coding
Isn’t getting working/production-ready code done faster the definition of being better than you at coding? It’s possible the creator of Claude Code is incorrect about this and he would be more productive long-term writing this code himself, or the code is actually unacceptable in ways that he hasn’t noticed yet, but if he’s correct that it’s more productive to have Claude write it, then Claude is better at coding than him.
If I use my IDE’s LSP functions to do a large automated refactor, is the IDE better than me at coding?
There are many more elements to “coding” than “writing code”, namely software design. As a software engineer I use Claude Code daily (I write maybe 1% of my total LOC by hand these days), but I still have to steer it. I have to tell it which architecture to use, which abstractions, correct it when it tries to use a shortcut instead of solving a problem at the root, etc.
When it can produce PRs which would pass code review on a competent software engineering team without that steering, we will have a superhuman coder.
I was thinking of “coder” as specifically the job of writing code, which I assume is what the Claude Code guy meant too. AI is clearly not reliable at system design yet.
System design is one part of designing software, but isn’t so much what I’m trying to point at here.
Claude Opus 4.5 still can’t produce or follow a simple plan to implement a feature on a mid-sized codebase independently.
As an example: earlier today I was implementing the feature of resuming a session when a client reconnects to a server after losing connection. One small part of this task is re-syncing the state once the current (server-side) task has finished.
Claude Code was not capable of designing a functioning solution to this problem in its planning mode (it kept trying to sync the state immediately upon connecting, leading to the client missing the result of the in-progress task).
The solution I chose for this specific instance of the problem was to add a state sync command to the server’s command queue for that session when a client reconnects. Claude Code updated the plan to show the exact code changes required (correctly).
However, when implementing the plan, it forgot to actually make the relevant change to add the command to the queue. End-to-end tests caught this, and Claude’s solution was to automatically do a state sync after every task. It did not implement what was written in the plan. I gave it a nudge to re-read the plan, which was enough to make it see the mistake and correct it.
Compared to if I had asked a human co-worker to make the same change, the difference is stark. We are still a way off from superhuman coders.
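The fix described above (enqueue the state sync on the session’s command queue so it runs after the in-progress task, rather than syncing immediately on reconnect) can be sketched as follows. All names here are hypothetical, not the commenter’s actual code:

```python
from collections import deque

class Session:
    """Server-side session with a per-session command queue (illustrative)."""

    def __init__(self):
        self.queue = deque()
        self.state = {}

    def on_reconnect(self, client):
        # Syncing immediately here (client.send(self.state)) would race with
        # the in-progress task: the client would miss that task's result.
        # Instead, enqueue the sync so it runs after whatever is in flight.
        self.queue.append(lambda: client.send(dict(self.state)))

    def run_task(self, task):
        task(self.state)           # the in-progress server-side task finishes
        while self.queue:          # then queued commands (incl. the sync) run
            self.queue.popleft()()

class FakeClient:
    """Test double that records what the server sends."""
    def __init__(self):
        self.received = []
    def send(self, snapshot):
        self.received.append(snapshot)
```

A quick check of the ordering guarantee: if a client reconnects while a task is still running, the snapshot it eventually receives includes that task’s result, which is exactly the property Claude’s “sync on connect” plan would have violated.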
Not necessarily. You can have a superhuman centaur (AI+Human) system without the AI alone being superhuman.
I do think we are in a coding overhang.
Current harnesses seem far from the ceiling and could be improved a lot. One example: you can significantly boost output quality by using simple tricks, like telling Claude Code to implement something and then explicitly asking it to self-review. You could get slightly more creative—say, ask to implement something security-sensitive, then ask to break it.
And I feel the same way Andrej does. Even though I consider myself relatively adept at using AI agents, I feel like I am doing a quite horrible job. In principle, these tools could be utilized much more efficiently.
But perhaps an even larger bottleneck is at the organizational level. Existing standard workflows weren’t designed for a world with AI agents, and it takes a long time to invent and implement new ones.
Also, Claude 3.7 was released just 10 months ago, and it was barely usable; the difference between 3.7 Sonnet and 4.5 Opus feels comparable (even if less dramatic) to the difference between GPT-3 and GPT-4.