Using @ryan_greenblatt’s updated 5-month doubling time: we reach the 1-month horizon from AI 2027 in ~5 doublings (Jan 2028) at 50% reliability, and in ~8 doublings (Apr 2029) at 80% reliability. If I understand correctly, your model uses 80% reliability while also requiring the AI to be 30x cheaper and faster than humans. It does seem like if the trend holds, by mid-2029 the models wouldn’t be much more expensive or slower. But I agree that if a lab tried to demonstrate “superhuman coder” on METR by the end of next year using expensive scaffolding / test-time compute (similar to o3 on ARC-AGI last year), it would probably exceed 30x human cost, even if it were already 30x faster.
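For concreteness, here is a minimal sketch of the extrapolation arithmetic in Python. The 5-month doubling time and the two endpoints come from the paragraph above; the starting date (Dec 2025), the current ~6-hour 50%-reliability horizon, and the 167-hour work-month are my assumptions, chosen to reproduce the stated dates:

```python
import math
from datetime import date

DOUBLING_MONTHS = 5        # @ryan_greenblatt's updated doubling time
WORK_MONTH_HOURS = 167     # ~one month of full-time work (assumption)
CURRENT_HORIZON_HOURS = 6  # assumed current 50%-reliability horizon
START = date(2025, 12, 1)  # assumed "today" for the extrapolation

def add_months(d: date, months: int) -> date:
    """Shift a date forward by a whole number of months."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)

# Doublings needed to stretch the current horizon to one work-month.
doublings_50 = math.ceil(math.log2(WORK_MONTH_HOURS / CURRENT_HORIZON_HOURS))
eta_50 = add_months(START, doublings_50 * DOUBLING_MONTHS)

# The 80%-reliability horizon trails the 50% one; the comment implies
# ~3 extra doublings (Jan 2028 -> Apr 2029 is 15 months).
doublings_80 = doublings_50 + 3
eta_80 = add_months(START, doublings_80 * DOUBLING_MONTHS)

print(f"50%: ~{doublings_50} doublings -> {eta_50:%b %Y}")  # ~5 -> Jan 2028
print(f"80%: ~{doublings_80} doublings -> {eta_80:%b %Y}")  # ~8 -> Apr 2029
```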
What METR is measuring seems slightly different from “superhuman coder”. My understanding is that they’re dropping an AI into an unfamiliar codebase and telling it to do something with no context or design help, so the benchmark is part software architecture, part coding. On pure coding tasks, Claude Code is clearly superhuman already.
I spent a few hours over the last few days collaborating with Claude on design docs and some general instructions, then having it work through massive todo lists fully autonomously[1]. This is weeks of coding, and it was done in a few hours (slowed down mostly by me getting around to giving it more work).
This is the first time I’ve had it do tasks at this scale, so I’m not doing anything special: I just have it propose a design, tell it which parts I want done differently, then have it make a todo list and execute it.
Example prompt:

> Can you go through @TODO.md, delegating each task to opus subagents and ensuring that they understand all of the necessary context and implement the task, check it off, and commit it, then move onto the next task until the list is done?
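For reference, a TODO.md like the one the prompt refers to is just a markdown checklist that the agent checks off as it commits; a hypothetical sketch (these task names are invented, not from my actual list):

```markdown
# TODO

- [x] Add a config loader with schema validation
- [ ] Migrate the storage layer from JSON files to SQLite
- [ ] Add integration tests for the import pipeline
```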