Can you go through @TODO.md, delegating each task to opus subagents and ensuring that they understand all of the necessary context and implement the task, check it off, and commit it, then move onto the next task until the list is done?
The thing METR is measuring seems slightly different from “superhuman coder”. My understanding is that they’re dropping an AI into an unfamiliar codebase and telling it to do something with no context or design help, so this is partly software architecture and partly coding. On pure coding tasks, Claude Code is clearly superhuman already.
I spent a few hours over the last few days collaborating with Claude on design docs and some general instructions, then having it go through massive todo lists fully autonomously[1]. This is weeks’ worth of coding, and it did it in a few hours (slowed mostly by how long I took to give it more work).
This is the first time I’ve had it do tasks of this scale, so I’m not doing anything special: I have it propose a design, tell it which parts I want done differently, then have it make a todo list and execute it.
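For anyone who wants the loop spelled out, here is a minimal Python sketch of the "take a task, do it, check it off, commit, move on" cycle described above. It is not how Claude Code works internally. The checkbox format (`- [ ]` / `- [x]`) is an assumption about what the todo list looks like, and `delegate` and `commit` are hypothetical callbacks standing in for "hand the task to a subagent" and "make a git commit".

```python
import re
from typing import Callable

# Assumed format: Markdown checkboxes such as "- [ ] write the parser".
UNCHECKED = re.compile(r"^(\s*[-*] )\[ \] (.+)$")


def run_todo_list(
    todo_text: str,
    delegate: Callable[[str], bool],  # hypothetical: returns True if the task succeeded
    commit: Callable[[str], None],    # hypothetical: records one commit per finished task
) -> str:
    """Work through unchecked items in order and return the updated list text."""
    lines = todo_text.splitlines()
    for i, line in enumerate(lines):
        match = UNCHECKED.match(line)
        if match is None:
            continue  # already checked off, or not a task line at all
        prefix, task = match.groups()
        if not delegate(task):
            break  # stop at the first failure rather than skipping ahead
        lines[i] = f"{prefix}[x] {task}"  # check the item off
        commit(f"Complete: {task}")       # one commit per completed task
    return "\n".join(lines)
```

Stopping at the first failed task is a design choice in this sketch, on the assumption that later items may depend on earlier ones. A real run would also need to pass each subagent the design doc and other context, which this leaves out.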
Example prompt: