Excellent comment, thank you! I’m actually inclined to agree with you, maybe we should edit the starting level of programming ability to be more in the amateur range than the professional range. Important clarification though: The current AI-2027 stats say that it’s at the bottom of the professional range in mid-2025. Which IIUC means it’s like a bad human professional coder—someone who does make a living coding, but who is actually below average. Also, it’s not yet mid-2025, we’ll see what the summer will bring.
I do agree with you though that it’s not clear it even qualifies as a bad professional. It seems like it’ll probably be worse at longer-horizon tasks than a bad professional, but maybe better at short-horizon coding tasks?
I don’t buy your arguments that we aren’t seeing improvement on “~1-hour human tasks.” Even the graph you cite shows improvement (albeit a regression with Sonnet 3.7 in particular).
I do like your point about the baseliners being nerfed and much worse than repo maintainers though. That is causing me to put less weight on the METR benchmark in particular. Have you heard of https://openai.com/index/paperbench/ and https://github.com/METR/RE-Bench ? They seem like they have some genuine multi-hour agentic coding tasks, I’m curious if you agree.
Seconding Daniel, thanks for the comment! I decided to adjust the early numbers down to be below the human professional range until Dec 2025[1], due to agreeing with the considerations you raised about longer-horizon tasks, which should be included in how these ranges are defined.
Note that these are based on internal capabilities, so that translates to the best public models reaching the low human range in early-mid 2026.
Sweet! Thanks for taking my points into consideration! :)
I’ll take a look. Thanks for sharing.