Wonderful post! I appreciate the effort to create plenty of quantitative hooks for the reader to grab onto and wrestle with.
I’m struggling with the initial claim that we have agentic programmers that are as good as human pros. The piece suggests that we already have Human Pro level agents right now!?
Maybe I’m missing something that exists internally at the AI labs? But we don’t have access to such programmers. So far, all “agentic” programmers are effectively useless at real work. They struggle with real codebases, and novel or newer APIs and languages. Most programmers I know use LLMs as a sophisticated autocomplete and as a “search” tool. These offer nice speed-ups, greasing the wheels of programming. But I have not seen anything genuinely useful that is more agentic. No one is using Devin, lol. Note: I use Cursor and Claude constantly. I’m not some anti-AI programmer.
Concerns with the METR paper (Measuring AI Ability to Complete Long Tasks)
Most of the agentic programmer claims seem to be derived from the METR paper (Measuring AI Ability to Complete Long Tasks). After digging into the paper, the code, and the supporting paper (HCAST), I have some concerns.
1. The METR paper compares the AI against what seem like pretty weak contractors (“baseliners”). I’m shocked that a task that takes someone 5 minutes would take a contractor 16x longer. This makes me think the human “baseline” the AI is compared against is roughly 5-10x slower than an actual professional. A 5-minute task is so incredibly small that in the real working world, we wouldn’t even waste breath talking about it.
See Table 6 in the METR paper.
2. It’s unclear how nerfed the “baseliners” are compared to programmers in real life (see section 5.2 of the HCAST paper). It’s unclear to me if the baseliners can use an IDE (like a real programmer would use). Does the sign-in thing mean that the baseliners can’t use Reddit, GitHub, Stack Overflow, Kagi, internal notes like Obsidian, or Notion? What if the baseliners have adjusted their workflows around LLMs, and now that they can’t use them, their “old skills” have atrophied, making the comparison to real work unfair?
3. We see effectively no improvement pattern on the baseliner 1-hour tasks (see Figure 9 in the METR paper).
Pulling the above three points together: I think the “~1 hour human tasks” in the paper are actually closer to 10-15 minute tasks for an actual professional with their full setup, IDE, and other tools. And, as outlined above, we have not seen LLM improvement on these sorts of tasks. This aligns with my experience with the agentic coding tools on the market, like Devin, which can solve very small problems, but most of the time it’s quicker and easier to just do the work yourself. No one makes a Linear or Jira ticket for a 15-minute unit of work.
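To make that concrete, here’s a rough back-of-the-envelope sketch. The 5-10x slowdown factor is my own guess from point 1, not a number from the METR paper, so treat the outputs as illustrative only:

```python
# Sketch: convert a baseliner-measured task length into "professional-equivalent"
# minutes, under my assumed 5-10x slowdown of contractor baseliners relative to
# a working professional with their full setup.

def pro_equivalent_minutes(baseliner_hours: float, slowdown: float) -> float:
    """Professional-minutes implied by a baseliner-hours measurement."""
    return baseliner_hours * 60 / slowdown

for slowdown in (5, 10):
    minutes = pro_equivalent_minutes(1.0, slowdown)
    print(f"1 baseliner-hour at {slowdown}x slowdown = {minutes:.0f} pro-minutes")

# 1 baseliner-hour at 5x slowdown = 12 pro-minutes
# 1 baseliner-hour at 10x slowdown = 6 pro-minutes
```

The lower end of that slowdown guess is where my ~10-15 minute figure above comes from.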
How does this impact the whole piece?
I think the kind of meaningful, creative problem-solving and engineering that would let a company recursively self-improve feels further away than AI 2027 claims. Additionally, I’m nervous about line-fitting and extrapolating given the minor agentic improvements in recent models. Right now, I’m leaning toward Yann LeCun’s take that LLMs are inherently limited by a weakness in stable reasoning over long time horizons (lots of autoregressive tokens).
Happy to debate or discuss any of this. I’d also be down to do some Aug-2025 agentic bets if anyone has some cooked up? 😈
Excellent comment, thank you! I’m actually inclined to agree with you; maybe we should edit the starting level of programming ability to be more in the amateur range than the professional range. Important clarification though: the current AI-2027 stats say that it’s at the bottom of the professional range in mid-2025, which IIUC means it’s like a bad human professional coder: someone who does make a living coding, but who is actually below average. Also, it’s not yet mid-2025; we’ll see what the summer will bring.
I do agree with you though that it’s not clear it even qualifies as a bad professional. It seems like it’ll probably be worse at longer-horizon tasks than a bad professional, but maybe better at short-horizon coding tasks?
I don’t buy your arguments that we aren’t seeing improvement on “~1-hour human tasks.” Even the graph you cite shows improvement (albeit a regression with Sonnet 3.7 in particular).
I do like your point about the baseliners being nerfed and much worse than repo maintainers though. That is causing me to put less weight on the METR benchmark in particular. Have you heard of https://openai.com/index/paperbench/ and https://github.com/METR/RE-Bench ? They seem like they have some genuine multi-hour agentic coding tasks, I’m curious if you agree.
Seconding Daniel, thanks for the comment! I decided to adjust down the early numbers to be below the human professional range until Dec 2025[1] due to agreeing with the considerations you raised about longer-horizon tasks, which should be included in how these ranges are defined.
Note that these are based on internal capabilities, so that translates to the best public models reaching the low human range in early-mid 2026.
Sweet! Thanks for taking my points into consideration! :)
I’ll take a look. Thanks for sharing.
Great comment.
It’s unclear to me if the baseliners can use an IDE (like a real programmer would use). Does the sign-in thing mean that the baseliners can’t use Reddit, GitHub, Stack Overflow, Kagi, internal notes like Obsidian, or Notion?
In the HCAST paper’s Appendix C.1, they link to their instructions doc for baseliners, which answers both of these questions. Quoting from the doc:
You will SSH into our server to work on the task, but you are allowed to use any tools you want compatible with this workflow, excluding copilot and any other AI tools, Wolfram Alpha, and online services that require sign-up. (Google does not count as an AI tool, ChatGPT does.) You can always use the internet to search for information (e.g. StackOverflow), even if the task instructions specifically say that internet usage is not allowed. [...]
You can connect your IDE to the task environment using the same SSH connection string. Here are docs about how to do this for VSCode (remember to ‘add new host’ rather than ‘connect to host’. and paste the entire ssh connection string, including ssh -J [...]) or PyCharm. Unfortunately it’s not terribly unusual for a connection to take ~20 minutes the first time (although the typical case is smaller).
[from the “Questions or issues” tab]
Can I use [software X]?
Tools that are compatible with your usual workflow and our set-up (e.g. VSCode extensions) are fine, tools that solve the task for you are not fine. So linters are good, Copilot bad.
The “20 minutes to connect an IDE” thing sounded worrying to me at first glance, but FWIW the paper claims that setup time was a non-issue in practice:
It is possible that ancillary technical issues (e.g. difficulties with setup) could consume a significant fraction of baseline time. In practice, we observe minimal such issues with technical set-up; the issues affecting clock times that do persist are concentrated in qualification tasks, in which human baseliners are interacting with our set-up for the first time. In 19 sampled instances of debug small libs qualification tasks, baseliners spent a mean of 9 minutes and median of 6 minutes on setup issues, relative to average total task time of 1.2 hours.
I’m making this comment mostly to point out the info above, but I also wanted to say that I agree with you about agentic coding, and I especially agree with @Michael Swift’s remark about “engineering taste.”
I’ve actually been getting a lot of value out of Cursor w/ 3.7 Sonnet lately, but I think largely due to the task I’m applying it to, which is effectively the best-case scenario for this kind of tool:
It’s frontend development...
...which is not my usual area, and which I am not very competent at on my own
...which is also, I hear, a strong point for most coding LLMs
It’s work on an internal-facing prototype which even internal users don’t see unless they toggle a setting manually.
So it’s low-risk, it doesn’t matter if the UI doesn’t follow brand conventions, etc.
Also, the requirements are unusually flexible and self-determined. I’m often free to just give up on something if both Claude and I are having a lot of trouble accomplishing it.
Under these conditions, it really does give me a large boost in the short term. (I say “in the short term” because I’m probably learning less in the process than I would otherwise. As others have observed before, the implications for junior developer hiring and the overall skills acquisition pipeline are… concerning.)
However, even in this context (and even in a largely unfamiliar codebase and programming language), the lack of “engineering taste” is evident to me and sometimes becomes a bottleneck. The tool does tend to write code that works in the sense of passing tests or meeting requirements, but it often
varies its design choices whimsically across successive requests (even in the same chat)
reinvents what it needs from scratch rather than reusing existing mechanisms (even mechanisms that it added itself in an earlier chat turn)
fails to refactor or delete old code that has been obviated by its recently applied changes
“uses a hammer to swat a fly,” writing elaborate and defensive code to perform very simple operations, with e.g. lots of guards against implausible or impossible edge cases
writes code in a standard or “textbook” style rather than adopting the house style of the project, even when it has been explicitly told to do the latter[1]
and other stuff along similar lines.
It’s conceivable that better prompting could resolve this (maybe ask for a high-level design first?). But if so, I’m confused why existing scaffolds don’t inject this stuff by default (and why the LLMs even need the instructions, rather than just doing this stuff by default on the basis of some very general idea like “I’m supposed to write good code”).
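To make “inject this stuff by default” concrete, here’s the kind of addendum I’m imagining a scaffold could prepend. The wording is entirely hypothetical and not taken from any real product’s prompt:

```python
# Hypothetical sketch of a scaffold injecting a design-first, conventions-first
# addendum into its system prompt by default. This is not real Cursor or Claude
# Code behavior; it's just what I mean by "inject this stuff by default".

DESIGN_FIRST_ADDENDUM = """
Before writing any code:
1. Skim the surrounding files and summarize the project's existing conventions.
2. Propose a short high-level design and wait for approval before implementing.
3. Reuse existing helpers and mechanisms rather than re-implementing them.
4. After the change, list any code that is now dead and should be deleted.
"""

def build_system_prompt(base_prompt: str) -> str:
    """Append the hypothetical design-first addendum to the tool's base prompt."""
    return base_prompt.rstrip() + "\n\n" + DESIGN_FIRST_ADDENDUM
```

Whether an addendum like this would actually fix the behavior is exactly the thing I’m confused about.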
The one time I tried Claude Code, I showed it a complex backend service, told it that certain routes were slow, and asked it to find and fix the bottlenecks. (This was probably way too ambitious, but I had read stuff like this and came in with high expectations.) It proposed a few database migrations, all of which attempted to add things that already existed. This was so cringe that I never used the tool again. But I’d happily give it a second try if someone showed me a clear demonstration of a case in which it was uniquely useful.
Note that Cursor always explicitly tells it not to do this, via the following section of its system prompt:
# Following conventions
When making changes to files, first understand the file's code conventions. Mimic code style, use existing libraries and utilities, and follow existing patterns.
- NEVER assume that a given library is available, even if it is well known. Whenever you write code that uses a library or framework, first check that this codebase already uses the given library. For example, you might look at neighboring files, or check the package.json (or cargo.toml, and so on depending on the language).
- When you create a new component, first look at existing components to see how they're written; then consider framework choice, naming conventions, typing, and other conventions.
- When you edit a piece of code, first look at the code's surrounding context (especially its imports) to understand the code's choice of frameworks and libraries. Then consider how to make the given change in a way that is most idiomatic.
Thanks! Appreciate you digging that up :). Happy to conclude that my second point is likely moot.
Really nice comment that I also happen to agree with. As a programmer working with Claude Code and Cursor every day, I have yet to see AI systems achieve “engineering taste”, which seems far easier than the “research taste” discussed by the OPs. In my experience, these systems cannot perform medium-term planning and execution of tasks, even those that are clearly within distribution.
Perhaps the apparent limitations relate to the independent probability of things going wrong when you aren’t maintaining a consistent world-model or doing in-line learning and feedback.
For example, even if 0.90 of your actions are correct, if each one can independently tank your task then your probability of success after 6 actions is a coin flip. I feel you can see the contours of this effect in CPP (Claude Plays Pokémon). So while I like METR’s proposed metrics in task-space, the “scaling curve” they are showing may not hold. This is because the tasks that populate the y-axis are things that are in principle “one-shottable” by this set of model architectures, and thus don’t fall prey to problems with independent errors. This all leads me to believe that the “research taste” discussed by the OP is a lot further off, ultimately pushing take-off scenarios back.
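Here’s that arithmetic as a minimal sketch, under the (over-simple) assumption that each action succeeds independently with the same probability and that any single failure sinks the whole task:

```python
import math

# Toy model: per-action success probability p, n independent actions,
# and any single failure tanks the task.

def task_success(p: float, n: int) -> float:
    return p ** n

print(task_success(0.90, 6))      # ~0.53 -- roughly a coin flip after 6 actions

# How many actions before the success probability drops below 50%?
def actions_to_coinflip(p: float) -> float:
    return math.log(0.5) / math.log(p)

print(actions_to_coinflip(0.90))  # ~6.6 actions
print(actions_to_coinflip(0.99))  # ~69 actions
```

The 0.90 case is the coin-flip-after-six-actions figure above; the 0.99 case shows how sensitive the effective horizon is to per-step reliability in this toy model.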
My best guess is you are a few months behind in your takes? The latest generation of thinking models can definitely do agentic frontend development and build small projects autonomously. They still make errors and have blind spots that require human supervision, but in terms of skill level, the systems feel comparable to, and usually better than, junior programmers (though when they fail, they fail hard, and worse than real junior programmers).
The latest generation of thinking models can definitely do agentic frontend development
But does that imply that they’re general-purpose competent agentic programmers? The answers here didn’t seem consistent with that. Does your experience significantly diverge from that?
My current model is that it’s the standard “jagged capabilities frontier” on a per-task basis, where LLMs are good at some sufficiently “templated” projects, and then they fall apart on everything else. Their proficiency at frontend development is then mostly a sign of frontend code being relatively standardized[1]; not of them being sufficiently agent-y.
I guess quantifying it as “20% of the way from an amateur to a human pro” isn’t necessarily incorrect, depending on how you operationalize this number. But I think it’s also arguable that they haven’t actually 100%’d even amateur general-coding performance yet.
I. e., that most real-world frontend projects have incredibly low description length if expressed in the dictionary of some “frontend templates”, with this dictionary comprehensively represented in LLMs’ training sets.
(To clarify: These projects’ Kolmogorov complexity can still be high, but their cross-entropy relative to said dictionary is low.
Importantly, the cross-entropy relative to a given competent programmer’s “template-dictionary” can also be high, creating the somewhat-deceptive impression of LLMs being able to handle complex projects. But that apparent capability would then fail to generalize to domains in which real-world tasks aren’t short sentences in some pretrained dictionary. And I think we are observing that with e. g. nontrivial backend coding?)
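A toy calculation of what I mean, with all the template names and probabilities invented purely for illustration:

```python
import math

# Description length of a project relative to a "template dictionary",
# measured as -sum(log2 p(token)). All tokens and probabilities here are
# made up; the point is only the shape of the comparison.

def description_length_bits(tokens, dictionary, floor=1e-6):
    # anything outside the dictionary gets a tiny floor probability
    return sum(-math.log2(dictionary.get(t, floor)) for t in tokens)

frontend_templates = {
    "crud_dashboard": 0.05, "auth_form": 0.05, "settings_page": 0.05,
    "chart_widget": 0.03, "rest_api_wrapper": 0.02,
}

frontend_project = ["crud_dashboard", "auth_form", "chart_widget"]
bespoke_backend = ["custom_sharded_cache", "legacy_billing_migration",
                   "rest_api_wrapper"]

print(description_length_bits(frontend_project, frontend_templates))  # ~14 bits
print(description_length_bits(bespoke_backend, frontend_templates))   # ~45 bits
```

The frontend project expands into plenty of code (its Kolmogorov complexity can be high), but it’s a cheap “sentence” in the dictionary; the backend task is expensive because most of its tokens aren’t in the dictionary at all.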
Typically I operationalize “employable as a software engineer” as being capable of completing tasks like:
“Fix this error we’re getting on BetterStack.”
“Move our Redis cache from DigitalOcean to AWS.”
“Add and implement a cancellation feature for ZeroPath scans.”
“Add the results of this evaluation to our internal benchmark.”
These are pretty representative examples of the kinds of tasks your median software engineer will be getting and resolving on a day to day basis.
No chatbot or chatbot wrapper can complete tasks like these for an engineering team at present, incl. Devin et al. Partly this is because most software engineering work is very high-context, in the sense that implementing the proper solution depends on understanding a large body of existing infrastructure, business knowledge, and code.
When people talk about models today doing “agentic development”, they’re usually describing the models’ ability to complete small projects in low-context situations, where all you need to understand is the prompt itself and software engineering as a discipline. That makes sense, because if you ask AIs to write (for example) a Pong game in JavaScript, the AI can complete each of the pieces in one pass, and fit everything it’s doing into one context window. But that kind of task is unlike the vast majority of things employed software engineers do today, which is why we’re not experiencing an intelligence explosion right this second.
If that was the case, wouldn’t Scott and Daniel develop the impressive AI-2027 website themselves with the help of AI Agents, instead of utilising your human webdev skills? /jk :D
No, I think it’s a fair question. Show me a non-trivial project coded end-to-end by an AI agent, and I’ll believe these claims.
Interesting. I get where you’re coming from for blank-slate things or front end. But programming is rarely a blank slate like this. You have to work with existing codebases or esoteric libraries. Even with the context loaded (as well as I can), Cursor with Composer and Claude Code (the CLI tool) with Sonnet 3.7 have failed pretty miserably for me on simple work-related tasks. As things stand, I always regret using them and wish I had written the code myself. Maybe this is a context problem that gets solved when the models grow to use proper attention across the whole context window rather than shortcuts or hacks due to memory limitations (this will get better as the newer NVIDIA chips land).
I would love to see an LLM fix a meaningful issue in an open source library.