I think there is nuance about the downlift study that would be helpful to highlight:
1. Many participants used Sonnet 3.7 in Cursor for the first time (chat vs. agent usage is a different skillset).
2. Sonnet 3.7 was notoriously bad in Cursor compared to Claude Code (since it was post-trained with the CC harness). I personally spent a few hours updating the system prompt in Cursor so that it became more usable.
3. Many people outside of Anthropic feel like Opus 4.5 is another “Sonnet 3.5 moment.”
4. We’ve learned a lot more about how to code with AI since then. Anthropic obviously teaches and sets up best practices internally.
5. There was in fact one person in the study who did accurately predict their uplift (+38%). IIRC they were also the most experienced with coding agents! They wrote a thread on the topic here.
This is not to say that Anthropic employees are definitely getting that high of an uplift, but it may make it a bit more believable.
I was aware of all of that except point 2. I think it undercuts the result “AI models aren’t useful for coding” but it doesn’t undercut the result “people tend to overestimate how much AI is helping them.”
Re: the one guy with the 38% uplift: Did he accurately predict it in advance? I can’t tell from skimming the thread.
Hmm, my thought was that devs (or at least Anthropic folks) have improved their ability to estimate how much AI is helping them since the release of the first truly agentic model. My feeling is that most top-end people should be better calibrated despite the moving target. Most people in the study had spent less than 50 hours with coding agents (except for one of the folks who performed well), so I don’t think we can use the study to say much about how things change over the course of months or a year of usage and training (unless we do another study, I guess).
In terms of the accurate prediction, I don’t recall exactly what made me believe this, but if you look at the first chart in the METR thread, the confidence intervals of the devs’ predicted uplift are below 38%. On average, they thought they were 24% faster at the beginning of the study (so, in fact, he probably underestimated his uplift a bit).
That’s a reasonable point, but, going in the other direction, Anthropic people are probably biased towards overestimating the value of their models in particular.
Like, I’m at like 20% that Anthropic is currently getting 2x or more coding uplift. It’s possible (for the reasons you mention) but I don’t think it’s the most likely scenario.