I was one of the developers in the @METR_Evals study. Some thoughts:
1. This is much less true of my participation in the study, where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I’d look at something else (FB, X, etc.) and continue doing so for much longer than it took the prompt to run.
I discovered two days ago that Cursor has (or now has) a feature you can enable to ring a bell when the prompt is done. I expect to reclaim a lot of the AI gains this way.
2. Historically I’ve lost some of my AI speed-ups to cleaning up the same issues LLM code would introduce, often relatively simple violations of code conventions like using || instead of ?? (there’s a quick illustration after this list).
A bunch of this is avoidable with stored system prompts, which I was lazy about writing. Cursor has now made this easier and even attempts to learn repeatable rules (“The user prefers X”) that get reused, saving time here.
3. Regarding me specifically, I work on the LessWrong codebase, which is technically open-source. I feel like calling myself an “open-source developer” has the wrong connotations, and makes it sound more like I contribute to a highly-used Python library or something as an upper-tier developer, which I’m not.
4. As a developer in the study, I find it striking how much more capable the models have gotten since February (when I was participating in the study).
I’m trying to recall if I was even using agents at the start. Certainly the later models (Opus 4, Gemini 2.5 Pro, o3) could just do vastly more with less guidance than 3.6, o1, etc.
For me, not having gone over my own data in the study, I could buy that maybe I was being slowed down a few months ago, but it is much, much harder to believe now.
5. There was a selection effect in which tasks I submitted to the study. (a) I didn’t want to risk getting randomized to “no AI” on tasks that felt sufficiently important or daunting to do without AI assistance. (b) Neatly packaged and well-scoped tasks felt suitable for the study; large, open-ended greenfield stuff felt harder to legibilize, so I didn’t submit those tasks to the study even though the AI speed-up might have been larger.
6. I think if the result is valid at this point in time, that’s one thing; if people are still citing it in another three months’ time, they’ll be making a mistake (and I hope METR will have published a follow-up by then).
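For concreteness on point 2, here’s the kind of || vs. ?? issue I mean. This is an illustrative TypeScript sketch with made-up names, not code from the LessWrong repo:

```typescript
// `||` falls back whenever the left-hand side is falsy, so legitimate values
// like 0 or "" get silently replaced; `??` falls back only on null or undefined.
const configuredLimit = 0; // a deliberate, valid setting

const withOr = configuredLimit || 25;      // 25 (the explicit 0 is lost)
const withNullish = configuredLimit ?? 25; // 0 (the default applies only to null/undefined)

console.log(withOr, withNullish); // prints: 25 0
```

A stored rule along the lines of “prefer ?? over || when supplying defaults” is exactly the kind of repeatable preference that heads this off.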
Apologies for the impoliteness, but… man, it sure sounds like you’re searching for reasons to dismiss the study results. Which sure is a red flag when the study results basically say “your remembered experience is that AI sped you up, and your remembered experience is unambiguously wrong about that”.
Like, look, when someone comes along with a nice clean study showing that your own brain is lying to you, that has got to be one of the worst possible times to go looking for reasons to dismiss the study.
I’m not pushing against the study results so much as what I think are misinterpretations people are going to draw from this study.
If the claim is “on a selected kind of task, developers in early 2025 predominantly using models like Claude 3.5 and 3.7 were slowed down when they thought they were sped up”, then I’m not dismissing that. I don’t think the study is that clean or unambiguous given the methodological challenges, but I find it quite plausible.
In the above, I do only the following: (1) offer an explanation for the result, (2) point out that I individually feel misrepresented by a particular descriptor, (3) point out and affirm points the authors also make, (a) about this being a point-in-time result and (b) about there being selection effects at play.
You can say that if I feel now that I’m being sped up, I should be less confident given the results, and to that I say yeah, I am. And I’m surprised by the result too.
There’s a claim you’re making here, that “I went looking for reasons”, which feels weird. I don’t take it that whenever a result is “your remembered experience is wrong”, I’m being epistemically unvirtuous if I question it or discuss details. To repeat, I question the interpretation/generalization some might make rather than the raw result or even what the authors interpret it as, and I think as a participant I’m better positioned to notice the misgeneralization than someone who has only heard the headline result (people reading the actual paper probably end up in the right place).
Fwiw, for me the calculated individual speed-up was [-200%, 40%], which, while it does weight predominantly toward the negative, still includes positive values (I got these numbers from the authors after writing my above comments). I’m not sure if that counts as unambiguously wrong about my remembered experience.
I think it doesn’t—our dev effects are so, so noisy!

This is much less true of my participation in the study, where I was more conscientious, but I feel like historically a lot of my AI speed-up gains were eaten by the fact that while a prompt was running, I’d look at something else (FB, X, etc.) and continue doing so for much longer than it took the prompt to run.

The temptation to multitask strikes me as a very likely cause of the loss of productivity. It is why I virtually never use reasoning models except for deep research.

I (study author) responded to some of Ruby’s points on Twitter. Delighted for devs, including Ruby, to discuss their experience publicly; I think it’s helpful for people to get a richer sense!