I think it’s fair to say I predicted this—I expected exponential growth in task length to become a sigmoid in the short term:
In particular, I expected that Claude’s decreased performance on Pokémon with Sonnet 4.5 indicated that its task length would not be very high. Certainly not 30 hours. I understand that Anthropic did not claim 30 hours of human-equivalent work, but their claim of 30 hours of continuous software engineering still seems dubious: what does that number actually mean, if it does not indicate even 2 hours of human-equivalent autonomy? I could write a program that “remains coherent” while “working continuously” for 30 hours simply by throttling GPT-N’s tokens per hour so that it works very slowly, for any N >= 3. This result decreases my trust in Anthropic’s PR machine (which was already pretty low).
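To make the throttling objection concrete, here is a minimal sketch. `generate_next_token` is a hypothetical stand-in for any model call, and the token budget is an assumption; the only thing that makes this run for 30 hours is the sleep:

```python
import time

TARGET_HOURS = 30
TOKEN_BUDGET = 100_000  # assumed: whatever the model would emit in one sitting

def generate_next_token(context: str) -> str:
    # Placeholder for a real model call (any GPT-N API would do here).
    return "token"

def run_for_30_hours(prompt: str) -> list[str]:
    # Spread the fixed token budget evenly over a 30-hour wall-clock window.
    delay = TARGET_HOURS * 3600 / TOKEN_BUDGET  # ~1.08 s between tokens
    output = []
    for _ in range(TOKEN_BUDGET):
        output.append(generate_next_token(prompt + "".join(output)))
        time.sleep(delay)  # this sleep is what makes it "30 hours of work"
    return output
```

Nothing about the delay improves the quality of the work; it only inflates the wall-clock duration, which is the point of the objection.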
To be clear, this is only one data point and I may well be proven wrong very soon.
However, I think we can say that the “faster exponential” for inference scaling that some people expected has not held up.
I am afraid this is comparing apples (Anthropic’s models) to oranges (OpenAI’s models). Claude Sonnet 4 had a time horizon of 68 minutes, Claude Opus 4 of 80 minutes, and Claude Sonnet 4.5 of 113 minutes, on par[1] with Grok 4. Similarly, o4-mini and o3 had horizons of 78 and 92 minutes, and GPT-5 (aka o4?) of 137 minutes. Had it not been for spurious failures, GPT-5 could have reached a horizon of 161 minutes,[2] and details on the evaluation of other models have yet to be published.
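To make the apples-to-apples point concrete, here is a trivial within-lab computation from the figures quoted above. Release dates aren’t given here, so this yields per-generation growth factors rather than doubling times:

```python
# Time horizons quoted above, in minutes.
horizons = {
    "Claude Sonnet 4": 68,
    "Claude Opus 4": 80,
    "Claude Sonnet 4.5": 113,
    "o4-mini": 78,
    "o3": 92,
    "GPT-5": 137,
}

# Growth factor within each lab's own lineage.
print("Anthropic, Sonnet 4 -> Sonnet 4.5:",
      round(horizons["Claude Sonnet 4.5"] / horizons["Claude Sonnet 4"], 2), "x")
print("OpenAI, o3 -> GPT-5:",
      round(horizons["GPT-5"] / horizons["o3"], 2), "x")
# -> roughly 1.66x and 1.49x respectively
```

On these numbers the latest within-lab jumps are broadly comparable, which is why judging the trend from a single cross-lab comparison can mislead.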
While xAI might be algorithmically behind, they disclosed that Grok achieved these results because xAI spent similar amounts of compute on pretraining and RL. Alas, that disclosure doesn’t rule out exponential growth merely slowing down instead of becoming fully sigmoidal.
The graph is slightly more informative: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Ironically, the latter figure (161 minutes) is almost on par with Greenblatt’s prediction of a horizon 15 minutes shorter than twice o3’s horizon, i.e. 2 × 92 − 15 = 169 minutes.