Claude Sonnet 4.5’s 50% task horizon is 1 hr 53 min, putting it slightly behind GPT-5’s 2 hr 15 min score.
https://x.com/METR_Evals/status/1976331315772580274
The graph is slightly more informative: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
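For readers who haven’t looked at the methodology: as I understand METR’s approach (a rough paraphrase, not their exact parameterization), the 50% time horizon comes from fitting a logistic curve of the model’s success probability against the log of how long each task takes a human professional, then reading off where that curve crosses one half:

```latex
% Sketch of the 50% time-horizon definition (my paraphrase of METR's method).
% t = time a human professional needs for the task; beta_0, beta_1 are fitted.
P(\text{success} \mid t) \;\approx\; \sigma\!\left(\beta_0 - \beta_1 \log_2 t\right),
\qquad
t_{50\%} \;=\; 2^{\,\beta_0 / \beta_1}.
```

So the headline numbers below are human-equivalent task lengths, not wall-clock hours of agent runtime.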
I think it’s fair to say I predicted this—I expected exponential growth in task length to become a sigmoid in the short term:
In particular, I expected that Claude’s decreased performance on Pokemon with Sonnet 4.5 indicated that its task length would not be very high, and certainly not 30 hours. I understand that Anthropic did not claim 30 hours of human-equivalent work, but I still find their claim of 30 hours of continuous software engineering dubious: what does that number actually mean if it does not indicate even 2 hours of human-equivalent autonomy? I can write a program that “remains coherent” while “working continuously” for 30 hours simply by capping GPT-N’s tokens per hour so it works very slowly, for any N >= 3. This result decreases my trust in Anthropic’s PR machine (which was already pretty low).
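To make that reductio concrete, here is a minimal sketch (hypothetical helper names; nothing here is a real API) of how “working continuously for 30 hours” can be manufactured by capping token throughput, independent of how capable the underlying model is:

```python
import time

def throttled_run(generate_step, total_hours=30.0, tokens_per_hour=200):
    """Stretch any model's output over `total_hours` of wall-clock time
    by capping throughput. `generate_step` stands in for a call that
    returns the model's next small chunk of output (a few tokens, one
    tool call, etc.); its capability is irrelevant to the headline number.
    """
    seconds_per_token = 3600.0 / tokens_per_hour
    deadline = time.time() + total_hours * 3600
    transcript = []
    while time.time() < deadline:
        chunk = generate_step()                      # a tiny bit of real work
        transcript.append(chunk)
        time.sleep(seconds_per_token * max(1, len(chunk.split())))  # then stall
    return transcript
```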
To be clear, this is only one data point and I may well be proven wrong very soon.
However, I think we can say that the “faster exponential” from inference scaling that some people expected did not hold up.
I am afraid that this is comparing apples (Anthropic’s models) to oranges (OpenAI’s models). Claude Sonnet 4 had a time horizon of 68 minutes, Claude Opus 4 of 80 minutes, and Claude Sonnet 4.5 of 113 minutes, on par[1] with Grok 4. Similarly, o4-mini and o3 had horizons of 78 and 92 minutes respectively, and GPT-5 (aka o4?) reached 137 minutes. Had it not been for spurious failures, GPT-5 could have reached a horizon of 161 min,[2] and details on the evaluation of other models have yet to be published.
While xAI might be algorithmically behind, they have acknowledged that Grok 4 reached these results because xAI spent similar amounts of compute on pretraining and on RL. Alas, that admission doesn’t rule out exponential growth merely slowing down instead of becoming fully sigmoidal.
Ironically, the latter figure (161 min) is almost on par with Greenblatt’s prediction of a horizon 15 minutes shorter than twice o3’s: 2 × 92 − 15 = 169 min.
So what did Anthropic do to end up with a lower time horizon than OpenAI while making Claude Sonnet 4.5 excel at the ARC-AGI-2 benchmark? Or does Anthropic have an unreleased Claude Opus 4.5(?) whose time horizon is slightly bigger? The horizon of Claude Opus 4 was 80 min and that of Claude Sonnet 4 was 68 min; were the Opus-to-Sonnet ratio to stay the same, Anthropic would have a Claude Opus 4.5 with a time horizon matching GPT-5’s…
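Spelling out that extrapolation with the figures quoted above (a back-of-the-envelope sketch, not anything Anthropic has announced):

```latex
\text{horizon}_{\text{Opus 4.5}}
\;\approx\; \underbrace{113\ \text{min}}_{\text{Sonnet 4.5}}
\times \underbrace{\tfrac{80}{68}}_{\text{Opus 4 / Sonnet 4}}
\;\approx\; 133\ \text{min},
```

which is within a few minutes of GPT-5’s measured 137 min.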
At this rate we might learn that Anthropic’s SC (superhuman coder) is a SAR (superhuman AI researcher; say hi to Agent-4 without Agent-3?) and OpenAI’s SC needs to develop research taste...
In my experience using LLM-wrapper IDEs (Cursor, Windsurf, etc.), if I ask the model to do some task where one of the assumptions I made when writing the task was wrong (e.g. I ask it to surface some piece of information in the response to some endpoint, but that piece of information doesn’t actually exist until a later step of the process), GPT-5 will spin for a long time and keep doing things to my codebase until it gets some result that looks like success if you squint, while Sonnet 4.5 will generally break out of the loop and ask me for clarification.
Sonnet 4.5’s behavior is what I want as a user but probably scores worse on the METR benchmark.
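A minimal hypothetical example of the kind of task I mean (invented endpoint and field names; FastAPI is just for illustration): the prompt asks the agent to surface `shipping_cost` in the quote endpoint’s response, but that value only comes into existence at the later checkout step.

```python
# Hypothetical codebase sketch: the field the prompt asks for does not exist
# at the step the prompt assumes it does.
from fastapi import FastAPI

app = FastAPI()

# In-memory stand-in for a database; no shipping info at the quote stage.
QUOTES = {"q1": {"items": ["widget"], "subtotal": 40.0}}

@app.get("/quotes/{quote_id}")
def get_quote(quote_id: str):
    # Task given to the agent: "also return shipping_cost here."
    # But shipping_cost does not exist yet -- it is only computed in the
    # checkout step below, once a delivery address is known.
    return QUOTES[quote_id]

@app.post("/quotes/{quote_id}/checkout")
def checkout(quote_id: str, address: str):
    quote = QUOTES[quote_id]
    shipping_cost = 5.0 if "local" in address else 15.0  # computed only now
    return {**quote, "shipping_cost": shipping_cost, "address": address}
```

An agent optimizing for looking done might hard-code a placeholder cost into `get_quote`; asking which behavior I actually want is the better outcome, even if it ends the episode “unfinished.”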