I am afraid that this is comparing apples (Anthropic’s models) to oranges (OpenAI’s models). Claude Sonnet 4 had a time horizon of 68 minutes, Claude Opus 4 had 80 minutes, and Claude Sonnet 4.5 had 113 minutes, on par[1] with Grok 4. Similarly, o4-mini and o3 had horizons of 78 and 92 minutes, while GPT-5 (aka o4?) reached 137 minutes. Had it not been for spurious failures, GPT-5 could have reached a horizon of 161 minutes,[2] and details on the evaluation of other models have yet to be published.
While xAI might be algorithmically behind, they admitted that Grok 4 reached these results because xAI spent similar amounts of compute on pretraining and RL. Alas, that admission doesn’t rule out exponential growth merely slowing down instead of becoming fully sigmoidal.
Ironically, the latter figure is almost on par with Greenblatt’s prediction of a horizon 15 minutes shorter than twice the horizon of o3.
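For concreteness, that comparison works out as follows (a quick arithmetic check using the 92-minute o3 horizon and the 161-minute counterfactual GPT-5 horizon cited above):

$$2 \times 92\ \text{min} - 15\ \text{min} = 169\ \text{min} \approx 161\ \text{min}.$$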