So what did Anthropic do to end up with a lower time horizon than OpenAI while making Claude Sonnet 4.5 excel at the ARC-AGI-2 benchmark? Or does Anthropic have an unreleased Claude Opus 4.5(?) whose time horizon is somewhat bigger? Claude Opus 4's horizon was 80 min and Claude Sonnet 4's was 68 min. If the ratio of Opus to Sonnet horizons stayed the same, Anthropic would have a Claude Opus 4.5 with a time horizon close to GPT-5's…
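As a rough sketch of that extrapolation (a hypothetical back-of-the-envelope calculation, not anything Anthropic has published; Sonnet 4.5's measured horizon is a placeholder you would fill in with METR's reported number):

```python
# Back-of-the-envelope extrapolation, assuming the Opus/Sonnet horizon ratio
# from the 4-generation carries over to 4.5. Purely illustrative.

OPUS_4_HORIZON_MIN = 80.0    # METR 50% time horizon for Claude Opus 4 (from the comment above)
SONNET_4_HORIZON_MIN = 68.0  # METR 50% time horizon for Claude Sonnet 4 (from the comment above)

def estimated_opus_45_horizon(sonnet_45_horizon_min: float) -> float:
    """Scale Sonnet 4.5's horizon by the Opus 4 / Sonnet 4 ratio (~1.18)."""
    return (OPUS_4_HORIZON_MIN / SONNET_4_HORIZON_MIN) * sonnet_45_horizon_min

# Usage: plug in Sonnet 4.5's reported horizon (placeholder, not given above)
# print(estimated_opus_45_horizon(sonnet_45_horizon_min=...))
```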
At this rate we might learn that Anthropic's SC (superhuman coder) is a SAR (superhuman AI researcher; say hi to Agent-4 without Agent-3?) and OpenAI's SC needs to develop research taste...
In my experience using LLM wrapper IDEs (Cursor, Windsurf, etc.), when I ask the model to do a task where one of my assumptions in writing the task was wrong (e.g. I ask it to surface some piece of information to the user in the response to some endpoint, but that information doesn't actually exist until a later step of the process), GPT-5 will spin for a long time and keep modifying my codebase until it gets some result that looks like success if you squint, while Sonnet 4.5 will generally break out of the loop and ask me for clarification.
Sonnet 4.5's behavior is what I want as a user but probably scores worse on the METR benchmark.