As for big contexts and their utility, I tried to cover that in my analysis[1] of the ARC-AGI benchmark in the high-cost regime, where Claude Sonnet 4.5, Grok 4, and GPT-5 Pro form another straight line with a shallow slope.
However, that analysis may have been invalidated by the appearance of Grok 4 Fast on the ARC-AGI-1 benchmark: despite being a low-cost model, it significantly surpassed the Pareto frontier.
I’m distinguishing sequential inference scaling (the length of a single reasoning trace, specifically its output tokens) from parallel scaling (or, more generally, agentic multi-trace interaction processes). Models like GPT-5 Pro work with more tokens per request by running things in parallel; agentic scaffolds generate more tokens by spinning up new instances for subtasks as the situation develops; and a context often holds many more input tokens than the output tokens generated within a single reasoning trace.
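For concreteness, a minimal sketch of the two axes, with `generate()` as a hypothetical stand-in for a model call rather than any real API:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str, max_output_tokens: int) -> str:
    """Hypothetical stand-in for a model call; not a real API."""
    n = random.randint(1, max_output_tokens)
    return f"trace of ~{n} output tokens for: {prompt!r}"

def sequential_scaling(prompt: str) -> str:
    # Sequential inference scaling: a single reasoning trace,
    # scaled by letting it emit more output tokens.
    return generate(prompt, max_output_tokens=100_000)

def parallel_scaling(prompt: str, n_traces: int = 8) -> str:
    # Parallel scaling: many independent traces per request,
    # then some aggregation over them.
    with ThreadPoolExecutor(max_workers=n_traces) as pool:
        traces = list(pool.map(lambda _: generate(prompt, 10_000),
                               range(n_traces)))
    return max(traces, key=len)  # placeholder aggregation rule
```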
Except that the METR evaluation gave GPT-5 a budget of more than a hundred dollars per run and, as far as I understand, a context window of 400K tokens. GPT-5 output tokens are priced at $10 per million.
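A back-of-envelope implication of those two numbers (assuming, counterfactually, that the entire budget went to output tokens and ignoring input-token costs):

$$\frac{\$100}{\$10 \,/\, 10^6 \text{ output tokens}} = 10^7 \text{ output tokens per run,}$$

an upper bound roughly 25x the 400K-token context window, so spending the full budget would have to involve many traces or heavy parallelism rather than one long trace.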