I’m distinguishing sequential inference scaling (the length of a single reasoning trace, specifically its output tokens) from parallel scaling (or, more generally, agentic multi-trace interaction processes). Models like GPT-5 Pro will work with more tokens per request by running traces in parallel, agentic scaffolds will generate more tokens by spinning up new instances for subtasks as the situation develops, and a context often contains many more input tokens than the output tokens generated within a single reasoning trace.