It seems clear that it falls below the “faster” projected reasoning-model scaling curve.
It looks like inference-time scaling is not panning out to be as useful as some hoped (or feared).
The degradation in 80%-success task length makes me doubt this is any improvement over o3, but perhaps I’m just seeing what I want (and expect) to see.
On the other hand, with an increasing number of data points, some kind of exponential task-length scaling still seems to hold up.