Very wide confidence intervals. If Grok 4 were in fact equal to o3 on the 50% time horizon, it beating o3 by this much would be a 33% outcome. (On the other hand, losing by this amount in the 80% bucket is a 32% outcome.)
Overall, I read this as about as agentic as o3. Possibly slightly less so, given the lack of published SWE-bench scores for it (suggesting it wasn't SOTA).
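The back-of-the-envelope reasoning above can be sketched with a normal approximation: if two models' measured log time horizons were independent estimates with equal true means, how surprising is the observed gap? All numbers below (the gap and the standard errors) are illustrative placeholders, not METR's actual values:

```python
import math

# Hypothetical observed gap in log2 time horizon (Grok 4 minus o3).
gap = 0.3
# Hypothetical standard errors implied by the (wide) confidence intervals.
se_a, se_b = 0.6, 0.6

# Standard error of the difference of two independent estimates.
se_diff = math.hypot(se_a, se_b)
z = gap / se_diff
# One-sided tail probability P(gap >= observed | equal true means),
# via the complementary error function.
p_beat_by_this_much = 0.5 * math.erfc(z / math.sqrt(2))
print(f"P(gap >= observed | equal means) ≈ {p_beat_by_this_much:.2f}")
```

With standard errors this wide relative to the gap, the tail probability comes out in the ~30% range, i.e. a gap of this size is entirely consistent with the two models being equal.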
Grok 4 is slightly above SOTA on 50% time horizon and slightly below SOTA on 80% time horizon: https://x.com/METR_Evals/status/1950740117020389870
Seems clear that it is below the “faster” projected reasoning model scaling curve.
It looks like inference time scaling is not panning out to be as useful as some hoped / feared.
Degradation on 80% success task length makes me doubt this is any improvement over o3, but perhaps I’m just seeing what I want (and expect) to see.
On the other hand, with an increasing number of data points some kind of exponential task length scaling still seems to hold up.
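The "exponential task length scaling" claim amounts to a log-linear trend: fit log2(time horizon) against release date and read off a doubling time. The data points below are hypothetical placeholders for illustration, not METR's published measurements:

```python
import math

# Hypothetical (release date in years, 50% time horizon in minutes).
# Illustrative placeholders only, NOT METR's actual numbers.
data = [(2023.0, 5.0), (2023.5, 10.0), (2024.0, 22.0),
        (2024.5, 40.0), (2025.0, 90.0)]

# Least-squares fit of log2(horizon) vs. date; the slope is
# doublings per year, so its reciprocal is the doubling time.
xs = [t for t, _ in data]
ys = [math.log2(h) for _, h in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
doubling_time_months = 12.0 / slope
print(f"doubling time ≈ {doubling_time_months:.1f} months")
```

On a log plot an exponential trend is a straight line, so each new data point either tightens or strains the fit; that is what "still seems to hold up" is claiming.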
Did we ever get any clarification as to whether Grok 4 did in fact use as much compute on posttraining as pretraining?