Very wide confidence intervals. If Grok 4 were in fact equal to o3 on the 50% time horizon, it beating o3 by this much would be a 33% outcome. (On the other hand, losing by this amount in the 80% bucket is a 32% outcome.)
Overall, I read this as about as agentic as o3. Possibly slightly less so, given the lack of published SWE-bench scores for it (suggesting it wasn't SOTA).
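The back-of-the-envelope reasoning above can be sketched with a normal approximation: if two models' measured log time horizons were independent estimates with equal true means, how surprising is the observed gap? All numbers below (the gap and the standard errors) are illustrative placeholders, not METR's actual values:

```python
import math

# Hypothetical observed gap in log2 time horizon (Grok 4 minus o3).
gap = 0.3
# Hypothetical standard errors implied by the (wide) confidence intervals.
se_a, se_b = 0.6, 0.6

# Standard error of the difference of two independent estimates.
se_diff = math.hypot(se_a, se_b)
z = gap / se_diff
# One-sided tail probability P(gap >= observed | equal true means),
# via the complementary error function.
p_beat_by_this_much = 0.5 * math.erfc(z / math.sqrt(2))
print(f"P(gap >= observed | equal means) ≈ {p_beat_by_this_much:.2f}")
```

With standard errors this wide relative to the gap, the tail probability comes out in the ~30% range, i.e. a gap of this size is entirely consistent with the two models being equal.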
Grok 4 is slightly above SOTA on 50% time horizon and slightly below SOTA on 80% time horizon: https://x.com/METR_Evals/status/1950740117020389870
Seems clear that it is below the “faster” projected reasoning model scaling curve.
It looks like inference time scaling is not panning out to be as useful as some hoped / feared.
Degradation on 80% success task length makes me doubt this is any improvement over o3, but perhaps I’m just seeing what I want (and expect) to see.
On the other hand, with an increasing number of data points some kind of exponential task length scaling still seems to hold up.
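The "exponential task length scaling" claim amounts to a log-linear trend: fit log2(time horizon) against release date and read off a doubling time. The data points below are hypothetical placeholders for illustration, not METR's published measurements:

```python
import math

# Hypothetical (release date in years, 50% time horizon in minutes).
# Illustrative placeholders only, NOT METR's actual numbers.
data = [(2023.0, 5.0), (2023.5, 10.0), (2024.0, 22.0),
        (2024.5, 40.0), (2025.0, 90.0)]

# Least-squares fit of log2(horizon) vs. date; the slope is
# doublings per year, so its reciprocal is the doubling time.
xs = [t for t, _ in data]
ys = [math.log2(h) for _, h in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
doubling_time_months = 12.0 / slope
print(f"doubling time ≈ {doubling_time_months:.1f} months")
```

On a log plot an exponential trend is a straight line, so each new data point either tightens or strains the fit; that is what "still seems to hold up" is claiming.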
Did we ever get any clarification as to whether Grok 4 did in fact use as much compute on posttraining as pretraining?