Very wide confidence intervals. If Grok 4 were equal to O3 in 50% time horizon, it “beating” O3 by this much would be a 33% outcome. (On the other hand, losing by this amount in the 80% bucket would be a 32% outcome.)
Overall, I read this as Grok 4 being about as agentic as O3, possibly slightly less so given that no SWE-bench scores were published for it (suggesting it wasn’t SOTA).