METR’s task length horizon analysis for Claude 4 Opus is out. The 50%-success time horizon is 80 minutes, slightly worse than o3's 90 minutes; the 80%-success horizon is tied with o3 at 20 minutes.
https://x.com/METR_Evals/status/1940088546385436738
That looks like (minor) good news… it appears more consistent with the slower trendline from before reasoning models. Is Claude 4 Opus using a comparable amount of inference-time compute to o3?
I believe I predicted that models would fall behind even the slower exponential trendline (the one from before inference-time scaling) before reaching 8-16 hour tasks. So far that hasn't happened, but obviously it hasn't resolved either.
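For a rough sense of what "reaching 8-16 hour tasks" implies under an exponential trend, here is a minimal sketch of the extrapolation arithmetic. It is not METR's code, and the doubling times used are illustrative assumptions rather than figures from METR or from this thread; only the 80-minute starting horizon comes from the result above.

```python
# Illustrative sketch: time for a 50%-success horizon to grow from its current
# value to a target value, assuming it doubles every `doubling_months` months.
# The doubling times below are assumed for illustration, not measured values.
from math import log2

def months_until_horizon(current_minutes: float,
                         target_minutes: float,
                         doubling_months: float) -> float:
    """Months needed for the horizon to grow from current to target
    under a fixed exponential doubling time."""
    return doubling_months * log2(target_minutes / current_minutes)

current = 80.0      # Claude 4 Opus 50% horizon from the METR result (minutes)
target = 8 * 60.0   # lower end of the 8-16 hour range (minutes)

for doubling in (7.0, 4.0):  # hypothetical slower vs. faster doubling times (months)
    months = months_until_horizon(current, target, doubling)
    print(f"doubling every {doubling:.0f} mo -> ~{months:.1f} months to 8-hour tasks")
```

With these assumed parameters, the gap between the slower and faster trendlines is on the order of a year versus well under a year before 8-hour tasks would be reached, which is why the prediction above has not yet had a chance to resolve either way.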