I find the METR task length measurements very helpful for reasoning about timelines. However, they also seem like approximately the last benchmark I want the labs to be optimizing: high agency over long timescales, specifically focused on tasks relevant to recursive self-improvement. I’m sure that the frontier labs can (and do) run similar internal evals, but raising the salience of this metric (as a comparison point between models from different companies) seems risky. What has METR said about this?
Finally, we’ve optimized the Long Horizon Software Development capability, from the famous METR graph “Don’t Optimize The Long Horizon Software Development Capability”.
Not speaking for anyone else at METR, but I personally think it’s inherently difficult to raise the salience of something like time horizon during a period of massive hype without creating some degree of hype about the benchmark, but that the overall project impact is still highly net positive.
Basically, companies already believe in and explicitly aim for recursive self-improvement (RSI), but the public doesn’t. Therefore, we want to tell the public what labs already believe: that RSI could be technically feasible within a few years, that current AIs can do things that take humans a couple of hours under favorable conditions, and that there’s a somewhat consistent trend. We help the public make use of this info to reduce risks, e.g. by communicating with policymakers and helping companies formulate responsible scaling policies (RSPs), which boosts the ratio of benefit to cost.
You might still think: how large is the cost? Well, the world would look pretty different if investment towards RSI were the primary effect of the time horizon work. Companies would be asking METR how to make models more agentic, enterprise deals would be decided based on time horizon, and we’d see leaked or public roadmaps from companies aiming for 16-hour time horizons by Q2. (Being able to plan for the next node is likely how Moore’s Law sped up semiconductor progress; this is much more difficult to do for time horizon for various reasons.) Also, the amount of misbehavior, especially power-seeking, arising from more agency has been a bit below my expectations, so it’s unlikely we’ll push things above a near-term danger threshold.
If we want to create less pressure towards RSI, it’s not clear what to do. There are some choices we made in the original paper and the current website, like not color-coding the models by company, discussing risks in several sections of the paper, not publishing a leaderboard, keeping many tasks private (though this is largely for benchmark integrity), and adding various caveats and follow-up studies. More drastic options include releasing numbers less frequently, making a worse benchmark, or doing less publicity; none of these seem appealing in the current environment, although they might become so in the future.
Seems reasonable.
I wouldn’t say that people in labs don’t care about benchmarks, but I think the perception of how much we care about them is exaggerated. Frontier labs are now multi-billion-dollar businesses with hundreds of millions of users. A normal user trying to decide whether to use a model from provider A or B doesn’t know or care about benchmark results.
We do have reasons to care about long-horizon tasks in general and tasks related to AI R&D in particular (as we have been open about), but the METR benchmark has nothing to do with it.
See Paul Christiano’s “Thoughts on sharing information about language model capabilities” (back when METR was ARC Evals).