I find the METR task length measurements very helpful for reasoning about timelines. However, they also seem like approximately the last benchmark I want the labs to be optimizing: high agency over long timescales, specifically focused on tasks relevant to recursive self-improvement. I’m sure that the frontier labs can (and do) run similar internal evals, but raising the salience of this metric (as a comparison point between models from different companies) seems risky. What has METR said about this?
Finally, we’ve optimized the Long Horizon Software Development capability, from the famous METR graph “Don’t Optimize The Long Horizon Software Development Capability”.
Not speaking for anyone else at METR, but I personally think it’s inherently difficult to raise the salience of something like time horizon during a period of massive hype without creating some degree of hype about the benchmark, but that the overall project impact is still highly net positive.
Basically, companies already believe in and explicitly aim for recursive self-improvement (RSI), but the public doesn’t. Therefore, we want to tell the public what labs already believe: that RSI could be technically feasible within a few years, that current AIs can do things that take humans a couple of hours under favorable conditions, and that there’s a somewhat consistent trend. We help the public make use of this info to reduce risks, e.g. by communicating with policymakers and helping companies formulate responsible scaling policies (RSPs), which boosts the ratio of benefit to cost.
You might still think: how large is the cost? Well, the world would look pretty different if investment towards RSI were the primary effect of the time horizon work. Companies would be asking METR how to make models more agentic, enterprise deals would be decided based on time horizon, and we’d see leaked or public roadmaps from companies aiming for 16-hour time horizons by Q2. (Being able to plan for the next node is likely how Moore’s Law sped up semiconductor progress; this is much more difficult to do for time horizon for various reasons.) Also, the amount of misbehavior, especially power-seeking, arising from more agency has been a bit below my expectations, so it’s unlikely we’ll push things above a near-term danger threshold.
If we want to create less pressure towards RSI, it’s not clear what to do. There are some choices we made in the original paper and the current website, like not color-coding the models by company, discussing risks in several sections of the paper, not publishing a leaderboard, keeping many tasks private (though this is largely for benchmark integrity), and adding various caveats and follow-up studies. More drastic options include releasing numbers less frequently, making a worse benchmark, or doing less publicity; none of these seem appealing in the current environment, although they might become so in the future.
Seems reasonable.
I wouldn’t say that people in labs don’t care about benchmarks, but I think the perception of how much we care about them is exaggerated. Frontier labs are now multi-billion-dollar businesses with hundreds of millions of users. A normal user trying to decide whether to use a model from provider A or B doesn’t know or care about benchmark results.
We do have reasons to care about long-horizon tasks in general and tasks related to AI R&D in particular (as we have been open about), but the METR benchmark has nothing to do with it.
See Paul Christiano’s “Thoughts on sharing information about language model capabilities” (back when METR was ARC Evals).