I was surprised to learn recently that the error bars on the METR time horizon chart are this large. This is probably the most important capabilities benchmark right now[1], but I don’t think it’s precise enough to be useful for discussions about AI capabilities progress or RSI.
Why hasn’t METR added more long-horizon tasks to their benchmark since it was released in March 2025? I think they could probably find funding to do this from the labs or EA donors.
E.g. as Daniel Kokotajlo has argued here, and as Benjamin Todd has argued here.
I think they are working on adding new tasks? Not sure. Apparently it’s hard. This concerns me greatly too, because basically their existing benchmark is about to get saturated and we’ll be flying blind again.
My hope is that the entire AI benchmarks industry/literature will reform itself and pick up the ideas METR introduced. Imagine:
--It becomes standard practice for any benchmark-maker to include a human baseline for each task in the benchmark, or at least a statistically significant sample.
--They also include information about the ‘quality’ of the baseliners & crucially, how long the baseliners took to do the task & what the market rate for those people’s time would be.
--It also becomes standard practice for anyone evaluating a model on a benchmark to report how much $ they spent on inference compute & how much clock time it took to complete the task.
If the industry/literature adopts these practices, then every benchmark basically becomes a horizon-length benchmark. We can do a giant meta-analysis that aggregates it all together. Error bars will shrink. And The Graph will continue marching on through 2026 and 2027 instead of being saturated and forgotten.
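To make the "every benchmark becomes a horizon-length benchmark" idea concrete, here is a minimal sketch, assuming you have per-task human baseline times and a model's pass/fail record: fit success probability against log human time and read off where it crosses 50%. The task data below is invented, and METR's real pipeline (task weighting, bootstrapping over task families, etc.) is more involved; this is only the shape of the calculation.

```python
# Toy illustration: turn (human baseline time, model pass/fail) pairs into a
# 50% time horizon by fitting success probability against log human time.
# All numbers are made up for the example.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([2, 5, 8, 15, 30, 45, 60, 120, 240, 480])
model_passed  = np.array([1, 1, 1, 1,  1,  0,  1,  0,   0,   0])

X = np.log2(human_minutes).reshape(-1, 1)      # success falls off roughly in log-time
clf = LogisticRegression().fit(X, model_passed)

# P(success) = sigmoid(w * log2(t) + b); the 50% horizon is where w*log2(t) + b = 0.
w, b = clf.coef_[0, 0], clf.intercept_[0]
print(f"50% time horizon ≈ {2 ** (-b / w):.0f} human-minutes")
```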
No? I contributed a ~20hr task to them and it was pretty easy actually? I’ve been making benchmark-shaped things on and off for the past five years, for free, as a hobby?
(Most of the effort on my end was getting it into METR’s required format, recruiting & managing my playtester, and contemplating whether I was complicit in intellectual fraud[1]; if they’d made those things easier or handled them themselves I’d have made more; IIRC the actual “make a ~20hr task” part took me <20hrs.)
I agree emphatically with all the above, and raise you:
--Saturated benchmarks & benchmark components are released publicly as a matter of course, so people can independently confirm the time horizons are where they were claimed to be.
--‘Centaur’ time horizons (“how hard is this task for a smart human with SoTA LLM assistance?”) are reported alongside ‘pure’ time horizons (“how hard is this task for a smart human on their own?”).
A miscommunication (ETA: the miscommunication was probably at least 50% a me problem) led me to believe they weren’t going to baseline tasks at all, and were relying solely on the estimated times provided by task-makers and playtesters (i.e. people with a financial and ideological stake in reporting larger numbers), rather than using the more complex and less dubious protocol they actually went with. That, combined with my less serious qualms, led me to call it quits before building the other scenarios I had planned for them.
. . . I realize the start of this post reads like a weird brag, but imo it really isn’t. “Hey failed-wannabe-gamedev, I need a bunch of puzzles and it’s ok if they’re not very fun and it’s ok if there’s no UI and it’s actively preferable if they’re ridiculously complicated and time-consuming and spreadsheet-requiring and reminiscent-of-someone’s-dayjob, we’re paying a couple grand apiece” is a pitch I imagine a lot of people would be willing and able to jump at, many much more so than me.
I like your raises!
Why do you think METR hasn’t built more tasks then, if it’s easy? I take it you have a negative opinion of them?
I have no idea; I just don’t think the “actually making the tasks” part can be the limiting factor.
Yes; I also have a positive opinion of them, and various neutral opinions of them.
(My position could be summed up as “the concept of time horizons was really good & important, and their work is net positive, but it could use much stronger methodological underpinning and is currently being leaned on too heavily by too many people”; I’m given to understand that’s also their position on themselves.)
OK. Yeah, that’s my opinion too. Maybe I am one of the people leaning too heavily on their work. The problem is, there isn’t much else to go on. “The worst benchmark for predicting AGI, except for all the others.”
Fwiw I think the AI Village is at least as good a benchmark for predicting AGI! Of course it’s harder to quantify progress in the village, but it’s very helpful for developing intuitions.
Except that there already is the Epoch Capability Index (which aggregates an army of benchmarks) and the ARC-AGI benchmark (which, alas, is also on track to saturation), where the human baseline is decoupled from the time horizon because the tasks rely on visual intelligence (or, in the AIs’ case, on the ability to notice patterns). As for the METR benchmark being saturated[1], maybe Claude Opus 4.5 is an outlier whose time horizon was gamed? Or there is a benign explanation, like Claude failing on primitive tasks in a manner similar to Grok 4, and to Claude’s performance on ARC-AGI-1 failing to form a straight line?
Were the o3-to-GPT-5.1-Codex-Max trend to continue, the 8hr 50% time horizon would be reached around September 2026. IIRC the benchmark doesn’t have tasks lasting longer than 8hrs, so it would only be saturated by then. Alas, the time horizon is likely to stay exponential until the very last couple of doublings.
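(If you want to redo that kind of extrapolation with your own numbers, the arithmetic is just “doublings still needed, times the doubling time.” The starting horizon and doubling time below are placeholder assumptions for illustration, not METR’s published figures.)

```python
# Back-of-the-envelope horizon extrapolation; the inputs are assumptions, not data.
import math

current_horizon_hours = 2.0   # assumed 50% time horizon today
doubling_time_months  = 4.0   # assumed doubling time on the recent trend
target_horizon_hours  = 8.0   # longest task length in the suite, per the comment above

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
print(f"{doublings_needed:.1f} doublings ≈ {doublings_needed * doubling_time_months:.0f} months away")
```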
You should be more uncertain about the METR benchmark’s external validity than these error bars show.
But your baseline uncertainty about key facts about AI progress in general should also often span much more than one order of magnitude between your 2.5th percentile and 97.5th percentile guesses. The METR results add a lot of value, and I don’t think these error bars are a big deal in the scheme of things.
I agree, a lot of my uncertainty is about its external validity, and also about the degree to which models are being bench-maxed on the tasks in the benchmark. But I still think it’s reasonable to expect the statistical confidence intervals for individual models to be narrower than a factor of 10. It’s important to be able to distinguish possible changes to the trend from statistical artifacts. This seems solvable with additional tasks and more human testing.
Some of that error is correlated between models; they also have versions of the graph with error bars on the trendline and those error bars are notably smaller.
The error bars are also much smaller when you look at the plot on a log-y-axis. Like, in some sense not being able to distinguish a 10-minute time horizon from a 30-minute one is a lot of error, but it’s still very distinct from the one-minute time horizon of the previous generation or the 2-hour time horizon you might expect from the next generation. In other words, when you look at the image you shared, the error bars on o4 mini don’t look so bad, but if you were only looking at models up to o4 mini you’d have zoomed in a bunch and the error bars on o4 mini would be large too.
Also note that to cut the size of the error bars in half you’d need to make ~4x as many tasks, and to cut them by 4x you’d need ~16x as many, since sampling error over independent tasks shrinks roughly as 1/√n. And you’d need to be very confident the tasks weren’t buggy, so just throwing money at the wall and hiring lots of people won’t work: you’ll just get a bunch of tasks you won’t have confidence in.
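(The 4x-tasks-for-2x-precision relation is the usual 1/√n scaling of sampling error across roughly independent tasks; a quick simulation makes it concrete. Everything below is simulated, and real benchmarks add correlated and systematic error on top.)

```python
# Rough illustration of why halving the error bars takes ~4x as many tasks:
# the spread of an average over n independent tasks shrinks like 1/sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
for n_tasks in (50, 200, 800):
    # many simulated benchmark runs, each averaging n_tasks noisy task scores
    runs = rng.normal(loc=0.5, scale=0.3, size=(10_000, n_tasks)).mean(axis=1)
    print(f"{n_tasks:4d} tasks -> spread of the estimate ≈ {runs.std():.3f}")
# Each 4x increase in tasks roughly halves the spread (≈0.042, 0.021, 0.011).
```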
Keep in mind the opportunity cost is real, though, and the main blocker for orgs like METR is usually talent/capacity rather than money. It would be great if they had capacity for this, and you’re right that it is insane that humanity doesn’t have better benchmarks. But there are at least a dozen other fires that large that METR seems to be trying to address, like RCTs to see if AI is actually speeding people up, and risk report reviews to see if AIs are actually safe. Perhaps you think these are less important, but if so I would like to hear that argument.
All that said, my understanding is METR is working on this. I would also love to see this type of work from others!