I think they are working on adding new tasks? Not sure. Apparently it’s hard. This concerns me greatly too, because basically their existing benchmark is about to get saturated and we’ll be flying blind again.
My hope is that the entire AI benchmarks industry/literature will reform itself and pick up the ideas METR introduced. Imagine:
--It becomes standard practice for any benchmark-maker to include a human baseline for each task in the benchmark, or at least a statistically significant sample.
--They also include information about the ‘quality’ of the baseliners & crucially, how long the baseliners took to do the task & what the market rate for those people’s time would be.
--It also becomes standard practice for anyone evaluating a model on a benchmark to report how much $ they spent on inference compute & how much clock time it took to complete the task.
If the industry/literature adopts these practices, then every benchmark basically becomes a horizon-length benchmark. We can do a giant meta-analysis that aggregates it all together. Error bars will shrink. And The Graph will continue marching on through 2026 and 2027 instead of being saturated and forgotten.
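To make the "every benchmark becomes a horizon-length benchmark" idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `TaskResult` fields, the benchmark names, and the toy numbers are made up, and the fit is a plain logistic regression of success against log2(human time), which is only one way a 50% horizon could be estimated and is not claimed to be METR's exact procedure.

```python
import math
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One model attempt at one task, with the extra fields proposed above (hypothetical schema)."""
    benchmark: str            # which benchmark the task came from
    human_minutes: float      # how long the human baseliner took
    human_hourly_usd: float   # market rate for the baseliner's time
    model_succeeded: bool     # did the model complete the task?
    model_cost_usd: float     # inference spend on the attempt
    model_minutes: float      # wall-clock time the model took

    def human_cost_usd(self) -> float:
        return self.human_minutes / 60 * self.human_hourly_usd


def fifty_percent_horizon(results, steps=5000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * x), x = centered log2(human minutes),
    by plain gradient ascent; return the task length (minutes) where P = 0.5."""
    xs = [math.log2(r.human_minutes) for r in results]
    ys = [1.0 if r.model_succeeded else 0.0 for r in results]
    mean_x = sum(xs) / len(xs)
    a = b = 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(a + b * (x - mean_x))))
            grad_a += y - p
            grad_b += (y - p) * (x - mean_x)
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)
    # P = 0.5 where a + b * (x - mean_x) = 0.
    return 2 ** (mean_x - a / b)


# Toy data pooled from two made-up benchmarks: task lengths of 1..512 minutes.
outcomes = [True, True, True, True, False, True, False, False, False, False]
pooled = [
    TaskResult("bench_a" if i % 2 == 0 else "bench_b", 2.0 ** i, 100.0, ok, 1.0, 5.0)
    for i, ok in enumerate(outcomes)
]
print(f"pooled 50% horizon: {fifty_percent_horizon(pooled):.1f} human-minutes")
```

The point of the sketch is that once every result carries a human baseline time, results from unrelated benchmarks can go into the same fit, and the cost and wall-clock fields allow a like-for-like comparison against what the human would have cost.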
No? I contributed a ~20hr task to them and it was pretty easy actually? I’ve been making benchmark-shaped things on and off for the past five years, for free, as a hobby?
(Most of the effort on my end was getting it into METR’s required format, recruiting & managing my playtester, and contemplating whether I was complicit in intellectual fraud[1]; if they’d made those things easier or handled them themselves I’d have made more; IIRC the actual “make a ~20hr task” part took me <20hrs.)
I agree emphatically with all the above and raise you
--Saturated benchmarks & benchmark components are released publicly as a matter of course, so people can independently confirm the time horizons are where they were claimed to be.
--‘Centaur’ time horizons (“how hard is this task for a smart human with SoTA LLM assistance?”) are reported alongside ‘pure’ time horizons (“how hard is this task for a smart human on their own?”).
[1] A miscommunication (ETA: miscommunication was probably at least 50% a me problem) led me to believe they weren’t going to baseline tasks at all, and were relying solely on the estimated times provided by task-makers and playtesters (i.e. people with a financial and ideological stake in reporting larger numbers), instead of using the more complex and less dubious protocol they actually went with; this, combined with my less serious qualms, led me to call it quits before building the other scenarios I had planned for them.
. . . I realize the start of this post reads like a weird brag but imo it really isn’t. “Hey failed-wannabe-gamedev, I need a bunch of puzzles and it’s ok if they’re not very fun and it’s ok if there’s no UI and it’s actively preferable if they’re ridiculously complicated and time-consuming and spreadsheet-requiring and reminiscent-of-someone’s-dayjob, we’re paying a couple grand apiece” is a pitch I imagine a lot of people would be willing and able to jump at, many much more so than me.
I like your raises!
Why do you think METR hasn’t built more tasks then, if it’s easy? I take it you have a negative opinion of them?
I have no idea, I just don’t think the “actually making the tasks” part can be the limiting factor.
Yes; I also have a positive opinion of them, and various neutral opinions of them.
(My position could be summed up as “the concept of time horizons was really good & important, and their work is net positive, but it could use much stronger methodological underpinning and is currently being leaned on too heavily by too many people”; I’m given to understand that’s also their position on themselves.)
OK. Yeah, that’s my opinion too. Maybe I am one of the people leaning too heavily on their work. The problem is, there isn’t much else to go on. “The worst benchmark for predicting AGI, except for all the others.”
Fwiw I think the AI Village is at least as good a benchmark for predicting AGI! Of course it’s harder to quantify progress in the Village, but it’s very helpful for developing intuitions.
Except that there already are the Epoch Capability Index (which aggregates an army of benchmarks) and the ARC-AGI benchmark (which, alas, is also on track to saturate), where the human baseline is decoupled from the time horizon because it relies on visual intelligence (or, in the case of the AIs, on the ability to notice patterns). As for the METR benchmark being saturated[1], maybe Claude Opus 4.5 is an outlier whose time horizon was gamed? Or there is a benign explanation, like Claude failing on primitive tasks much as Grok 4 did, and much as Claude’s performance on ARC-AGI-1 fails to form a straight line?
[1] Were the o3-GPT5.1CodexMax trend to continue forever, the 8hr 50% time horizon would be reached in September 2026. IIRC the benchmark doesn’t have tasks lasting longer than 8hrs, so it would only be saturated by then. Alas, the time horizon is likely exponential until the very last couple of doublings.
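For anyone who wants to redo this kind of extrapolation with their own inputs, the arithmetic is just "number of doublings left, times the doubling time". A minimal Python sketch follows; the function name is hypothetical, and the starting horizon, start date, and doubling time in the example are placeholders chosen to make the arithmetic easy to check, not the actual values behind the September 2026 figure.

```python
import math
from datetime import date, timedelta


def date_horizon_reaches(target_hours, current_hours, as_of, doubling_days):
    """Extrapolate a purely exponential trend: the date on which the 50% time
    horizon would reach target_hours, given its value on the date as_of."""
    doublings_needed = math.log2(target_hours / current_hours)
    return as_of + timedelta(days=round(doublings_needed * doubling_days))


# Placeholder inputs (NOT the real trend): a 1hr horizon on 2025-01-01,
# doubling every 120 days, needs 3 doublings (360 days) to reach 8hrs.
print(date_horizon_reaches(8, 1, date(2025, 1, 1), 120))  # 2025-12-27
```

Note that this assumes the trend stays exponential all the way to the target, which is exactly the assumption the footnote is hedging on.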