Some of that error is correlated between models; METR also has versions of the graph with error bars on the trendline, and those error bars are notably smaller.
The error bars are also much smaller when you look at the plot on a log y-axis. In some sense, not being able to distinguish a 10-minute time horizon from a 30-minute one is a lot of error, but it’s still very distinct from the one-minute time horizon of the previous generation or the 2-hour time horizon you might expect from the next generation. In other words, in the image you shared the error bars on o4-mini don’t look so bad, but if the plot only went up to o4-mini you’d have zoomed in a bunch and its error bars would look large too.
Also note that to cut the size of the error bars in half you’d need ~4x as many tasks, and to cut them by 4x you’d need ~16x as many. And you’d need to be very confident the tasks weren’t buggy, so just throwing money at the problem and hiring lots of people won’t work: you’d just end up with a bunch of tasks you don’t have confidence in.
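To spell out the scaling: if you treat the error bar as roughly a standard error of the mean over $n$ approximately independent tasks with per-task spread $\sigma$ (a simplification, since in practice task results are correlated), then

$$\mathrm{SE}(n) = \frac{\sigma}{\sqrt{n}}, \qquad \frac{\mathrm{SE}(4n)}{\mathrm{SE}(n)} = \frac{1}{2}, \qquad \frac{\mathrm{SE}(16n)}{\mathrm{SE}(n)} = \frac{1}{4}.$$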
Keep in mind the opportunity cost is real, though, and the main blocker for orgs like METR is usually talent and capacity rather than money. It would be great if they had capacity for this, and you’re right that it’s insane that humanity doesn’t have better benchmarks. But there are at least a dozen other fires that large that METR seems to be trying to address, like RCTs to see if AI is actually speeding people up and risk report reviews to see if AIs are actually safe. Perhaps you think those are less important, but if so I would like to hear that argument.
All that said, my understanding is METR is working on this. I would also love to see this type of work from others!
It’s useful for evals to be run reliably for every model and maintained for long periods. A lot of the point of safety-relevant evals is to serve as building blocks people can use for other things: people can make forecasts or bets about what models will score on the eval or what will happen if a certain score is reached, make commitments about what to do if a model hits a certain score, write legislation that applies only to models with specific scores, and advise the world to look to these scores to understand whether risk is high.
Much of that falls apart if there’s FUD about whether a given eval will still exist and be run on the relevant models in a year’s time.
This didn’t use to be an issue, because evals were simple to run: just a short script asking a model a series of multiple-choice questions.
Agentic evals are complex. They require GPUs and containers and scripts that need to be maintained. You need to scaffold your agent and run it for days. Sometimes you need to build a vending machine.
I’m worried about a pattern where a shiny new eval is developed, run for a few months, then discarded in favor of newer, better evals, or where the folks running the evals don’t get around to running them reliably on every model.
As a concrete example, the 2025 AI Forecasting Survey asked people to forecast what the best model’s score on RE-Bench would be by the end of 2025, but RE-Bench hasn’t been run on Claude Opus 4.5 or on many other recent models (METR focuses on their newer, larger time-horizon eval instead). It also asked for forecasted scores on OSWorld, but OSWorld isn’t run anymore (it’s been replaced by OSWorld-Verified).
There are real costs to running these evals, and when one is deprecated it’s usually because it’s been replaced with something better. But I think people sometimes act like deprecating an eval is completely costless, and I want to point out the costs.