Some of that error is correlated between models; METR also has versions of the graph with error bars on the trendline, and those error bars are notably smaller.
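To make the "correlated error" point concrete, here's a toy simulation (my own sketch with made-up numbers, not METR's actual methodology): when much of each model's error comes from a shared task-level component, bootstrapping over tasks shifts every model's point together, so the fitted trend is much more stable than the individual error bars would suggest.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_tasks = 8, 150
true_slope = 0.3                      # log2(horizon) gained per model generation
gen = np.arange(n_models)

def trend_and_point_se(shared_sd, indep_sd, n_boot=2000):
    # Per-task difficulty offset shared by every model, plus model-specific noise.
    shared = rng.normal(0, shared_sd, n_tasks)
    indep = rng.normal(0, indep_sd, (n_models, n_tasks))
    scores = true_slope * gen[:, None] + shared[None, :] + indep
    slopes, point_means = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, n_tasks, n_tasks)      # bootstrap over tasks
        y = scores[:, idx].mean(axis=1)              # one "horizon" estimate per model
        slopes.append(np.polyfit(gen, y, 1)[0])      # trendline slope
        point_means.append(y[-1])                    # error bar on the newest model
    return np.std(slopes), np.std(point_means)

# Similar per-model error bars, very different trendline error bars:
print(trend_and_point_se(shared_sd=1.0, indep_sd=0.2))  # mostly correlated error
print(trend_and_point_se(shared_sd=0.0, indep_sd=1.0))  # fully independent error
```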
The error bars also look much smaller when you put the plot on a log y-axis. Like, in some sense not being able to distinguish a 10-minute time horizon from a 30-minute one is a lot of error, but either value is still very distinct from the one-minute horizon of the previous generation or the 2-hour horizon you might expect from the next one. In other words, in the image you shared the error bars on o4 mini don’t look so bad, but if the plot only went up to o4 mini the axis would be zoomed in a lot and those same error bars would look large too.
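A quick back-of-the-envelope using the numbers above (just the ones in this comment, nothing from METR's data) shows why the log scale matters: the 10–30 minute uncertainty band is narrower than the gaps to the neighbouring generations.

```python
import numpy as np

# Horizons in minutes: previous generation, low/high ends of the uncertain
# estimate, and a plausible next generation.
prev_gen, low, high, next_gen = 1, 10, 30, 120

# On a log scale, what matters is widths in decades (factors of 10):
ci_width = np.log10(high / low)        # ~0.48 decades of uncertainty
gap_below = np.log10(low / prev_gen)   # ~1.0 decade down to the previous gen
gap_above = np.log10(next_gen / high)  # ~0.6 decades up to the next gen
print(ci_width, gap_below, gap_above)
```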
Also note that to cut the size of the error bars in half you’d need to make ~4x as many tasks, and to cut them by 4x you’d need ~16x as many. And you’d need to be very confident the tasks weren’t buggy, so just throwing money at the problem and hiring lots of people won’t work; you’d mostly end up with a pile of tasks you can’t have confidence in.
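Here's what that scaling looks like in a toy bootstrap (purely illustrative numbers): the standard error of a benchmark score shrinks like 1/sqrt(number of tasks), so halving it costs ~4x the tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(n_tasks, n_boot=5000, task_sd=1.0):
    # Simulated per-task scores for one model; the "error bar" is the
    # bootstrap standard error of the mean score across tasks.
    scores = rng.normal(0.5, task_sd, n_tasks)
    means = [scores[rng.integers(0, n_tasks, n_tasks)].mean() for _ in range(n_boot)]
    return np.std(means)

for n in (100, 400, 1600):   # 4x more tasks at each step
    print(f"{n:5d} tasks -> error bar ~ {bootstrap_se(n):.3f}")
# The error bar roughly halves at each step: SE scales as 1/sqrt(n_tasks).
```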
Keep in mind the opportunity cost is real, though, and the main blocker for orgs like METR is usually talent/capacity rather than money. It would be great if they had the capacity for this, and you’re right that it is insane that humanity doesn’t have better benchmarks. But there are at least a dozen other fires that large that METR seems to be trying to address, like RCTs to see whether AI is actually speeding people up and risk-report reviews to see whether AIs are actually safe. Perhaps you think those are less important, but if so I would like to hear that argument.
All that said, my understanding is METR is working on this. I would also love to see this type of work from others!