Extrapolating this suggests that within about 5 years we will have generalist AI systems that can autonomously complete ~any software or research engineering task that a human professional could do in a few days, as well as a non-trivial fraction of multi-year projects, with no human assistance or task-specific adaptations required.
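This kind of extrapolation can be sketched as a simple doubling-time projection. The numbers below (a 60-minute current horizon, a 7-month doubling time) are hypothetical placeholders for illustration, not figures taken from the paper:

```python
def extrapolated_horizon(current_horizon_minutes: float,
                         doubling_time_months: float,
                         months_ahead: float) -> float:
    """Project a time horizon forward, assuming it keeps doubling at a
    fixed rate (a strong assumption; real trends can bend either way)."""
    doublings = months_ahead / doubling_time_months
    return current_horizon_minutes * 2.0 ** doublings

# Hypothetical: a 60-minute horizon doubling every 7 months, 5 years out.
projected = extrapolated_horizon(60, 7, 60)
print(f"projected horizon: ~{projected / 60 / 24:.0f} days")
```

The projection is exponential in elapsed time, so the result is extremely sensitive to the assumed doubling time.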
However, (...) It’s unclear how to interpret “time needed for humans”, given that this varies wildly between different people, and is highly sensitive to expertise, existing context and experience with similar tasks. For short tasks especially, it makes a big difference whether “time to get set up and familiarized with the problem” is counted as part of the task or not.
(...)
We’ve tried to operationalize the reference human as: a new hire, contractor or consultant; who has no prior knowledge or experience of this particular task/codebase/research question; but has all the relevant background knowledge, and is familiar with any core frameworks / tools / techniques needed.
This hopefully is predictive of agent performance (given that models have likely memorized most of the relevant background information, but won’t have training data on most individual tasks or projects), whilst maintaining an interpretable meaning (it’s hopefully intuitive what a new hire or contractor can do in 10 mins vs 4hrs vs 1 week).
(...)
Some reasons we might be *underestimating* model capabilities include a subtlety around how we calculate human time. In calculating human baseline time, we only use successful baselines. However, a substantial fraction of baseline attempts result in failure. If we use human success rates to estimate the time horizon of our average baseliner, using the same methodology as for models, this comes out to around 1hr—suggesting that current models will soon surpass human performance. (However, we think that baseliner failure rates are artificially high due to our incentive scheme, so this human horizon number is probably significantly too low)
Other reasons include: For tasks that both can complete, models are almost always much cheaper, and much faster in wall-clock time, than humans. This also means that there’s a lot of headroom to spend more compute at test time if we have ways to productively use it—e.g. best-of-k sampling (BoK)
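The "time horizon" being discussed here is the task duration at which the estimated success probability crosses 50%. A minimal sketch of that idea is a logistic fit of success against log task duration, then solving for the 50% crossing point. The fitting procedure and the toy data below are illustrative assumptions, not the paper's actual methodology:

```python
import math

def fit_time_horizon(durations_min, successes, steps=20000, lr=0.1):
    """Fit p(success) = sigmoid(a + b * log2(duration)) by gradient
    descent, then solve a + b * log2(t) = 0 for the duration t at
    which the fitted success probability equals 50%."""
    xs = [math.log2(d) for d in durations_min]
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += (p - y) / n
            grad_b += (p - y) * x / n
        a -= lr * grad_a
        b -= lr * grad_b
    return 2.0 ** (-a / b)  # duration (minutes) where p = 0.5

# Toy runs: (task length in minutes, 1 = success, 0 = failure).
# Short tasks mostly succeed, long tasks mostly fail.
runs = [(5, 1), (10, 1), (15, 1), (30, 1), (60, 1), (60, 0),
        (120, 0), (240, 1), (240, 0), (480, 0), (960, 0)]
horizon = fit_time_horizon([d for d, _ in runs], [s for _, s in runs])
print(f"estimated 50% horizon: ~{horizon:.0f} minutes")
```

Applying the same fit to human baseliner runs (successes and failures alike) is what yields the surprising ~1hr "human horizon" figure discussed below.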
Here’s an interesting thread of tweets from one of the paper’s authors, Elizabeth Barnes.
Quoting the key sections:
That bit at the end about “time horizon of our average baseliner” is a little confusing to me, but I understand it to mean “if we used the 50% reliability metric on the humans we had do these tasks, our model would say humans can’t reliably perform tasks that take longer than an hour”. Which is a pretty interesting point.
That’s basically correct. To give a little more context for why we don’t really believe this number, during data collection we were not really trying to measure the human success rate, just get successful human runs and measure their time. It was very common for baseliners to realize that finishing the task would take too long, give up, and try to collect speed bonuses on other tasks. This is somewhat concerning for biasing the human time-to-complete estimates, but much more concerning for this human time horizon measurement. So we don’t claim the human time horizon as a result.