Curated. Comparing model performance on tasks to the time human experts need to complete the same tasks (at fixed reliability) is worth highlighting, since it helps operationalize terms like "human-level AI" and AI capability levels in general. Furthermore, by making this empirical comparison and discovering a 7-month doubling time, this work significantly reduces our uncertainty both about when to expect certain capabilities and, more impressively in my view, about how to conceptualize those AI capability levels. That is, on top of reducing our uncertainty, I think this work also provides a good general format/frame for reporting AI capabilities forecasts, e.g., we have X years until models can do things that take human experts Y hours to do with reliability Z%.
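The X/Y/Z frame in that last sentence amounts to a simple extrapolation under a constant doubling time. A minimal sketch, using made-up starting values rather than METR's actual figures:

```python
import math

def years_until_horizon(current_horizon_hours: float,
                        target_horizon_hours: float,
                        doubling_time_months: float = 7.0) -> float:
    """Years until the task-length horizon (at some fixed reliability)
    reaches the target, assuming a constant doubling time."""
    doublings = math.log2(target_horizon_hours / current_horizon_hours)
    return doublings * doubling_time_months / 12.0

# Illustrative assumption: if models currently handle 1-hour tasks at the
# chosen reliability, reaching 40-hour (one work week) tasks takes
# log2(40) ~ 5.3 doublings, i.e. roughly 3.1 years at a 7-month doubling time.
print(round(years_until_horizon(1.0, 40.0), 1))
```

The starting horizon, target, and reliability threshold here are all placeholders; the point is just that the frame reduces a capability forecast to two parameters, a current horizon and a doubling time.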
I also appreciated the discussions this post inspired about whether we should expect the slope in log-space to change, and if so in which direction, as well as the related discussion about whether we should expect this trend to go superexponential. Interesting arguments and models were put forth in both discussions.
I hope METR explores other methods for concretizing, operationalizing, and forecasting AI capability levels in the future. For example, comparing human expert reliability within specific task domains to model reliability on tasks in those same domains, or comparing the time humans take to become reliable experts in a domain to model reliability there.