Thomas Kwa comments on METR: Measuring AI Ability to Complete Long Tasks

Thomas Kwa 24 Mar 2025 23:24 UTC
4 points
External validity is a huge concern, so we don’t claim anything as ambitious as average knowledge worker tasks. In one sentence, my opinion is that our task suite is fairly representative of well-defined, low-context, measurable software tasks that can be done without a GUI. More speculatively, time horizons measured on this suite are probably within a large (~10x) constant factor of horizons on most other software tasks. We discuss this at much greater length in the paper, especially in Section 7.2.1, “Systematic differences between our tasks and real tasks”. The HCAST paper also has a better description of the dataset.
We didn’t try to make the dataset a perfectly stratified sample of tasks meeting that description, but the dataset is diverse enough that I’m much more concerned about the relevance of HCAST-like tasks to real life than about the relevance of HCAST to the universe of HCAST-like tasks.
What links here?
- Samuel Albanie's comment on METR: Measuring AI Ability to Complete Long Tasks by Zach Stein-Perlman (26 Mar 2025 20:01 UTC; 4 points)