I think the commenter is asking something a bit different: about the distribution of tasks rather than the success rate. My variant of this question: is your set of tasks supposed to be an unbiased sample of the tasks a knowledge worker faces, so that if I see a 50% success rate on 1-hour tasks I can read it as a 50% success rate on average across all of the tasks any knowledge worker faces?
Or is it more impressive than that because the tasks are selected to be moderately interesting, or less impressive because they're selected to be measurable, etc.?
External validity is a huge concern, so we don't claim anything as ambitious as average knowledge worker tasks. In one sentence, my opinion is that our task suite is fairly representative of well-defined, low-context, measurable software tasks that can be done without a GUI. More speculatively, horizons on this suite are probably within a large (~10x) constant factor of horizons on most other software tasks. We have a lot more discussion of this in the paper, especially in Section 7.2.1, "Systematic differences between our tasks and real tasks". The HCAST paper also has a better description of the dataset.
We didn’t try to make the dataset a perfectly stratified sample of tasks meeting that description, but there is enough diversity in the dataset that I’m much more concerned about relevance of HCAST-like tasks to real life than relevance of HCAST to the universe of HCAST-like tasks.
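(For anyone who wants to see concretely what "a 50% success rate on 1-hour tasks" cashes out to, here's a minimal sketch. All the numbers below are made up and the names are mine, but the shape is roughly how I'd summarize the horizon fit: fit a logistic curve of success probability against log task length, then read off the length at which the fitted curve crosses 50%.)

```python
# Hypothetical illustration (fabricated numbers): reading a 50% time horizon off
# success-rate data by fitting a logistic curve of success against log2(task length).
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_minutes, horizon_log, slope):
    # P(success) as a function of log2(task length in minutes);
    # the curve crosses 0.5 exactly at horizon_log.
    return 1.0 / (1.0 + np.exp(slope * (log_minutes - horizon_log)))

# Made-up aggregate success rates for tasks bucketed by human completion time.
task_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480])
success_rate = np.array([0.95, 0.90, 0.85, 0.70, 0.60, 0.50, 0.30, 0.15, 0.05])

params, _ = curve_fit(logistic, np.log2(task_minutes), success_rate,
                      p0=[np.log2(60), 1.0])
horizon = 2 ** params[0]
print(f"Estimated 50% time horizon: ~{horizon:.0f} minutes")
```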