My guess:
This is about software tasks — specifically, “well-defined, low-context, measurable software tasks that can be done without a GUI.” It doesn’t directly generalize to solving puzzles or writing important papers, but it probably does generalize within that narrow category.
If this were trying to measure all tasks, tasks that AIs can’t do at all would count toward the failure rate; note that the main graph tracks a 50% success rate, not 100%. If we were worried that this is misleading because AIs are differentially bad at crucial tasks, we could look at the success rate on those tasks specifically.
An uncharitable interpretation: being “good at increasingly long tasks which require no real cleverness” seems economically valuable, but doesn’t seem to lead to what I think of as superintelligence.