How does this account for the difficulty of the tasks? AFAIK even reasoning models still struggle with matrix reasoning. And most matrix puzzles (even difficult ones) are something you can do in 15-30 seconds, occasionally 4-5 minutes for sufficiently challenging ones. But even in those cases you usually figure out what to look for in the first 30-60 seconds and spend the rest of the time on drudgery.
So current agents might be capable of the 1-minute task “write a hello world program,” while not being capable of the 1-minute task “solve the final puzzle on Mensa DK.”
And if that’s the case, then agents might be capable of routine long-horizon tasks in the future (whatever that means), while still being incapable of more OOD achievements like writing “Attention Is All You Need.”
What am I missing?
My guess:
This is about software tasks, or specifically “well-defined, low-context, measurable software tasks that can be done without a GUI.” It doesn’t directly generalize to solving puzzles or writing important papers. It probably does generalize within that narrow category.
If this were trying to measure all tasks, then tasks that AIs can’t do would count toward the failure rate; the headline graph plots the task length at which the model reaches a 50% success rate, not the length it can do with 100% reliability. If we were worried that this is misleading because AIs are differentially bad at crucial tasks or something, we could look at the success rate on those tasks specifically. (A rough sketch of how a 50% horizon can be estimated is below.)
This is an uncharitable interpretation, but “good at increasingly long tasks that require no real cleverness” seems economically valuable without necessarily leading to what I think of as superintelligence.
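For concreteness, here is a minimal sketch of how a 50%-success horizon of this kind can be estimated: fit a logistic curve of success probability against the log of human task length, then read off the length where the curve crosses 0.5. The data and the choice of scipy's `curve_fit` are my own illustrative assumptions, not the actual METR pipeline.

```python
# Minimal sketch (not METR's actual code): estimate the task length at which
# an agent's predicted success probability is 50%. Data below is made up.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical per-task data: human completion time (minutes) and whether the
# agent succeeded on that task (1 = success, 0 = failure).
task_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
agent_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

def logistic(log_t, midpoint, slope):
    """P(success) as a function of log task length; 0.5 at log_t == midpoint."""
    return 1.0 / (1.0 + np.exp(slope * (log_t - midpoint)))

# Least-squares fit on binary outcomes is crude but fine for illustration.
(midpoint, slope), _ = curve_fit(
    logistic, np.log(task_minutes), agent_success,
    p0=[np.log(30.0), 1.0], maxfev=10000,
)

# The 50% horizon is where the fitted curve crosses 0.5, i.e. exp(midpoint).
horizon_minutes = np.exp(midpoint)
print(f"Estimated 50%-success horizon: {horizon_minutes:.1f} minutes")
```

Note that a horizon computed this way says nothing about how hard the sub-minute tasks were, which is exactly the worry in the question above.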