I think this criticism is wrong: if it were true, the across-dataset correlation between time and LLM difficulty should be higher than the within-dataset correlation, but from eyeballing Figure 4 (page 10), it doesn't look much higher, if at all.
It is much higher. I'm not sure how (or whether) I can post images of the graph here, but the R^2 for SWAA alone is 0.27, for HCAST alone 0.48, and for RE-bench alone 0.01.
Also, the HCAST R^2 drops to 0.41 if you exclude the 21/97 data points where the human time source is an estimate. I'm not really sure why these are included in the paper; it seems bizarre to me to give them any credence.
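In case anyone wants to reproduce that kind of comparison, here's roughly what it looks like. This is a minimal sketch that assumes a per-task table with hypothetical column names (`task_source`, `human_minutes`, `success`) and uses a plain least-squares fit of success on log2(time); it is not necessarily the exact regression behind the numbers above.

```python
# Minimal sketch of the pooled vs. within-dataset R^2 comparison.
# The column names (task_source, human_minutes, success) are assumptions,
# not the actual schema of METR's released data.
import numpy as np
import pandas as pd

def r_squared(df: pd.DataFrame) -> float:
    """R^2 of a least-squares fit of success on log2(human minutes)."""
    x = np.log2(df["human_minutes"].to_numpy())
    y = df["success"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

runs = pd.read_csv("runs.csv")  # hypothetical per-task export

print("pooled R^2:", round(r_squared(runs), 2))
for source, group in runs.groupby("task_source"):  # e.g. SWAA, HCAST, RE-Bench
    print(source, round(r_squared(group), 2))
```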
I think “human time to complete” is a poor proxy for what they're actually measuring here, and a lot of the relationship is explained by which types of tasks are included at each time length. For example, doubling or quadrupling the time a human would need to write a script that transforms JSON data (by adding many more fields without making the fields much more complex) doesn't seem to affect success rates nearly as much as this paper would predict.
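For a sense of scale, here is a toy calculation of what a logistic-in-log-time fit of the kind used in the paper predicts when a task's human time doubles or quadruples. The horizon and slope values are made up for illustration, not fitted parameters from the paper.

```python
# Illustrative only: predicted success under a logistic model in log2(task time).
# horizon_minutes and slope are made-up numbers, not parameters from the paper.
import numpy as np

def p_success(task_minutes: float, horizon_minutes: float = 60.0, slope: float = 0.6) -> float:
    """P(success) = sigmoid(slope * (log2(horizon) - log2(task time)))."""
    z = slope * (np.log2(horizon_minutes) - np.log2(task_minutes))
    return 1.0 / (1.0 + np.exp(-z))

for minutes in (15, 30, 60):  # the JSON-transform script at 1x, 2x, and 4x human time
    print(f"{minutes:>3} min task: predicted success {p_success(minutes):.2f}")
```

With these illustrative numbers, doubling the task time drops the predicted success rate from about 0.77 to 0.65, and quadrupling it drops it to 0.50.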
Note that the REBench correlation definitionally has to be 0 because all tasks have the same length. SWAA similarly has range restriction, though not as severe.
Well, the REBench tasks don’t all have the same length, at least in the data METR is using. It’s all tightly clustered around 8 hours though, so I take your point that it’s not a very meaningful correlation.
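A quick synthetic illustration of the range-restriction point: the same underlying relationship between log-time and success yields a much smaller R^2 when task lengths are tightly clustered (as with RE-Bench around 8 hours) than when they span minutes to hours. All numbers below are made up.

```python
# Toy demonstration of range restriction with synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def r_squared_for(minutes: np.ndarray) -> float:
    x = np.log2(minutes)
    y = 1.5 - 0.15 * x + rng.normal(scale=0.3, size=x.size)  # same "true" slope plus noise
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

wide = np.exp(rng.uniform(np.log(1), np.log(960), size=200))     # 1 minute to 16 hours
clustered = rng.normal(loc=480, scale=30, size=200).clip(min=1)  # tightly clustered near 8 hours
print("wide range R^2:     ", round(r_squared_for(wide), 2))
print("clustered range R^2:", round(r_squared_for(clustered), 2))
```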
Thanks, that’s useful info!
I thought you could post images by dragging and dropping files into the comment box; I seem to recall doing that in the past, but now it doesn't seem to work for me. Maybe that only works for top-level posts?
Maybe you switched to the Markdown editor at some point. It still works in the (default) WYSIWYG editor.