Indeed. That seems incredibly weird. It would be one thing if it were a function of parameter size, or FLOPs, or data, or at least the money invested. But the release date?
The reasons why GPT-3, GPT-3.5, GPT-4, GPT-4o, Sonnet 3.6, and o1 each improved on the SOTA are all different from each other, ranging from “bigger scale” to “first RLHF’d model” to “first multimodal model” to “algorithmic improvements/better data” to “???” (Sonnet 3.6) to “first reasoning model”. And it’d be one thing if we could at least say that “for mysterious reasons, billion-dollar corporations trying incredibly hard to advance the frontier can’t do better than doubling the agency horizon every 7 months using any method”, but the GPTs from GPT-2 through GPT-3.5 were developed in a completely different socioeconomic situation! There were no AI race dynamics, the AGI companies were much poorer, etc. Yet they’re still part of the pattern.
This basically only leaves teleological explanations, which imply a divine plan for the rate of human technological advancement.
Which makes me suspect there’s some error in the data, or that the methodology was (accidentally) rigged to produce this result[1]. Or perhaps there’s a selection bias: tons of people were trying various ways to forecast AI progress, the methodologies that failed to produce a neat trend went unpublished, and we’re looking at one that chanced upon a spurious correlation.
Or I’m missing something obvious and it actually makes way more sense. Am I missing something obvious?
For example: Was the benchmarking of how long a given type of task takes a human done prior to evaluating AI models, or was it done simultaneously with figuring out which models can do which tasks? I’d assume the methodology was locked in first, but if not...
I don’t think it’s weird. Given that we know there are temporal trends towards increasing parameter size (despite Chinchilla), FLOPs, data, and continued progress in compute/data-efficiency (with various experience curves), any simple temporal chart will tend to show an increase unless you are specifically conditioning or selecting in some way to neutralize that. Especially when you are drawing with a fat marker on a log plot. Only if you had measured and controlled for all that and there was still a large unexplained residual of ‘time’ would you have to start reaching for other explanations such as ‘divine benevolence’. (For example, you might appeal to ‘temporal decay’: if you benchmark on a dataset of only new data, in some way, then you will expect the oldest models to do the worst and increasingly recent models to do better, even after controlling for all the factors you can think of—hey presto, a chart where the models mysteriously ‘get better over time’, even though if you had a time machine to benchmark each model at release in its own milieu, you’d find no trend.)
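To make that concrete, here’s a minimal synthetic sketch (every quantity below is invented purely for illustration, not taken from the actual benchmark data): if compute grows steadily with time and capability is driven by compute, a regression of capability on release date alone shows a clean trend, while the ‘time’ coefficient mostly disappears once compute is included as a control.

```python
# Synthetic illustration only: all constants below are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 40
years = rng.uniform(2019, 2025, n)                          # hypothetical release dates
log_compute = 1.5 * (years - 2019) + rng.normal(0, 0.3, n)  # compute grows steadily with time
log_horizon = 0.8 * log_compute + rng.normal(0, 0.2, n)     # capability driven by compute, not by time

def ols(columns, y):
    # ordinary least squares with an intercept; returns [intercept, slopes...]
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

naive = ols([years], log_horizon)                    # regress on release date only
controlled = ols([years, log_compute], log_horizon)  # add compute as a control

print(f"time-only slope:               {naive[1]:.2f} per year")
print(f"slope controlling for compute: {controlled[1]:.2f} per year")
```

The time-only slope comes out large, while the controlled slope hovers near zero, which is the sense in which a neat temporal trend, on its own, doesn’t demand any special explanation.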
I buy this for the post-GPT-3.5 era. What’s confusing me is that the rate of advancement in the pre-GPT-3.5 era was apparently the same as in the post-GPT-3.5 era, i.e., doubling every 7 months.
Why would we expect there to be no distribution shift once the AI race kicked into high gear? GPT-2 to GPT-3 to GPT-3.5 proceeded at a snail’s pace by modern standards. How did the world happen to invest in them just enough for them to fit into the same trend?
Actually, progress in 2024 is roughly 2x faster than earlier progress, which seems consistent with thinking there is some distribution shift. It’s just that this distribution shift didn’t kick in until we had Anthropic competing with OpenAI and reasoning models. (Note that OpenAI didn’t release a notably better model than GPT-4-1106 until o1-preview!)
My sense is that the GPT-2 and GPT-3 results are somewhat dubious, especially the GPT-2 result. It really depends on how you relate SWAA (small software engineering subtasks) to the rest of the tasks. My understanding is that no iteration was done, though.
However, note that GPT-3 wouldn’t be wildly more off trend if its horizon were anywhere from 4 to 30 seconds than it is at its actual ~8 seconds. And the GPT-2 results are very consistent with “almost too low to measure”.
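A back-of-the-envelope way to see this (just a sketch, using the ~7-month doubling time and the ~8-second figure from the thread): mismeasuring a horizon by a factor of k only slides the point along the fitted line by 7·log2(k) months, so the whole 4–30 second range stays within roughly a year of where ~8 seconds puts it.

```python
import math

doubling_months = 7            # doubling time of the reported trend
measured_s = 8                 # GPT-3's measured horizon, roughly 8 seconds
for horizon_s in (4, 8, 30):   # the range discussed above
    offset = doubling_months * math.log2(horizon_s / measured_s)
    print(f"{horizon_s:>2} s  ->  {offset:+.1f} months along the trend")
```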
Overall, I don’t think it’s incredibly weird (given that the rate of increase in compute and people over 2019-2023 isn’t that different from the rate in 2024), and in any case many possible results would have been roughly on trend.