I don’t think it’s weird. Given that we know there are temporal trends towards increasing parameter size (despite Chinchilla), FLOPs, data, and continued progress in compute/data-efficiency (with various experience curves), any simple temporal chart will tend to show an increase unless you are specifically conditioning or selecting in some way to neutralize that. Especially when you are drawing with a fat marker on a log plot. Only if you had measured and controlled for all of that and there were still a large unexplained residual of ‘time’ would you have to start reaching for other explanations such as ‘divine benevolence’. (For example, you might appeal to ‘temporal decay’: if you benchmark on a dataset composed only of new data in some way, then you would expect the oldest models to do the worst, and increasingly recent models to do better, even after controlling for all the factors you can think of—hey presto, a chart where the models mysteriously ‘get better over time’, even though if you had a time machine to benchmark each model at release in its own milieu, you’d find no trend.)
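To make the confounding point concrete, here is a minimal sketch (illustrative numbers only, not fit to any real benchmark data): capability is generated entirely from log training compute, compute scales up over time, and yet a naive capability-vs-time regression shows a clean ‘temporal’ trend, which disappears once compute is controlled for.

```python
# Toy simulation of the confounding argument: capability depends only on
# (log) training compute, compute grows over time, so a naive
# capability-vs-time regression shows a "temporal" trend that vanishes
# once compute is included as a regressor. Numbers are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n = 200
t = rng.uniform(0, 6, n)                                 # release-date offset, arbitrary units
log_compute = 0.8 * t + rng.normal(0, 0.3, n)            # compute scales up over time
capability = 2.0 * log_compute + rng.normal(0, 0.5, n)   # no direct effect of time at all

def ols_coefs(cols, y):
    """OLS coefficients for y ~ cols (intercept added, then dropped)."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

print("time-only slope:        ", ols_coefs([t], capability))               # ~1.6: looks like progress 'over time'
print("time + compute slopes:  ", ols_coefs([t, log_compute], capability))  # time coefficient ~0
```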
I buy this for the post-GPT-3.5 era. What’s confusing me is that the rate of advancement in the pre-GPT-3.5 era was apparently the same as in the post-GPT-3.5 era, i.e., doubling every 7 months.
Why would we expect there to be no distribution shift once the AI race kicked into high gear? GPT-2 to GPT-3 to GPT-3.5 proceeded at a snail’s pace by modern standards. How did the world happen to invest in them just enough for them to fit into the same trend?
Actually, progress in 2024 is roughly 2x faster than earlier progress, which seems consistent with thinking there is some distribution shift. It’s just that this distribution shift didn’t kick in until we had Anthropic competing with OpenAI and reasoning models. (Note that OpenAI didn’t release a notably better model than GPT-4-1106 until o1-preview!)
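As a quick back-of-the-envelope check on what these rates imply (the 3.5-month figure is just the hypothetical ‘2x faster’ regime, not a measured number):

```python
# A 7-month doubling time vs. a hypothetical "2x faster" 3.5-month doubling time,
# expressed as the cumulative growth factor over one year on the measured metric.
doubling_months = 7.0
fast_doubling_months = 3.5  # assumed "2x faster" 2024 regime

per_year = 2 ** (12 / doubling_months)            # ~3.3x per year
per_year_fast = 2 ** (12 / fast_doubling_months)  # ~10.8x per year

print(f"7-month doubling   -> {per_year:.1f}x per year")
print(f"3.5-month doubling -> {per_year_fast:.1f}x per year")
```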