I think the most mysterious part of this trend is that the x-axis is release date. Very useful but mysterious.
Indeed. That seems incredibly weird. It would be one thing if it were a function of parameter size, or FLOPs, or data, or at least the money invested. But the release date?
The reasons why GPT-3, GPT-3.5, GPT-4o, Sonnet 3.6, and o1 improved on the SOTA are all different from each other, ranging from “bigger scale” to “first RLHF’d model” to “first multimodal model” to “algorithmic improvements/better data” to “???” (Sonnet 3.6) to “first reasoning model”. And it’d be one thing if we could at least say that “for mysterious reasons, billion-dollar corporations trying incredibly hard to advance the frontier can’t do better than doubling the agency horizon every 7 months using any method”, but GPT-2 through GPT-3.5 were developed in a completely different socioeconomic situation! There were no AI race dynamics, AGI companies were much poorer, etc. Yet they’re still part of the pattern.
This basically only leaves teleological explanations, which implies a divine plan for the rate of human technological advancement.
Which makes me suspect there’s some error in the data, or that the methodology was (accidentally) rigged to produce this result[1]. Or perhaps there’s a selection bias: tons of people were trying various ways to forecast AI progress, the methodologies that failed to produce a neat trend weren’t published, and we’re looking at one that chanced upon a spurious correlation.
Or I’m missing something obvious and it actually makes way more sense. Am I missing something obvious?
For example: Was the benchmarking of how long a given type of task takes a human done prior to evaluating AI models, or was it done simultaneously with figuring out which models can do which tasks? I’d assume the methodology was locked in first, but if not...
I don’t think it’s weird. Given that we know there are temporal trends towards increasing parameter size (despite Chinchilla), FLOPs, and data, and continued progress in compute/data-efficiency (with various experience curves), any simple temporal chart will tend to show an increase unless you are specifically conditioning or selecting in some way to neutralize that. Especially when you are drawing with a fat marker on a log plot. Only if you had measured and controlled for all that, and there was still a large unexplained residual of ‘time’, would you have to start reaching for other explanations such as ‘divine benevolence’. (For example, you might appeal to ‘temporal decay’: if you benchmark on a dataset of only new data, in some way, then you will expect the oldest models to do the worst, and increasingly recent models to do better, even after controlling for all the factors you can think of; hey presto, a chart where the models mysteriously ‘get better over time’, even though if you had a time machine to benchmark each model at release in its own milieu, you’d find no trend.)
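As a toy numerical sketch of the point above (all numbers below are made up for illustration; this is not METR’s data or methodology): if the benchmark score is really driven by training compute, and compute grows roughly exponentially with calendar time, then a plot against release date comes out as a straight line on a log axis “for free”.

```python
import math
import random

random.seed(0)

# Hypothetical models: release dates (years since some start date), drawn at random.
release_years = sorted(random.uniform(0, 6) for _ in range(12))

# Assume training compute grows exponentially with calendar time
# (say ~4x per year), with some scatter between labs.
log_compute = [year * math.log(4) + random.gauss(0, 0.5) for year in release_years]

# Assume the measured task horizon is a power law in compute
# (made-up exponent), i.e. linear in log space.
log_horizon = [0.6 * lc + random.gauss(0, 0.3) for lc in log_compute]

# Horizon vs. release date then comes out (noisily) linear on a log axis,
# even though calendar time itself does no causal work in this toy model.
for year, log_h in zip(release_years, log_horizon):
    print(f"t={year:4.1f}y  horizon={math.exp(log_h):8.2f} (arbitrary units)")
```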
I buy this for the post-GPT-3.5 era. What’s confusing me is that the rate of advancement in the pre-GPT-3.5 era was apparently the same as in the post-GPT-3.5 era, i.e., doubling every 7 months.
Why would we expect there to be no distribution shift once the AI race kicked into high gear? GPT-2 to GPT-3 to GPT-3.5 proceeded at a snail’s pace by modern standards. How did the world happen to invest in them just enough for them to fit into the same trend?
Actually, progress in 2024 is roughly 2x faster than earlier progress, which seems consistent with thinking there is some distribution shift. It’s just that this distribution shift didn’t kick in until we had Anthropic competing with OpenAI and reasoning models. (Note that OpenAI didn’t release a notably better model than GPT-4-1106 until o1-preview!)
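For concreteness, a minimal sketch of the arithmetic (the ~7-month doubling time is the figure under discussion; reading “roughly 2x faster” as a halved doubling time is an assumption of this sketch, not METR’s own fit):

```python
# Long-run trend discussed above: the task horizon doubles roughly every 7 months.
long_run_doubling_months = 7.0

# "Roughly 2x faster" progress in 2024, read as a halved doubling time (assumption).
fast_doubling_months = long_run_doubling_months / 2  # ~3.5 months

def growth_over(months: float, doubling_months: float) -> float:
    """Multiplicative growth in task horizon over `months`."""
    return 2 ** (months / doubling_months)

print(growth_over(12, long_run_doubling_months))  # ~3.3x per year at the long-run rate
print(growth_over(12, fast_doubling_months))      # ~10.8x per year at the 2024 rate
```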
My sense is that the GPT-2 and GPT-3 results are somewhat dubious, especially the GPT-2 result. It really depends on how you relate SWAA (small software engineering subtasks) to the rest of the tasks. My understanding is that no iteration was done though.
However, note that it wouldn’t be wildly more off trend if GPT-3 were anywhere from 4 to 30 seconds; it is instead at ~8 seconds. And the GPT-2 results are very consistent with “almost too low to measure”.
Overall, I don’t think it’s incredibly weird (given that the rate of increase of compute and people in 2019-2023 isn’t that different from the rate in 2024), but also note that many possible results would have been roughly on trend.
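To make the “roughly on trend” point concrete, here is a small sketch of how far off trend different GPT-3 measurements would sit. Deviations on a log plot are naturally counted in doublings; the trend value used here is a hypothetical placeholder, not the fitted prediction from the paper.

```python
import math

measured_gpt3_seconds = 8.0        # the ~8 s value discussed above
hypothetical_trend_seconds = 10.0  # placeholder for the fitted trend prediction

def doublings_off_trend(observed: float, predicted: float) -> float:
    """Distance from the trend line, measured in doublings (log2 of the ratio)."""
    return math.log2(observed / predicted)

for seconds in (4.0, measured_gpt3_seconds, 30.0):
    offset = doublings_off_trend(seconds, hypothetical_trend_seconds)
    print(f"{seconds:5.1f} s -> {offset:+.2f} doublings off trend")
```

Anywhere in the 4-30 second range, the point moves by only a couple of doublings, which is why a fairly wide range of measurements would still look roughly on trend.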
Do you think the x-axis being a release date is more mysterious than the same fact regarding Moore’s law?
(To be clear, I don’t think this makes it less mysterious: Moore’s law also seems like a mystery to me. But the analogy makes it more plausible that there is a mysterious but true reason driving such trends, rather than the graph from METR simply being a weird coincidence.)
Hm, that’s a very good point.
I think the amount of money and talent invested in the semiconductor industry has been much more stable than in AI, though, no? Not constant, but growing steadily with the population/economy/etc. In addition, Moore’s law being so well-known potentially makes it a self-fulfilling prophecy, with the industry making it a target to aim for.
Also, have you tracked the previous discussion on Old Scott Alexander and LessWrong about “mysterious straight lines” being a surprisingly common phenomenon in economics? E.g., on an old AI post Oli noted:

This is one of my major go-to examples of this really weird linear phenomenon:

150 years of a completely straight line! There were two world wars in there, the development of artificial fertilizer, the broad industrialization of society, the invention of the car. And all throughout, the line just carries on, with no significant perturbations.

(This doesn’t mean we should automatically take new proposed Straight Line Phenomena at face value; I don’t actually know if this is more like “pretty common actually” or “there are a few notable times it was true that are drawing undue attention.” But I’m at least not like “this is a never-before-seen anomaly.”)
That surprisingly straight line reminds me of what happens when you use noise to regularise an otherwise decidedly non-linear function: https://www.imaginary.org/snapshot/randomness-is-natural-an-introduction-to-regularisation-by-noise
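A toy sketch of that general idea, smoothing a bumpy function by averaging it over Gaussian noise (just an illustration of the phenomenon, not the construction used in the linked article):

```python
import math
import random

random.seed(0)

def bumpy(x: float) -> float:
    """A decidedly non-linear function: a straight line plus large oscillations."""
    return x + 2.0 * math.sin(3.0 * x)

def noise_smoothed(f, x: float, sigma: float = 1.0, samples: int = 20000) -> float:
    """Monte Carlo estimate of E[f(x + sigma * Z)] with Z standard normal."""
    return sum(f(x + sigma * random.gauss(0, 1)) for _ in range(samples)) / samples

# Averaging over noise flattens the oscillations: the smoothed values track
# the underlying straight line y = x quite closely.
for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"x={x:3.1f}  bumpy={bumpy(x):6.2f}  smoothed={noise_smoothed(bumpy, x):6.2f}")
```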
Kurzweil (and gwern, in a cousin comment) both think that “effort will be allocated efficiently over time”, and for Kurzweil this explained much, much more than just Moore’s Law.
Ray’s charts from “the olden days” (the nineties and aughties and so on) were normalized around what “1000 (inflation-adjusted) dollars spent on mechanical computing” could buy… and this let him put vacuum tubes and even steam-powered gear-based computers on a single chart… and it still worked.
The 2020s have basically always been very likely to be crazy. Based on my familiarity with old ML/AI systems and standards, the bar that “AGI” denoted a decade ago has already been reached. Claude is already smarter than most humans, but (from the perspective of what smart, numerate, and reasonable people predicted in 2009) he is (arguably) over budget and behind schedule.