Daniel Kokotajlo comments on METR: Measuring AI Ability to Complete Long Tasks

Daniel Kokotajlo 19 Mar 2025 17:53 UTC
LW: 109 AF: 37
16
AF
This is probably the most important single piece of evidence about AGI timelines right now. Well done! I think the trend should be superexponential, e.g. each doubling takes 10% less calendar time on average. Eli Lifland and I did some calculations yesterday suggesting that this would get to AGI in 2028. Will do more serious investigation soon.

Why do I expect the trend to be superexponential? Well, it seems like it sorta has to go superexponential eventually. Imagine: We’ve got to AIs that can with ~100% reliability do tasks that take professional humans 10 years. But somehow they can’t do tasks that take professional humans 160 years? And it’s going to take 4 more doublings to get there? And these 4 doublings are going to take 2 more years to occur? No, at some point you “jump all the way” to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.

Also, zooming in mechanistically on what’s going on, insofar as an AI system can do tasks below length X but not above length X, it’s gotta be for some reason—some skill that the AI lacks, which isn’t important for tasks below length X but which tends to be crucial for tasks above length X. But there are only a finite number of skills that humans have that AIs lack, and if we were to plot them on a horizon-length graph (where the x-axis is log of horizon length, and each skill is plotted on the x-axis where it starts being important, such that it’s not important to have for tasks less than that length) the distribution of skills by horizon length would presumably taper off, with tons of skills necessary for pretty short tasks, a decent amount necessary for medium tasks (but not short), and a long thin tail of skills that are necessary for long tasks (but not medium), a tail that eventually goes to 0, probably around a few years on the x-axis. So assuming AIs learn skills at a constant rate, we should see acceleration rather than a constant exponential. There just aren’t that many skills you need to operate for 10 days that you don’t also need to operate for 1 day, compared to how many skills you need to operate for 1 hour that you don’t also need to operate for 6 minutes.

There are two other factors worth mentioning which aren’t part of the above: One, the projected slowdown in capability advances that’ll come as compute and data scaling falters due to becoming too expensive. And two, pointing in the other direction, the projected speedup in capability advances that’ll come as AI systems start substantially accelerating AI R&D.
What links here?
- Petropolitan 19 Mar 2025 23:09 UTC
  22 points
  0
  Parent
  One of non-obvious but very important skills which all LLM-based SWE agents currently lack is reliably knowing which subtasks of a task you have successfully solved and which you have not. I think https://www.answer.ai/posts/2025-01-08-devin.html is a good case in point.
  We have absolutely seen a lot of progress on driving down hallucinations on longer and longer contexts with model scaling, they probably made the charts above possible in the first place. However, recent research (e. g., the NoLiMa benchmark from last month https://arxiv.org/html/2502.05167v1) demonstrates that effective context length falls far short of what is advertised. I assume it’s not just my personal experience but common knowledge among the practitioners that hallucinations become worse the more text you feed to an LLM.
  If I’m not mistaken even with all the optimizations and “efficient” transformer attempts we are still stuck (since GPT-2 at least) with self-attention + KV-cache^[1] which scale (at inference) linearly as long as you haven’t run out of memory and quadratically afterwards. Sure, MLA have just massively ramped up the context length at which the latter happens but it’s not unlimited, you won’t be able to cache, say, one day of work (especially since DRAM has not been scaling exponentially for years https://semianalysis.substack.com/p/the-memory-wall).
  People certainly will come up with ways to optimize long-context performance further, but it doesn’t have to continue scaling in the same way it has since 2019.
  1. ^
    Originally known as “past cache” after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019, see commit ffd6238. The invention has not been described in the literature AFAIK, and it’s entirely possible (maybe even likely) that closed-source implementations of earlier decoder-only transformers used the same trick before this
  What links here?
  - nostalgebraist 26 Mar 2025 17:59 UTC
    6 points
    0
    Parent
    Originally known as “past cache” after the tensor name apparently coined by Thomas Wolf for the transformers library in February 2019, see commit ffd6238. The invention has not been described in the literature AFAIK, and it’s entirely possible (maybe even likely) that closed-source implementations of earlier decoder-only transformers used the same trick before this
    KV caching (using the terminology “fast decoding” and “cache”) existed even in the original “Attention is All You Need” implementation of an enc-dec transformer. It was added on Sep 21 2017 in this commit. (I just learned this today, after I read your comment and got curious.)
    The “past” terminology in that original transformers implementation of GPT-2 was not coined by Wolf – he got it from the original OpenAI GPT-2 implementation, see here.
- Trinley Goldenberg 19 Mar 2025 18:02 UTC
  LW: 19 AF: 5
  8
  AF Parent
  I’m not at all convinced it has to be something discrete like “skills” or “achieved general intelligence”.
  There are many continuous factors that I can imagine that help planning long tasks.
  - J Bostock 19 Mar 2025 18:36 UTC
    11 points
    0
    Parent
    I second this, it could easily be things which we might describe as “amount of information that can be processed at once, including abstractions” which is some combination of residual stream width and context length.
    Imagine an AI can do a task that takes 1 hour. To remain coherent over 2 hours, it could either use twice as much working memory, or compress it into a higher level of abstraction. Humans seem to struggle with abstraction in a fairly continuous way (some people get stuck at algebra; some cs students make it all the way to recursion then hit a wall; some physics students can handle first quantization but not second quantization) which sorta implies there’s a maximum abstraction stack height which a mind can handle, which varies continuously.
    - Stephen Fowler 26 Mar 2025 1:20 UTC
      8 points
      4
      Parent
      While each mind might have a maximum abstraction height, I am not convinced that the inability of people to deal with increasingly complex topics is direct evidence of this.
      Is it that this topic is impossible for their mind to comprehend, or is it that they’ve simple failed to learn it in the finite time period they were given?
      - J Bostock 26 Mar 2025 10:14 UTC
        2 points
        0
        Parent
        That might be true but I’m not sure it matters. For an AI to learn an abstraction it will have a finite amount of training time, context length, search space width (if we’re doing parallel search like with o3) etc. and it’s not clear how the abstraction height will scale with those.
        Empirically, I think lots of people feel the experience of “hitting a wall” where they can learn abstraction level n-1 easily from class; abstraction level n takes significant study/help; abstraction level n+1 is not achievable for them within reasonable time. So it seems like the time requirement may scale quite rapidly with abstraction level?
  - Daniel Kokotajlo 19 Mar 2025 20:32 UTC
    LW: 3 AF: 2
    0
    AF Parent
    I’m not sure if I understand what you are saying. It sounds like you are accusing me of thinking that skills are binary—either you have them or you don’t. I agree, in reality many skills are scalar instead of binary; you can have them to greater or lesser degrees. I don’t think that changes the analysis much though.
    - Trinley Goldenberg 20 Mar 2025 1:43 UTC
      5 points
      1
      Parent
      length X but not above length X, it’s gotta be for some reason—some skill that the AI lacks, which isn’t important for tasks below length X but which tends to be crucial for tasks above length X.
      My point is, maybe there are just many skills that are at 50% of human, then go up to 60%, then 70%, etc, and can keep going up linearly to 200% or 300%. It’s not like it lacked the skill then suddenly stopped lacking it, it just got better and better at it
      - Daniel Kokotajlo 20 Mar 2025 15:36 UTC
        2 points
        −1
        Parent
        I agree with that, in fact I think that’s the default case. I don’t think it changes the bottom line, just makes the argument more complicated.
        Trinley Goldenberg 20 Mar 2025 17:50 UTC
        6 points
        4
        Parent
        I don’t see how the original argument goes through if it’s by default continuous.
- jsteinhardt 20 Mar 2025 6:17 UTC
  LW: 16 AF: 8
  −4
  AF Parent
  Doesn’t the trend line already take into account the effect you are positing? ML research engineers already say they get significant and increasing productivity boosts from AI assistants and have been for some time. I think the argument you are making is double-counting this. (Unless you want to argue that the kink with Claude is the start of the super-exponential, which we would presumably get data on pretty soon).
  - Daniel Kokotajlo 20 Mar 2025 15:38 UTC
    LW: 12 AF: 9
    1
    AF Parent
    I indeed think that AI assistance has been accelerating AI progress. However, so far the effect has been very small, like single-digit percentage points. So it won’t be distinguishable in the data from zero. But in the future if trends continue the effect will be large, possibly enough to more than counteract the effect of scaling slowing down, possibly not, we shall see.
    - jsteinhardt 20 Mar 2025 23:49 UTC
      LW: 8 AF: 4
      −8
      AF Parent
      Research engineers I talk to already report >3x speedups from AI assistants. It seems like that has to be enough that it would be showing up in the numbers. My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, and this is already being taken into account in the curves, and without this effect you would see a slower (though imo not massively slower) exponential.
      (This would argue for dropping the pre-2022 models from the graph which I think would give slightly faster doubling times, on the order of 5-6 months if I had to eyeball.).
      - habryka 21 Mar 2025 0:15 UTC
        LW: 22 AF: 13
        27
        AF Parent
        Research engineers I talk to already report >3x speedups from AI assistants
        Huh, I would be extremely surprised by this number. I program most days, in domains where AI assistance is particularly useful (frontend programming with relatively high churn), and I am definitely not anywhere near 3x total speedup. Maybe a 1.5x, maybe a 2x on good weeks, but definitely not a 3x. A >3x in any domain would be surprising, and my guess is generalization for research engineer code (as opposed to churn-heavy frontend development) is less.
        What links here?
        elifland's comment on METR: Measuring AI Ability to Complete Long Tasks by Zach Stein-Perlman (21 Mar 2025 16:05 UTC; 12 points)
        Ben Pace 21 Mar 2025 1:12 UTC
        LW: 5 AF: 4
        1
        AF Parent
        I think my front-end productivity might be up 3x? A shoggoth helped me building a stripe shop and do a ton of UI design that I would’ve been hesitant to take on myself (without hiring someone else to work with), as well as quality increase in speed of churning through front-end designs.
        (This is going from “wouldn’t take on the project due to low skill” to “can take it on and deliver it in a reasonable amount of time”, which is different from “takes top programmer and speeds them up 3x”.)
      - elifland 21 Mar 2025 16:05 UTC
        LW: 12 AF: 7
        12
        AF Parent
        I agree with habryka that the current speedup is probably substantially less than 3x.
        However, worth keeping in mind that if it were 3x for engineering the overall AI progress speedup would be substantially lower, due to (a) non-engineering activities having a lower speedup, (b) compute bottlenecks, (c) half of the default pace of progress coming from compute.
        My null hypothesis would be that programmer productivity is increasing exponentially and has been for ~2 years, and this is already being taken into account in the curves, and without this effect you would see a slower (though imo not massively slower) exponential
        Exponential growth alone doesn’t imply a significant effect here, if the current absolute speedup is low.
      - Daniel Kokotajlo 21 Mar 2025 23:58 UTC
        LW: 10 AF: 8
        9
        AF Parent
        I don’t believe it. I don’t believe that overall algorithmic progress is 3x faster. Maaaybe coding is 3x faster but that would maybe increase overall algo progress by like 30% idk. But also I don’t think coding is really 3x faster on average for the things that matter.
        jsteinhardt 22 Mar 2025 17:01 UTC
        LW: 5 AF: 3
        2
        AF Parent
        I meant coding in particular, I agree algorithmic progress is not 3x faster. I checked again just now with someone and they did indeed report 3x speedup for writing code, although said that the new bottleneck becomes waiting for experiments to run (note this is not obviously something that can be solved by greater automation, at least up until the point that AI is picking better experiments than humans).
- osten 19 Mar 2025 21:18 UTC
  9 points
  6
  Parent
  Ok, but why do you think that AIs learn skills at a constant rate? Might it be that higher level skills need more time to learn because compute scales exponentially with time but for higher level skills data is exponentially more scarce and needs linearly in task length more context, that is, total data processed scales superexponentially with task level?
- Thomas Kwa 30 Mar 2025 20:38 UTC
  LW: 8 AF: 5
  0
  AF Parent
  I basically agree with this. The reason the paper didn’t include this kind of reasoning (only a paragraph about how AGI will have infinite horizon length) is we felt that making a forecast based on a superexponential trend would be too much speculation for an academic paper. (There is really no way to make one without heavy reliance on priors; does it speed up by 10% per doubling or 20%?) It wasn’t necessary given the 2027 and 2029-2030 dates for 1-month AI derived from extrapolation already roughly bracketed our uncertainty.
- Anthony DiGiovanni 26 Mar 2025 5:20 UTC
  LW: 7 AF: 4
  2
  AF Parent
  No, at some point you “jump all the way” to AGI
  I’m confused as to what the actual argument for this is. It seems like you’ve just kinda asserted it. (I realize in some contexts all you can do is offer an “incredulous stare,” but this doesn’t seem like the kind of context where that suffices.)
  I’m not sure if the argument is supposed to be the stuff you say in the next paragraph (if so, the “Also” is confusing).
  - Daniel Kokotajlo 26 Mar 2025 6:59 UTC
    LW: 6 AF: 4
    0
    AF Parent
    Great question. You are forcing me to actually think through the argument more carefully. Here goes:
    Suppose we defined “t-AGI” as “An AI system that can do basically everything that professional humans can do in time t or less, and just as well, while being cheaper.” And we said AGI is an AI that can do everything at least as well as professional humans, while being cheaper.
    
    Well, then AGI = t-AGI for t=infinity. Because for anything professional humans can do, no matter how long it takes, AGI can do it at least as well.
    
    Now, METR’s definition is different. If I understand correctly, they made a dataset of AI R&D tasks, had humans give a baseline for how long it takes humans to do the tasks, and then had AIs do the tasks and found this nice relationship where AIs tend to be able to do tasks below time t but not above, for t which varies from AI to AI and increases as the AIs get smarter.
    
    ...I guess the summary is, if you think about horizon lengths as being relative to humans (i.e. the t-AGI definition above) then by definition you eventually “jump all the way to AGI” when you strictly dominate humans. But if you think of horizon length as being the length of task the AI can do vs. not do (*not* “as well as humans,” just “can do at all”) then it’s logically possible for horizon lengths to just smoothly grow for the next billion years and never reach infinity.
    
    So that’s the argument-by-definition. There’s also an intuition pump about the skills, which also was a pretty handwavy argument, but is separate.
    - nostalgebraist 26 Mar 2025 19:26 UTC
      LW: 31 AF: 15
      2
      AF Parent
      ICYMI, the same argument appears in the METR paper itself, in section 8.1 under “AGI will have ‘infinite’ horizon length.”
      The argument makes sense to me, but I’m not totally convinced.
      In METR’s definition, they condition on successful human task completion when computing task durations. This choice makes sense in their setting for reasons they discuss in B.1.1, but things would get weird if you tried to apply it to extremely long/hard tasks.
      If a typical time-to-success for a skilled human at some task is ~10 years, then the task is probably so ambitious that success is nowhere near guaranteed at 10 years, or possibly even within that human’s lifetime^[1]. It would understate the difficulty of the task to say it “takes 10 years for a human to do it”: the thing that takes 10 years is an ultimately successful human attempt, but most human attempts would never succeed at all.
      As a concrete example, consider “proving Fermat’s Last Theorem.” If we condition on task success, we have a sample containing just one example, in which a human (Andrew Wiles) did it in about 7 years. But this is not really “a task that a human can do in 7 years,” or even “a task that a human mathematician can do in 7 years” – it’s a task that took 7 years for Andrew Wiles, the one guy who finally succeeded after many failed attempts by highly skilled humans^[2].
      If an AI tried to prove or disprove a “comparably hard” conjecture and failed, it would be strange to say that it “couldn’t do things that humans can do in 7 years.” Humans can’t reliably do such things in 7 years; most things that take 7 years (conditional on success) cannot be done reliably by humans at all, for the same reasons that they take so long even in successful attempts. You just have to try and try and try and… maybe you succeed in a year, maybe in 7, maybe in 25, maybe you never do.
      So, if you came to me and said “this AI has a METR-style 50% time horizon of 10 years,” I would not be so sure that your AI is not an AGI.
      In fact, I think this probably would be an AGI. Think about what the description really means: “if you look at instances of successful task completion by humans, and filter to the cases that took 10 years for the successful humans to finish, the AI can succeed at 50% of them.” Such tasks are so hard that I’m not sure the human success rate is above 50%, even if you let the human spend their whole life on it; for all I know the human success rate might be far lower. So there may not be any well-defined thing left here that humans “can do” but which the AI “cannot do.”
      On another note, (maybe this is obvious but) if we do think that “AGI will have infinite horizon length” then I think it’s potentially misleading to say this means growth will be superexponential. The reason is that there are two things this could mean:
      “Based on my ‘gears-level’ model of AI development, I have some reason to believe this trend will accelerate beyond exponential in the future, due to some ‘low-level’ factors I know about independently from this discussion”
      “The exponential trend can never reach AGI, but I personally think we will reach AGI at some point, therefore the trend must speed up”
      I originally read it as 1, which would be a reason for shortening timelines: however “fast” things were from this METR trend alone, we have some reason to think they’ll get “even faster.” However, it seems like the intended reading is 2, and it would not make sense to shorten your timeline based on 2. (If someone thought the exponential growth was “enough for AGI,” then the observation in 2 introduces an additional milestone that needs to be crossed on the way to AGI, and their timeline should lengthen to accommodate it; if they didn’t think this then 2 is not news to them at all.)
      ^
      I was going to say something more here about the probability of success within the lifetimes of the person’s “intellectual heirs” after they’re dead, as a way of meaningfully defining task lengths once they’re >> 100 years, but then I realized that this introduces other complications because one human may have multiple “heirs” and that seems unfair to the AI if we’re trying to define AGI in terms of single-human performance. This complication exists but it’s not the one I’m trying to talk about in my comment...
      ^
      The comparison here is not really fair since Wiles built on a lot of work by earlier mathematicians – yet another conceptual complication of long task lengths that is not the one I’m trying to make a point about here.
      - Daniel Kokotajlo 26 Mar 2025 19:37 UTC
        LW: 8 AF: 7
        3
        AF Parent
        I found this comment helpful, thanks!
        
        The bottom line is basically “Either we definite horizon length in such a way that the trend has to be faster than exponential eventually (when we ‘jump all the way to AGI’) or we define it in such a way that some unknown finite horizon length matches the best humans and thus counts as AGI.”
        
        I think this discussion has overall made me less bullish on the conceptual argument and more interested in the intuition pump about the inherent difficulty of going from 1 to 10 hours being higher than the inherent difficulty of going from 1 to 10 years.
- Mo Putera 20 Mar 2025 14:56 UTC
  4 points
  2
  Parent
  Ben West’s remark in the METR blog post seems to suggest you’re right that the doubling period is shortening:
  … there are reasons to think that recent trends in AI are more predictive of future performance than pre-2024 trends. As shown above, when we fit a similar trend to just the 2024 and 2025 data, this shortens the estimate of when AI can complete month-long tasks with 50% reliability by about 2.5 years.
- satwik 19 Mar 2025 18:34 UTC
  3 points
  0
  Parent
  Any slowdown seems implausible given Anthropic timelines, which I consider to be a good reason to be skeptical of data and compute cost-related slowdowns at least until nobel-prize level. Moreover, the argument that we will very quickly get 15 OOMs or whatever of effective compute after the models can improve themselves is also very plausible
- Logan Zoellner 5 Apr 2025 13:51 UTC
  2 points
  0
  Parent
  Why do I expect the trend to be superexponential? Well, it seems like it sorta has to go superexponential eventually. Imagine: We’ve got to AIs that can with ~100% reliability do tasks that take professional humans 10 years. But somehow they can’t do tasks that take professional humans 160 years?
  I don’t think this means the real thing has to go hyper-exponential, just that “how long does it take humans to do a thing?” is a good metric when AI is sub-human but a poor one when AI is superhuman.
  
  If we had a metric “how many seconds/turn does a grandmaster have to think to beat the current best chess-playing AI”, it would go up at a nice steady rate until shortly after DeepBlue at which point it shoots to infinity. But if we had a true measurement of chess quality, we wouldn’t see any significant spike at the human-level.
- Siebe 19 Mar 2025 19:20 UTC
  2 points
  −15
  Parent
  One way to operationalize “160 years of human time” is “thing that can be achieved by a 160-person organisation in 1 year”, which seems like it would make sense?
  - ErioirE 19 Mar 2025 21:02 UTC
    8 points
    2
    Parent
    Unfortunately, when dealing with tasks such as software development it is nowhere near as linear as that.
    
    The meta-tasks of each additional dev needing to be brought up to speed on the intricacies of the project, as well as lost efficiency from poor communication/waiting on others to finish things means you usually get diminishing (or even inverse) returns from adding more people to the project.
    See: The Mythical Man Month
  - Mo Putera 20 Mar 2025 14:53 UTC
    1 point
    0
    Parent
    Not if some critical paths are irreducibly serial.
  - Rachel Shu 20 Mar 2025 13:00 UTC
    1 point
    0
    Parent
    Possibly, but then you have to consider you can spin up possibly arbitrarily many instances of the LLM as well, in which case you might expect the trend to go even faster, as now you’re scaling on 2 axes, and we know parallel compute scales exceptionally well.
    
    Parallel years don’t trade off exactly with years in series, but “20 people given 8 years” might do much more than 160 given one, or 1 given 160, depending on the task.
- anaguma 21 Mar 2025 16:23 UTC
  1 point
  0
  Parent
  No, at some point you “jump all the way” to AGI, i.e. AI systems that can do any length of task as well as professional humans -- 10 years, 100 years, 1000 years, etc.
  
  Isn’t the quadratic cost of context length a constraint here? Naively you’d expect that acting coherently over 100 years would require 10x the context, and therefore 100x the compute/memory, than 10 years.
  - Thomas Kwa 21 Mar 2025 18:59 UTC
    8 points
    4
    Parent
    Humans don’t need 10x more memory per step nor 100x more compute to do a 10-year project than a 1-year project, so this is proof it isn’t a hard constraint. It might need an architecture change but if the Gods of Straight Lines control the trend, AI companies will invent it as part of normal algorithmic progress and we will remain on an exponential / superexponential trend.