Nikola Jurkovic comments on Reactions to METR task length paper are insane

Nikola Jurkovic 10 Apr 2025 20:03 UTC
6 points
0
The base models seem to have topped out their task length around 2023 at a few minutes (see on the plot that GPT-4o is little better than GPT-4). Reasoning models use search to do better.
Note that Claude 3.5 Sonnet (Old) and Claude 3.5 Sonnet (New) have a longer time horizon than 4o: 18 minutes and 28 minutes compared to 9 minutes (Figure 5 in Measuring AI Ability to Complete Long Tasks). GPT-4.5 also has a longer time horizon.
- Cole Wyeth 10 Apr 2025 20:21 UTC
  2 points
  0
  Parent
  Interesting, but still apparently on a significantly slower doubling time than the reasoning models?
  - AnthonyC 11 Apr 2025 18:27 UTC
    6 points
    4
    Parent
    Yes, the reasoning models seem to have accelerated things. ~7 months to ~4 months doubling time on that plot. I’m still not sure I follow why “They found a second way to accelerate progress that we can pursue in parallel to the first” would not cause me to think that progress in total will thereafter be faster. The advent of reasoning models has caused an acceleration of increasing capabilities, not in one or two domains like chess, but across a broad range of domains.
    - Cole Wyeth 11 Apr 2025 18:52 UTC
      2 points
      0
      Parent
      I think this is at least superficially a reasonable interpretation, and if the new linear relationship continues then I’d be convinced it’s right, but I wish you had engaged more with the arguments I made in the post or could be a bit more explicit about which you don’t follow?
      Basically, I just have very low confidence in putting a line through these points because I don’t see a principled reason to expect a linear relationship to hold, and I see some reasons to expect that it should not.
      - AnthonyC 11 Apr 2025 19:53 UTC
        5 points
        3
        Parent
        I also don’t have a principled reason to expect that particular linear relationship, except in general in forecasting tech advancements, I find that a lot of such relationships seem to happen and sustain themselves for longer than I’d expect given my lack of principled reasons for them.
        I did just post another comment reply that engages with some things you said.
        To the first argument: I agree with @Chris_Leong’s point about interest rates constituting essentially zero evidence, especially compared to the number of data points on the METR graph.
        To the second: I do not think the PhD thesis is a fair comparison. That is not a case where we expect anyone to successfully complete a task on their own. PhD students, post-docs, and professional researchers break a long task into many small ones, receive constant feedback, and change course in response to intermediate successes and failures. I don’t think there are actually very many tasks en route to a PhD tat can’t be broken down into predictable, well defined subtasks that take less than a month, and the task of doing the breaking down is itself a fairly short-time-horizon task that gets periodically revised. Even still, many PhD theses end up being, “Ok, you’ve done enough total work, how do we finagle these papers into a coherent narrative after the fact?” Plus, overall, PhD students, those motivated to go to grad school with enough demonstrated ability to get accepted into PhD programs, fail to get a PhD close to half the time even with all that.
        I imagine you could reliably complete a PhD in many fields with a week-long time horizon, as long as you get good enough weekly feedback from a competent advisor. 1: Talk to advisor about what it takes to get a PhD. 2: Divide into a list of <1 week-long tasks. 3) Complete task 1, get feedback, revise list. 4) Either repeat the current task or move on to the new next task, depending on feedback. 5) Loop until complete. 5a) Every ten or so loops, check overall progress to date against the original requirements. Evaluate whether overall pace of progress is acceptable. If not, come up with possible new plans and get advisor feedback.
        As far as not believing the current paradigm could reach AGI, which paradigm do you mean? I don’t think “random variation and rapid iteration” is a fair assessment of the current research process. But even if it were, what should I do with that information? Well, luckily we have a convenient example of what it takes for blind mutations with selection pressure to raise intelligence to human levels: us! I am pretty confident saying that current LLMs would outperform, say, Australopithecus, on any intellectual ability, but not Home sapiens. So that happens in a few million years, let’s say 200k generations of 10-100k individuals each, in which intelligence was one of many, many factors weakly driving selection pressure with at most a small number of variations per generation. I can’t really quantify how much human intelligence and directed effort speed up progress compared to blind chance, but consider that 1) a current biology grad student can do things with genetics in an afternoon that evolution needs thousands of generations and millions of individuals or more to do, and 2) the modern economic growth rate, essentially a sum of the impacts of human insight on human activity, is around 15000x faster than it was in the paleolithic. Naively extrapolated, this outside view would tell me that science and engineering can take us from Australopithecus-level to human-level in about 13 generations (unclear which generation we’re on now). The number of individuals needed per generation is dependent on how much we vary each individual, but plausibly in the single or double digits.
        My disagreement with your conclusion from your third objection is that scaling inference time compute increases performance within a generation, but that’s not how the iteration goes between generations. We use reasoning models with more inference time compute to generate better data to train better base models to more efficiently reproduce similar capability levels with less compute to build better reasoning models. So if you build the first superhuman coder and find it’s expensive to run, what’s the most obvious next step in the chain? Follow the same process as we’ve been following for reasoning models and if straight lines on graphs hold, then six months later we’ll plausibly have one that’s a tenth the cost to run. Repeat again for the next six months after that.
        Vladimir_Nesov 15 Apr 2025 1:43 UTC
        6 points
        3
        Parent
        
        We use reasoning models with more inference time compute to generate better data to train better base models to more efficiently reproduce similar capability levels with less compute to build better reasoning models.
        
        This kind of thing isn’t known to meaningfully work, as something that can potentially be done on pretraining scale. It also doesn’t seem plausible without additional breakthroughs given the nature and size of verifiable task datasets, with things like o3-mini getting ~matched on benchmarks by post-training on datasets containing 15K-120K problems. All the straight lines for reasoning models so far are only about scaling a little bit, using scarce resources that might run out (verifiable problems that help) and untried-at-more-scale algorithms that might break down (in a way that’s hard to fix). So the known benefit is still plausible to remain a one-time improvement, extending it significantly (into becoming a new direction of scaling) hasn’t been demonstrated.
        
        I think even remaining as a one-time improvement, long reasoning training might still be sufficient to get AI takeoff within a few years just from pretraining scaling of the underlying base models, but that’s not the same as already believing that RL post-training actually scales very far by itself. Most plausibly it does scale with more reasoning tokens in a trace, getting from the current ~50K to ~1M, but that’s separate from scaling with RL training all the way to pretraining scale (and possibly further).
        Cole Wyeth 13 Apr 2025 14:59 UTC
        2 points
        −2
        Parent
        I imagine you could reliably complete a PhD in many fields with a week-long time horizon, as long as you get good enough weekly feedback from a competent advisor. 1: Talk to advisor about what it takes to get a PhD. 2: Divide into a list of <1 week-long tasks. 3) Complete task 1, get feedback, revise list. 4) Either repeat the current task or move on to the new next task, depending on feedback. 5) Loop until complete. 5a) Every ten or so loops, check overall progress to date against the original requirements. Evaluate whether overall pace of progress is acceptable. If not, come up with possible new plans and get advisor feedback
        I think it’s nearly impossible to create unexpected new knowledge this way.
        As far as not believing the current paradigm could reach AGI, which paradigm do you mean? I don’t think “random variation and rapid iteration” is a fair assessment of the current research process. But even if it were, what should I do with that information? Well, luckily we have a convenient example of what it takes for blind mutations with selection pressure to raise intelligence to human levels: us! I am pretty confident saying that current LLMs would outperform, say, Australopithecus, on any intellectual ability, but not Home sapiens. So that happens in a few million years, let’s say 200k generations of 10-100k individuals each, in which intelligence was one of many, many factors weakly driving selection pressure with at most a small number of variations per generation. I can’t really quantify how much human intelligence and directed effort speed up progress compared to blind chance, but consider that 1) a current biology grad student can do things with genetics in an afternoon that evolution needs thousands of generations and millions of individuals or more to do, and 2) the modern economic growth rate, essentially a sum of the impacts of human insight on human activity, is around 15000x faster than it was in the paleolithic. Naively extrapolated, this outside view would tell me that science and engineering can take us from Australopithecus-level to human-level in about 13 generations (unclear which generation we’re on now). The number of individuals needed per generation is dependent on how much we vary each individual, but plausibly in the single or double digits.
        I can’t parse this.
        My disagreement with your conclusion from your third objection is that scaling inference time compute increases performance within a generation, but that’s not how the iteration goes between generations. We use reasoning models with more inference time compute to generate better data to train better base models to more efficiently reproduce similar capability levels with less compute to build better reasoning models. So if you build the first superhuman coder and find it’s expensive to run, what’s the most obvious next step in the chain? Follow the same process as we’ve been following for reasoning models and if straight lines on graphs hold, then six months later we’ll plausibly have one that’s a tenth the cost to run. Repeat again for the next six months after that.
        You’re probably right about distilling CoT.
        AnthonyC 13 Apr 2025 16:52 UTC
        4 points
        1
        Parent
        You’re right, but creating unexpected new knowledge is not a PhD requirement. I expect it’s pretty rare that a PhD students achieves that level of research.
        It wasn’t a great explanation, sorry, and there are definitely some leaps, digressions, and hand-wavy bits. But basically: Even if current AI research were all blind mutation and selection, we already know that that can yield general intelligence from animal-level-intelligence because evolution did it. And we already have various examples of how human research can apply much greater random and non-random mutation, larger individual changes, higher selection pressure in a preferred direction, and more horizontal transfer of traits than evolution can, enabling (very roughly estimated) ~3-5 OOMs greater progress per generation with fewer individuals and shorter generation times.
        Saw your edit above, thanks.
        Cole Wyeth 13 Apr 2025 18:20 UTC
        2 points
        −2
        Parent
        You’re right, but creating unexpected new knowledge is not a PhD requirement. I expect it’s pretty rare that a PhD students achieves that level of research.
        I do weakly expect it to be necessary to reach AGI though. Also, I personally wouldn’t want to do a PhD that didn’t achieve this!
        It wasn’t a great explanation, sorry, and there are definitely some leaps, digressions, and hand-wavy bits. But basically: Even if current AI research were all blind mutation and selection, we already know that that can yield general intelligence from animal-level-intelligence because evolution did it. And we already have various examples of how human research can apply much greater random and non-random mutation, larger individual changes, higher selection pressure in a preferred direction, and more horizontal transfer of traits than evolution can, enabling (very roughly estimated) ~3-5 OOMs greater progress per generation with fewer individuals and shorter generation times.
        Okay, then I understand the intuition but I think it needs a more rigorous analysis to even make an educated guess either way.
        Saw your edit above, thanks.
        No, thank you!
        AnthonyC 13 Apr 2025 18:40 UTC
        4 points
        0
        Parent
        I personally wouldn’t want to do a PhD that didn’t achieve this!
        
        Agreed. It was somewhere around reason #4 I quit my PhD program as soon as I qualified for a masters in passing.