In actuality, the study doesn’t say much about AGI, except to provide evidence against the most aggressive forecasts.
This feels quite wrong to me. Surely if AIs were completing 1-month-long self-contained software engineering tasks (e.g. what a smart intern might do in their first month), that would be a big update toward the plausibility of AGI within a few years! So, I think the study does say something meaningful about AGI other than just evidence against shorter timelines[1]. I agree AGI might not happen within a few years of 1-month-long software tasks, and that we'd have a richer understanding at the time, but the basic case in favor feels very strong to me.
(At a more basic level, if you would have updated a decent amount toward relatively longer timelines had the paper shown 20-year timelines to 1-month SWE tasks, then you must update toward relatively shorter timelines given that the trend actually implies 5 years, with a possibility of more like 2.5 due to the more recent, faster trend. This is by conservation of expected evidence. This isn't to say that you have to directionally update toward shorter timelines based on these results: e.g. maybe you expected an even faster trajectory, and this seemed surprisingly slow, extending your timelines.)

[1] I edited this sentence in because I think my comment was originally confusing.
Surely if AIs were completing 1-month-long self-contained software engineering tasks (e.g. what a smart intern might do in their first month), that would be a big update toward the plausibility of AGI within a few years!
Agreed. But that means time from today to AGI is the sum of:
(1) Time for task horizons to increase from 1.5 hours (the preliminary o3 result) to 1 month
(2) Plausibly "a few years" to progress from 1-month-coder to AGI.
If we take the midpoint of Thomas Kwa’s “3-4 months” guess for subsequent doubling time, we get 23.8 months for (1). If we take “a few years” to be 2 years, we’re in 2029, which is farther out than “the most aggressive forecasts” (e.g. various statements by Dario Amodei, or the left side of the probability distribution in AI 2027).
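For concreteness, here is a rough sketch of that arithmetic. The 1.5 hours is the preliminary o3 figure and 3.5 months is the midpoint of the "3-4 months" guess, both from above; the ~167-hour working month is my own assumption about how "1 month" cashes out, so the exact output shifts if you pick a different figure.

```python
import math

# Rough extrapolation sketch. Assumed inputs:
#   1.5 hours  - current 50%-success task horizon (the preliminary o3 result)
#   167 hours  - one working month (an assumption about how "1 month" cashes out)
#   3.5 months - doubling time, midpoint of the "3-4 months" guess
current_horizon_hours = 1.5
target_horizon_hours = 167.0
doubling_time_months = 3.5

doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)  # ~6.8
months_for_step_1 = doublings_needed * doubling_time_months                 # ~23.8

print(f"~{doublings_needed:.1f} doublings, ~{months_for_step_1:.1f} months for (1)")
```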
And given the starting assumptions, those are fairly aggressive numbers. Thomas’ guess that “capability on more realistic tasks will follow the long-term 7-month doubling time” would push this out another two years, and one could propose longer timelines from one-month-coder to AGI.
Of course this is not proof of anything – for instance, task horizon doubling times could continue to accelerate, as envisioned in AI 2027 (IIRC), and one could also propose shorter timelines from one-month-coder to AGI. But I think the original statement is fair: even if we use 3-4 months as the doubling time, isn't this an update away from "the most aggressive forecasts"?
(When I wrote this, I was primarily thinking about Dario projecting imminent geniuses-in-a-data-center, and similar claims that AGI is coming within the next couple of years or even is already here.)
To be clear, I agree it provides evidence against very aggressive timelines (if I had 2027 medians I would have updated to longer), I was disagreeing with “the study doesn’t say much about AGI, except to”. I think the study does provide a bunch of evidence about when AGI might come! (And it seems you agree.) I edited my original comment to clarify this as I think I didn’t communicate what I was trying to say well.
If the trend isn’t inherently superexponential and continues at 7-month doubling times by default, it does seem hard to get to AGI within a few years. If it’s 4 months, IIRC in my timelines model it’s still usually after 2027, but it can be close because of intermediate AI R&D speedups, depending on how big you think the gaps between benchmarks and the real world are. I’d have to go back and look if we want a more precise answer. If you add error bars around the 4-month doubling time, that increases the chance of AGI soon, of course.
If you treat the shift from 7-month to 4-month doubling times as weak evidence of a superexponential trend, that might be evidence in favor of 2027 timelines, depending on your prior.
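To make the sensitivity to the doubling-time assumption concrete, here is a minimal sketch using the same assumed 1.5-hour starting horizon and ~167-hour target as above. The "superexponential" variant simply shrinks the doubling time by a fixed factor after each doubling; that is an illustrative toy model (with a made-up 10% shrink factor), not the AI 2027 model or anyone's actual timelines model.

```python
import math

START_HOURS = 1.5     # assumed current 50%-success horizon (preliminary o3 result)
TARGET_HOURS = 167.0  # assumed hours in one working month

def months_exponential(doubling_time_months: float) -> float:
    """Months to reach TARGET_HOURS with a fixed doubling time."""
    return math.log2(TARGET_HOURS / START_HOURS) * doubling_time_months

def months_superexponential(initial_doubling_months: float, shrink: float) -> float:
    """Toy superexponential: each successive (whole) doubling takes `shrink` times as long."""
    horizon, dt, months = START_HOURS, initial_doubling_months, 0.0
    while horizon < TARGET_HOURS:
        months += dt
        horizon *= 2
        dt *= shrink
    return months

print(f"7-month doublings:                 ~{months_exponential(7):.0f} months")
print(f"4-month doublings:                 ~{months_exponential(4):.0f} months")
print(f"4-month, 10% faster each doubling: ~{months_superexponential(4, 0.9):.0f} months")
```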
IMO how you should update on this just depends on your prior views (echoing Ryan’s comment). Daniel had 50% AGI by 2027, and he did (and should) update to a bit lower. I’m at more like 20-25% and I think I stay about the same (and I think Ryan is similar). I think if you have more like <=10% you should probably update upward.
Oops, I forgot to account for the gap from a 50% success rate to an 80% success rate (and actually I’d argue that the target success rate should be higher than 80%).
Also potential factors for “task messiness” and the 5-18x context penalty, though as you’ve pointed out elsewhere, the latter should arguably be discounted.
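One way to see how much these corrections matter: a fixed multiplicative handicap doesn't change the doubling time, it just adds log2(penalty) extra doublings to the extrapolation above, each costing one doubling time. A minimal sketch, where the 3.5-month doubling time and the x5 reliability-gap factor are assumptions on my part; only the 5-18x context-penalty range comes from the discussion here.

```python
import math

DOUBLING_TIME_MONTHS = 3.5  # assumed, midpoint of the "3-4 months" guess

def extra_months(penalty_factor: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Extra calendar time implied by a fixed multiplicative handicap on the horizon."""
    return math.log2(penalty_factor) * doubling_time

# Illustrative penalty factors (assumed for the sketch, not measured values):
for label, penalty in [("50% -> 80% reliability gap (assumed x5)", 5),
                       ("context penalty, low end (x5)", 5),
                       ("context penalty, high end (x18)", 18)]:
    print(f"{label}: ~{extra_months(penalty):.0f} extra months")
```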
Personally, I updated toward shorter timelines upon seeing a preliminary version of their results, which just showed the more recent doubling trend, and then updated most of the way back on seeing the longer-run trend. (Or maybe even toward slightly longer timelines than I started with, I forget.)
if AIs were completing 1-month-long self-contained software engineering tasks (e.g. what a smart intern might do in their first month)
This doesn’t seem like a good example to me.
The sort of tasks we’re talking about are extrapolations of current benchmark tasks, so it’s more like: what a programming savant with almost no ability to interact with colleagues or search out new context might do in a month given a self-contained, thoroughly specced and vetted task.
I expect current systems will naively scale to that, but not to the abilities of an arbitrary intern because that requires skills that aren’t tested in the benchmarks.