I think this is an interesting analysis. But it seems like you’re updating more strongly on it than I am. Here are some thoughts.
The Forethought SIE model doesn’t seem to apply well to this data:
In the Forethought model (without the ceiling add-on), the growth rate of “cognitive labor” is assumed to equal the growth rate of “software efficiency”, which is in turn assumed to be proportional to the growth rate of cumulative “cognitive labor” (holding experiment compute fixed, at whatever level). Whether this fooms or fizzles is determined only by whether this constant of proportionality r is above or below 1.
In this framework, and with fixed experiment compute, the game is to find a trend for “software efficiency”, then find a trend for cumulative “cognitive labor”, then see which one is growing faster. So it matters quite a lot how you measure these two trends. (Do you use the raw multiplier for software efficiency, or the number of OOMs? And what about “cognitive labor”?)
It’s worth stating that regardless of the value of r, this model predicts that (in the right units) steady growth in cognitive labor (or cumulative cognitive labor) yields steady growth in software efficiency. (In the usual units, this means that the plot of log(2025-FLOP per FLOP) vs log(researcher-hours) is a straight line with slope r.) A plot that curves downward or “hits a wall” seems like evidence against this model’s applicability to the data. This could be due to ceiling effects (as habryka mentioned), or LoC being a poor proxy for researcher-hours (as Tao mentioned), or violation of the fixed-experiment-compute assumption, or other limitations of the model.
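To make that concrete, here is a minimal simulation of the loop as I’ve described it (my own toy discretization with made-up starting values, not Forethought’s actual code):

```python
# Toy discretization of the SIE feedback loop described above (my own sketch,
# not Forethought's model code). Cumulative cognitive labor C drives software
# efficiency S = C**r, and the current labor force L grows in lockstep with S,
# so log(S) vs log(C) is a straight line with slope r by construction.
def simulate(r, steps=400, dt=0.01):
    C = 1.0  # cumulative cognitive labor
    for _ in range(steps):
        S = C ** r   # software efficiency proportional to C^r
        L = S        # labor grows at the same rate as efficiency
        C += L * dt  # accumulate cognitive labor
    return S

for r in (0.8, 1.0, 1.2):
    print(f"r = {r}: efficiency multiplied ~{simulate(r):,.0f}x over the run")
```

Below r = 1 the feedback peters out into decelerating, polynomial-style growth; above it, growth accelerates without bound (finite-time blowup in the continuous limit). And since S = C^r by construction, the log-log plot of efficiency against cumulative labor is exactly the straight line of slope r mentioned above.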
When I naively apply the Forethought SIE model to data on AGI labs, I get the opposite result:
Looking at OpenAI from January 2023[1] until today, I estimate that total headcount has grown at 2.3x per year,[2] while experiment compute available (in H100e) has grown at 3.7x per year. My median estimate of the rate of algorithmic progress over that time period is 10x/year,[3] exceeding the growth rates of both inputs.[4] Naively applying the Forethought model, we can see that r>1 by a decent margin,[5] so I should expect that we’re headed for a “software-only singularity”. However…
I think the most important unknowns for takeoff speeds require splitting up the “cognitive labor” abstraction:
There are certain kinds of cognitive labor (e.g. writing code to implement ideas for research experiments) which are subject to compute bottlenecks—in the sense that even with an unlimited quantity of (that kind of) labor, you would eventually be rate-limited by how many experiments (of unit compute scale) could be run at one time. (One way to formalize this is sketched below, after the next paragraph.) See “Will Compute Bottlenecks Prevent a Software Intelligence Explosion?” (also by Davidson/Forethought!) for good discussion of this. If this were the only kind of “cognitive labor” which could meaningfully yield algorithmic progress, then my previous inference that we’re headed for an SIE would be totally wrong.
There may be other kinds of useful cognitive labor which are not subject to compute bottlenecks, most prominently research taste / experiment selection skill. We tried to model this in the AI Futures Model. Unfortunately very little data exists on the relationship between algorithmic progress and research taste.[6] It’s also unclear how quickly the research taste of automated researchers will improve compared to other capabilities.[7] The plausibility of a “taste-only singularity” (which currently controls most of my probability mass on very fast takeoffs) depends crucially on these quantities!
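One standard way to formalize the compute bottleneck described above (my notation; not necessarily the exact functional form Forethought or the AI Futures Model uses) is a CES aggregate of bottlenecked labor $L$ and experiment compute $K$:

$$\text{research throughput} \;\propto\; \bigl(\lambda L^{\rho} + (1-\lambda)K^{\rho}\bigr)^{1/\rho}, \qquad \rho < 0.$$

Because $\rho < 0$, sending $L \to \infty$ at fixed $K$ saturates throughput at $(1-\lambda)^{1/\rho} K$: unlimited implementation labor buys only a bounded speedup, and $\rho \to -\infty$ recovers the hard cap $\min(L, K)$. A “taste-only singularity” then requires some input (like experiment selection) that sits outside this aggregate.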
Cognitive quality of future AIs being so high that one hour of AIs researching is equivalent to exponentially large quantities of human researcher hours, even if they don’t train much more efficiently to the same capability level. This is the most important question to answer and something I hope METR does experiments on in Q1.
I also hope METR does this!
Mainly because there is a trend break in the labor time series around the time ChatGPT was released, and also because my algorithmic progress estimate is mostly based on post-2023 data.
The growth rate of research staff might be slower due to an increasing fraction of non-research (e.g. product, sales, marketing) staff post-2023. But it seems unlikely to be faster than the growth rate of the total.
I think ECI is the current best metric available for measuring algorithmic progress. The ECI paper does a bunch of different analyses which give different central estimates, all of which are above 3x/year (except for one in Appendix C.1.2, which the authors state introduces a downward bias). My preferred method is doing a joint linear regression of ECI as a function of training compute and release date, then looking at the slope of the resulting “lines of constant ECI”. Doing this on the whole data set gives ~20x/year; filtering to “non-distilled” models gives a similar answer; filtering to only models which advanced the frontier of compute efficiency at release gives ~30x per year; filtering on both conditions gives more like 8x per year. Code for these analyses can be found here. It’s unclear which of these is most applicable to the question of future software intelligence explosions, but it seems safe to say that 3x per year is too low. See also Aaron Scher’s recent post “Catch-Up Algorithmic Progress Might Actually be 60x per Year”.
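To make the method concrete, here is a minimal version of that regression on fabricated stand-in numbers (the real analysis uses the ECI dataset; I constructed these values so the answer comes out to exactly 10x/year):

```python
import numpy as np

# Fit ECI ~ a*log10(training compute) + b*(release date) + c by least squares.
# All numbers below are made up for illustration, not real ECI data.
log_compute = np.array([24.0, 24.5, 25.0, 25.3, 25.8, 26.2])   # log10(FLOP)
release_yr  = np.array([2022.5, 2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
eci         = np.array([1.0, 2.0, 3.0, 3.8, 4.8, 5.7])

X = np.column_stack([log_compute, release_yr, np.ones_like(eci)])
(a, b, c), *_ = np.linalg.lstsq(X, eci, rcond=None)

# Along a "line of constant ECI", a*dlog10(compute) + b*dt = 0: the compute
# needed for fixed capability falls by b/a OOMs per year.
print(f"algorithmic progress ~ {10 ** (b / a):.1f}x per year")
```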
It might be better to compare the rate of algorithmic progress specifically achieved by OpenAI to the growth rate of OpenAI compute and labor. The data is much sparser, however, and I don’t see a strong reason why it should differ in one direction or the other. It’s also hard to separate out the effects of knowledge diffusion between labs. Alternatively, I could use estimates for worldwide AGI-researcher labor growth rates and worldwide AGI-research compute growth rates, but I am less certain about these as well.
The specific calculated value of r would depend on the value of α. If α were 1 it would be log(10)/log(2.3) = 2.76; if α were 0 it would be log(10)/log(3.7) = 1.76. If α were in between, r would be in between.
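The interpolation between the two endpoint cases isn’t spelled out above, so this sketch assumes a geometric weighting (my assumption); the α = 1 and α = 0 endpoints match the numbers in the footnote:

```python
from math import log

# Footnote 5's arithmetic, with an assumed geometric interpolation between
# labor growth (alpha = 1) and compute growth (alpha = 0).
labor_growth, compute_growth, progress = 2.3, 3.7, 10.0

for alpha in (1.0, 0.5, 0.0):
    input_growth = labor_growth ** alpha * compute_growth ** (1 - alpha)
    r = log(progress) / log(input_growth)
    print(f"alpha = {alpha}: r = {r:.2f}")  # 2.76, 2.15, 1.76
```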
For the purposes of our forecast, we surveyed various AI researchers to estimate the multiplier on the rate of algorithmic progress corresponding to median vs top-human-level research taste, and got a central estimate of 3.7x. It would be better to do an actual experiment. (Toy example: have many, many different AI researchers attempt something like the NanoGPT speedrun, but individually, and measure the variance. Seems hard to do well.)
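Here is the kind of toy analysis I have in mind for that experiment, with an entirely fabricated skill distribution (sigma is picked so the top/median gap lands near our 3.7x survey estimate, purely for illustration):

```python
import numpy as np

# Toy analysis of the speedrun experiment sketched above. Assume each
# researcher's individually-achieved speedup is lognormal; the top-vs-median
# research-taste multiplier is then just a quantile ratio.
rng = np.random.default_rng(0)
sigma = 0.56  # made up, picked to land near the survey's 3.7x estimate
speedups = rng.lognormal(mean=0.0, sigma=sigma, size=100_000)

median = np.median(speedups)
top = np.quantile(speedups, 0.99)  # treat the 99th percentile as "top-human"
print(f"top/median taste multiplier: {top / median:.1f}x")
```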
As measured by effect on takeoff speeds, this single parameter is where most of our uncertainty comes from. Our forecast looks at how quickly AI capabilities have moved through the human range in a bunch of different domains, and uses that as a reference class for an adjusted estimate. This is super uncertain.
(In the usual units, this means that the plot of log(2025-FLOP per FLOP) vs log(researcher-hours) is a straight line with slope r.) A plot that curves downward or “hits a wall” seems like evidence against this model’s applicability to the data.
Note there are no log-log plots in the data. They’re performance vs LoC and log(performance) vs LoC, and the same for stars. I don’t think we’re at an absolute ceiling, since two more improvements came out in the past week; they’ve just gotten smaller and taken more code to implement.
I need to think about this algorithmic progress being 10x/year thing. It feels like some assumptions are violated, given how much the data seem to give inconsistent answers; maybe there’s a prospective vs retrospective difference. Or do you think progress has just sped up in the past couple of years?
Progress probably has sped up in the past couple of years. And training compute scaling has, if anything, slowed down (it hasn’t accelerated, anyway). So yes, I think “software progress” probably has sped up in the past couple of years.
I haven’t looked into whether you can see the algorithmic progress speedup in the ECI data using the methodology I was describing. The data would be very sparse if you e.g. tried to restrict to pre-2024 models for greater alignment with the Algorithmic Progress in Language Models paper, which is where the 3x per year number comes from.
Also, that 3x per year number is only measuring pre-training improvements. Post-training (1) didn’t really exist before 2022 and (2) was notably accelerated in 2024 by the introduction of RLVR. I wouldn’t be confident that pre-training algorithmic progress alone is much faster than 3x per year today. (As rumor has it, there’s substantial divergence between the different AGI companies on the rate of pretraining progress.)