This is a linkpost to a new Substack article from MIT FutureTech explaining our recent paper On the Origins of Algorithmic Progress in AI.
We demonstrate that some algorithmic innovations have efficiency gains which get larger as pre-training compute increases. These scale-dependent innovations constitute the majority of pre-training efficiency gains over the last decade, which may imply that what looks like algorithmic progress is driven by compute scaling rather than many incremental innovations.
From the paper, our core contributions are:
We find that most algorithmic innovations we experimentally evaluate have small, scale-invariant efficiency improvements, yielding less than a 10× compute-efficiency gain overall and representing less than 10% of total improvements when extrapolated to the 2025 compute frontier (2 × 10²³ FLOPs). This suggests that scale-invariant algorithmic progress contributes only a minor share of overall efficiency improvements.
We find two strongly scale-dependent algorithmic innovations: LSTMs to Transformers, and Kaplan to Chinchilla re-balancing. Together, these account for 91% of total efficiency gains when extrapolating to the 2025 compute frontier. This implies that algorithmic progress for small-scale models is several orders of magnitude smaller than previously thought.
We show that in the presence of scale-dependent innovations, not only do efficiency gains require continued compute investment, but the rate of algorithmic progress strongly depends on your choice of reference algorithm. In other words, the rate of progress in successive models can appear exponential relative to one baseline algorithm, yet be zero relative to another.
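To make that last point concrete, here is a toy sketch (not the paper's actual fits: the exponents ALPHA_OLD and ALPHA_NEW and the reference scale C0 below are made-up, purely illustrative values). If the baseline and the improved algorithm follow power-law loss curves with different exponents, the compute-equivalent gain of the improved algorithm grows with compute, and the measured "progress" vanishes if you take the improved algorithm itself as the reference:

```python
# Toy sketch of scale-dependent, reference-dependent progress.
# NOT the paper's actual fits: exponents and reference scale are illustrative.

C0 = 1e18          # reference compute scale (FLOPs), illustrative
ALPHA_OLD = 0.05   # assumed loss-vs-compute exponent for the old algorithm
ALPHA_NEW = 0.07   # assumed (steeper) exponent for the improved algorithm

def compute_equivalent_gain(C, alpha_ref, alpha_new):
    """How much compute the reference algorithm would need to match the
    new algorithm's loss at compute C, expressed as a multiple of C.
    Solves (C_ref / C0)^(-alpha_ref) = (C / C0)^(-alpha_new) for C_ref."""
    C_ref = C0 * (C / C0) ** (alpha_new / alpha_ref)
    return C_ref / C

for C in [1e19, 1e21, 2e23]:  # up to roughly the 2025 compute frontier
    vs_old = compute_equivalent_gain(C, ALPHA_OLD, ALPHA_NEW)
    vs_new = compute_equivalent_gain(C, ALPHA_NEW, ALPHA_NEW)  # always 1x
    print(f"C = {C:.0e} FLOPs: {vs_old:7.1f}x vs. old baseline, "
          f"{vs_new:.1f}x vs. new baseline")
```

With these made-up exponents, the gain over the old baseline grows from a few times at 10¹⁹ FLOPs to over 100× near the 2025 frontier, while measured against the new algorithm itself it stays at 1×, which is the reference-dependence described above.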
MIT FutureTech is an interdisciplinary lab at the intersection of computer science and economics, focused specifically on trends in AI and computing, and funded in part by Coefficient Giving.
This seems to strongly suggest that a “software-only” intelligence explosion is unlikely, or at least, would be quite slow.
It looks like the opposite? If 91% of progress happened due to two changes in scaling regimes, who knows what is going to happen after the third.
I agree that there's an issue with small sample size. But other than two software improvements that weren't that big a change when they were introduced, but modified the slope and so became bigger and bigger deals as hardware/money scaled up, none of the other software improvements have been that large. That suggests that if hardware and financial scaling goes away and you're left with only software scaling, or if you're mid-FOOM and hardware and financial improvements are slow to accelerate but software is in a feedback loop, then software improvements by themselves (in the absence of financial and hardware changes) are likely to be more of an intelligence damp squib than an intelligence explosion.
Another related issue is that there has been suspicion that algorithmic improvements should eventually show diminishing returns. If you remove the two that produced very large improvements primarily by changing the slope for the other two variables, then you could squint at all the other software improvements and claim we were already somewhere into diminishing returns.
Disclaimer: I am not an expert in this area, and I would really like to see a reanalysis of this done by people who are, such as the 2027 folks, in light of this observation. As a non-expert, my best guess is that this will make quite a big difference to the whole recursive-self-improvement step in timelines: it looks like it makes getting big effects from purely algorithmic improvements harder.
Except that we had Beren claim that SOTA algorithmic progress is mostly data progress. Which could also mean that the explosion may be based on architectures that have yet to be found, like the right version of neuralese. As far as I am aware, architectures like the one in the Coconut paper by Meta or this paper on arXiv forget everything once they output a token, meaning that they are unlikely to be the optimal architecture.
As someone who has written that post, I think the title nowadays overclaims, and the only reason I chose it was that it was the title originally used.
I’d probably argue that it explains a non-trivial amount of progress, but nowadays I’d focus way more on compute being the main driver of AI progress in general.
And yes, new architectures/loss functions could change the situation, but the paper is evidence that we should expect such architectures to need a lot of compute before they become more efficient.
There are isoFLOPs for MoE models with different levels of sparsity, including dense transformers (on the same dataset), in this Jan 2025 paper, see Figure 12, left (in Appendix D.4). Figure 11 estimates compute multipliers given compute optimal amounts of data. My takeaway is that 1:8 sparsity gives a 3x compute multiplier (compared to dense), but requires 3x more tokens per active param for compute optimality, and similarly 1:32 sparsity gives a 6x compute multiplier, again at the cost of 6x more tokens per active param than would be compute optimal for a dense transformer.
Given the very different compute optimal amounts of data (between MoE and dense), and the tendency to wildly overtrain small models, the literature that doesn't actually study the isoFLOPs (and only trains a dense/MoE counterpart as an afterthought in addition to the main thing the paper looks into) is going to underestimate the difference. The difference only appears in full when the dense transformer and the MoE transformer for the same amount of FLOPs each get appropriate, and very different, numbers of tokens per active param (and correspondingly different numbers of active params).
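A back-of-the-envelope sketch of how different those allocations are, assuming the standard C ≈ 6 · N_active · D approximation and taking the ~20 tokens/param dense optimum plus the 3x/6x ratio factors quoted above as given (all illustrative, not values computed from the paper):

```python
# Compute-optimal allocation at a fixed FLOP budget, dense vs. MoE.
# Assumes C ~= 6 * N_active * D; the 20 tokens-per-param dense ratio and
# the 3x / 6x factors are the assumed numbers from the discussion above.

import math

C = 1e23          # fixed training FLOP budget (illustrative)
DENSE_RATIO = 20  # roughly Chinchilla-optimal tokens per param, dense

configs = {
    "dense":    {"ratio_mult": 1, "compute_mult": 1},
    "MoE 1:8":  {"ratio_mult": 3, "compute_mult": 3},
    "MoE 1:32": {"ratio_mult": 6, "compute_mult": 6},
}

for name, cfg in configs.items():
    ratio = DENSE_RATIO * cfg["ratio_mult"]   # tokens per *active* param
    n_active = math.sqrt(C / (6 * ratio))     # compute-optimal active params
    tokens = ratio * n_active                 # compute-optimal training tokens
    print(f"{name:8s}: {n_active/1e9:5.1f}B active params, "
          f"{tokens/1e12:4.2f}T tokens, "
          f"~{cfg['compute_mult']}x dense-compute equivalent")
```

Under these assumptions, at a fixed 10²³ FLOPs the compute-optimal MoE runs end up with noticeably fewer active params and roughly 2x the tokens of the dense run, so reusing the dense allocation for the MoE comparison understates the MoE's advantage.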