I’m cross-posting my guest post on Epoch’s Gradient Updates newsletter, in which I describe some new research from my team at UChicago’s XLab — roughly, the algorithmic improvements that most improve capabilities at scale are the ones that require the most compute to find and validate.
This week’s issue is a guest post by Henry Josephson, who is a research manager at UChicago’s XLab and an AI governance intern at Google DeepMind.
In the AI 2027 scenario, the authors predict a fast takeoff of AI systems recursively self-improving until we have superintelligence in just a few years.
Could this really happen? Whether it’s possible may depend on if a software intelligence explosion — a series of rapid algorithmic advances that lead to greater AI capabilities — occurs.
A key crux in the debate about the possibility of a software intelligence explosion comes down to whether key algorithmic improvements scale from small models to larger models. If the most important algorithmic advances need a large amount of compute to demonstrate their effectiveness, then we should think that a software-only intelligence explosion is less likely. And so a fast takeoff could be bottlenecked by compute constraints.
In a recent preprint, my team at UChicago’s XLab — Spencer Guo, Teddy Foley, Jack Sanderson, Anqi Qu, and I — reviewed the literature and performed some small-scale tests to investigate this crux.
We found that, historically, the largest algorithmic advances couldn’t just be scaled up from smaller versions. They required large amounts of compute to develop and validate, making a software intelligence explosion more difficult to achieve than it might seem.
Are the best algorithmic improvements compute-dependent?
In our survey of the literature, we classified algorithmic advances as either:
Compute-independent, meaning the benefits of the advance appear at small scales and continue or accelerate as scale increases
Compute-dependent, meaning the benefits of the advance only appear with high amounts of compute, and the algorithmic advance provides little benefit (or even hurts performance) at low compute levels
Our investigation began by examining the architectural evolution of language models, starting around 2017 and getting as close to the present as we could. We focused on well-documented model families where implementation details were publicly available: early GPTs, BERT variants, LLaMAs, and DeepSeek.
We did our best to focus on successive model generations within a family, on the hypothesis that this is as close as we can get to controlling for variations in training data.[1] This meant that we could be more confident that increases in performance came from algorithmic advances — especially if we could control for scale.
This approach revealed patterns:
Within the GPTs, we saw the rise of sparse attention.
BERT variants showed us how layer normalization evolved into RMSNorm.
LLaMA iterations demonstrated the progression from standard attention to grouped-query attention.
DeepSeek’s releases, particularly constrained by hardware limitations, showed algorithms that enabled frontier performance without frontier compute.
These observations helped us build our experimental design. We selected algorithms that appeared repeatedly across model families — the ones that stuck around seemed more likely to represent real improvements. Given our compute constraints, we also picked algorithms that we could feasibly implement and test: multi-headed attention made the cut, but a full mixture-of-experts architecture didn’t.
The literature also informed how we classified an algorithm as “compute-dependent” or “compute-independent.” When we saw that certain algorithms only appeared in the largest models while others showed up in both large and small models, it suggested the compute-dependent vs. compute-independent distinction that we ended up testing.
Take, for example, multi-query attention (MQA), which reduces the amount of memory the attention mechanism needs. As we note in the paper,
At the time of its introduction in 2019, MQA was tested primarily on small models where memory constraints were not a major concern. As a result, its benefits were not immediately apparent. However, as model sizes grew, memory efficiency became increasingly important, making MQA a crucial optimization in modern LLMs (including Falcon, PaLM, and StarCoder).[2]
MQA, then, with minimal benefit at small scale but a much larger benefit at larger scales, is a great example of the more general class of compute-dependent innovations.
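To make the mechanism concrete, here is a minimal sketch of multi-query attention (illustrative only, not our experimental code): every query head attends using a single shared key/value head, so the per-token KV cache shrinks by a factor of the head count.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Minimal multi-query attention: n_heads query heads share one K/V head.

    x:   (batch, seq, d_model)
    w_q: (d_model, n_heads * d_head)  -- one projection per query head
    w_k: (d_model, d_head)            -- a single shared key head
    w_v: (d_model, d_head)            -- a single shared value head
    (Causal masking omitted for brevity.)
    """
    b, s, d_model = x.shape
    d_head = w_k.shape[1]

    q = (x @ w_q).view(b, s, n_heads, d_head).transpose(1, 2)  # (b, h, s, d_head)
    k = (x @ w_k).unsqueeze(1)  # (b, 1, s, d_head), broadcast over query heads
    v = (x @ w_v).unsqueeze(1)  # (b, 1, s, d_head)

    scores = q @ k.transpose(-2, -1) / d_head**0.5  # (b, h, s, s)
    attn = F.softmax(scores, dim=-1)
    out = attn @ v  # (b, h, s, d_head)
    return out.transpose(1, 2).reshape(b, s, n_heads * d_head)

# Example: 8 query heads of dimension 64 over a 512-dim model.
x = torch.randn(2, 16, 512)
out = multi_query_attention(
    x,
    w_q=torch.randn(512, 8 * 64),
    w_k=torch.randn(512, 64),
    w_v=torch.randn(512, 64),
    n_heads=8,
)  # (2, 16, 512)
```

At small scale the memory saved by sharing K/V matters little, which is why the benefit mostly shows up at larger scales and longer contexts.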
In contrast, algorithms like layer normalization showed up consistently across model sizes. From our review, we found that it “demonstrated speedups on the much smaller neural networks used in 2016 and is still often used in models today.”[3] So layer normalization is compute-independent.
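For reference, here is a minimal sketch of RMSNorm, the LayerNorm successor mentioned earlier (illustrative, not our training code): it drops LayerNorm’s mean subtraction and bias term and rescales by the root mean square alone.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root mean square, with a learned gain.

    Unlike LayerNorm there is no mean subtraction and no bias, which saves
    a little compute while working about as well in practice.
    """
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```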
To test our hypothesis that key algorithmic advances are compute-dependent, we implemented the above algorithmic improvements on a scaled-down version of GPT-2 (about 165M parameters instead of the full 1.5B). We measure compute-equivalent gain (CEG) by comparing how much additional compute would be needed to achieve the same cross-entropy loss on OpenWebText without the algorithmic improvement. For instance, a CEG of 2x means the algorithm lets us achieve the same loss with half the compute.
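As a toy illustration of the metric (a simplified sketch, not our exact estimation procedure), one way to estimate CEG is to fit loss-versus-compute curves for the baseline and the modified model, then take the ratio of the compute each needs to reach the same target loss.

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ≈ a * compute**(-b) via linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

def compute_to_reach(a, b, target_loss):
    """Invert loss = a * C**(-b): compute C needed to reach target_loss."""
    return (a / target_loss) ** (1.0 / b)

def compute_equivalent_gain(baseline_runs, improved_runs, target_loss):
    """CEG = baseline compute / improved-model compute at the same loss."""
    a0, b0 = fit_power_law(*baseline_runs)
    a1, b1 = fit_power_law(*improved_runs)
    return compute_to_reach(a0, b0, target_loss) / compute_to_reach(a1, b1, target_loss)

# Hypothetical numbers: training compute (arbitrary units) and cross-entropy losses
# for two small sweeps, where the improved model matches the baseline at half the compute.
baseline = ([1.0, 2.0, 4.0, 8.0], [3.60, 3.42, 3.25, 3.09])
improved = ([1.0, 2.0, 4.0, 8.0], [3.42, 3.25, 3.09, 2.94])
print(compute_equivalent_gain(baseline, improved, target_loss=3.2))  # ≈ 2, i.e. ~2x CEG
```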
Here are our estimates of the compute-equivalent gain for each of the following algorithmic advances:
To put these numbers in context, Dario Amodei has noted that “every frontier AI company regularly discovers many of these compute-equivalent multipliers: frequently small ones (~1.2x), sometimes medium-sized ones (~2x), and every once in a while very large ones (~10x).”[4]
When we combined the compute-independent improvements (LayerNorm + RoPE + FlashAttention), we achieved an experimental CEG of about 3.5x, even on our small-scale model.
This is far smaller than the documented returns from compute-dependent advances, which can deliver 10-50x improvements or more — but only for those with the compute budget to unlock them.
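For concreteness, here is a minimal sketch of rotary position embeddings (RoPE), one of the compute-independent advances in that combination (illustrative only): each pair of query/key dimensions is rotated by a position-dependent angle, so relative position is encoded directly in the attention dot product rather than added as a learned vector.

```python
import torch

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (batch, seq, n_heads, d_head).

    Each consecutive dimension pair (2i, 2i+1) is rotated by
    angle = position * base**(-2i / d_head).
    """
    b, s, h, d = x.shape
    assert d % 2 == 0, "head dimension must be even"
    pos = torch.arange(s, dtype=torch.float32)                                   # (s,)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))  # (d/2,)
    angles = pos[:, None] * inv_freq[None, :]                                    # (s, d/2)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]

    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

In practice this is applied to the query and key tensors just before the attention scores are computed.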
It’s also important to note that, while we’ve so far been treating compute-dependent and compute-independent as a clean binary, the reality is messier: the degree of compute-dependence varies.
Compare the modest gains from multi-query attention (1x for training) with the field-changing impact of the transformer architecture itself (10-50x at large scale). Even within compute-dependent innovations, some show near-linear scaling while others exhibit strongly super-linear improvements with scale, and at low compute the compute-dependent improvements showed minimal benefit or actually hurt performance. This supports the hypothesis that some algorithmic advances only really shine when you scale them up.
Can Capabilities Advance With Frozen Compute? DeepSeek-V3
One real-world example that illustrates this dynamic is DeepSeek-V3, a recent frontier model from China. Despite export controls that limited DeepSeek’s access to high-end chips, the final training run for the 671B-parameter DeepSeek-V3 took just 2.788M H800 GPU hours. For comparison, the final training run for LLaMA 3.1-405B-Instruct[5] used about 30.84M hours on more powerful H100 GPUs.
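A quick back-of-the-envelope on those figures (taking the reported GPU-hours at face value and setting aside per-chip differences):

```python
deepseek_v3_h800_hours = 2.788e6     # reported final-run GPU hours (H800)
llama_31_405b_h100_hours = 30.84e6   # reported final-run GPU hours (H100)

# LLaMA 3.1-405B-Instruct used roughly 11x the GPU-hours, on faster chips.
print(llama_31_405b_h100_hours / deepseek_v3_h800_hours)  # ≈ 11.1
```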
How did DeepSeek achieve such strong performance while being so compute-constrained?
Algorithmic improvements, including multi-head latent attention, a mixture-of-experts architecture, and mixed-precision training. This suggests that even with hardware limitations, algorithmic improvements can still drive meaningful advances. Tellingly, the most impactful innovations DeepSeek used were compute-dependent — they work best at scale.
What This Means for AI Progress
Our findings paint a nuanced picture of algorithmic progress. While both compute-independent and compute-dependent innovations drive capabilities forward, they play distinct roles with significant implications for the future trajectory of AI.
Our literature review and analysis reinforce that the most transformative algorithmic shifts, like the transformer architecture itself, have been compute-dependent. These innovations often yield minimal or no benefit at smaller scales but unlock order-of-magnitude improvements (10-50x CEG or more) when sufficient compute is available, accounting for the vast majority — perhaps ~99% — of the cumulative CEG impact historically. Hardware and algorithms are complements: frontier compute doesn’t just enable larger models, it unlocks new algorithmic paradigms.
Simultaneously, algorithms like Layer Normalization, RoPE, and FlashAttention provide tangible benefits across scales. Our experiments showed a combined 3.5x CEG from these even on our small model. Meaningful progress can occur even with fixed hardware. Such advances lower the bar for achieving a certain capability level and can be discovered and validated with less compute, potentially accelerating progress in resource-constrained environments.
But they remain small peanuts compared to the capability gains from compute-dependent algorithms, whose dominance suggests that access to large-scale compute confers a significant research advantage. It lets organizations test and validate the very algorithms that provide the largest gains — algorithms that might look unpromising or even detrimental at smaller scales.
Tom Davidson challenges this,[6] arguing that researchers might extrapolate from smaller experiments and noting that algorithmic progress hasn’t slowed despite fewer truly frontier-scale runs globally. This is a fair point, but I suspect it overlooks how much the definition of “small” has shifted for leading labs. Experiments on dozens or hundreds of H100s, while “small” relative to a full GPT-4 training run, represent vast compute resources compared to just a few years ago or what independent researchers can access. These “mid-scale” experiments likely provide enough signal — guided, as Davidson notes, by researcher taste and theoretical understanding — to identify promising compute-dependent innovations without needing constant frontier runs. Compute-rich labs maintain an edge not just through scale, but by efficiently exploring the algorithmic space at scales large enough to reveal compute-dependent effects.
The path to recursive self-improvement, then, seems likely to depend more on how easily AI systems can discover the compute-dependent advances that constitute the heavy tail — note that nearly all the compute-dependent advances we could find were the transformer and some of its derivatives.[7] Compute-dependent advances are hard to find and non-trivial to verify, but give the biggest returns on performance. The pace of takeoff, then, depends on the relative difficulty and infrequency of discovering them.
If you take one thing from our work, then, let it be that shorter timelines depend not only on algorithmic advances in the abstract being easier to find, but on compute-dependent algorithmic advances being easier to find.
Hardware controls (like export restrictions) will slow progress by hindering the discovery and validation of major compute-dependent breakthroughs, like those DeepSeek-V3 relied on. They likely won’t halt progress entirely, though, as compute-independent improvements will continue to accrue. Furthermore, the existence of compute-independent paths makes defining compute thresholds for regulation increasingly difficult, since significant capability gains might occur below anticipated FLOP levels. The rise of test-time advances only complicates this.
Limitations
Before closing, we should emphasize limitations. Our experiments used a relatively small model (165M parameters), and classifications relied on combining these results with literature analysis rather than multi-scale empirical tests. We measured CEG via pretraining loss, not downstream capabilities.
We didn’t analyze inference-time techniques, like chain-of-thought, best-of-k sampling, retrieval augmentation, etc., which likely have unique scaling properties. Epoch’s analysis offers some CEG estimates here, though potentially dated.
By focusing on pre-training, we also excluded post-training/RL techniques like fine-tuning, RLHF, DPO, etc. These often use smaller models but significant experiment counts, which means their compute dynamics will differ substantially from pre-training.
We treated data as fixed. The interaction between algorithmic improvements and dataset quality/composition is vital, and papers like “Scaling Data-Constrained Language Models” may offer data to study this.
Future work should investigate these areas, test classifications across domains (vision, biology), and ideally, replicate findings at larger scales.
Conclusion
Algorithmic progress interacts significantly with compute scale. While compute-independent advances offer a path for progress even with hardware constraints, the largest historical gains have come from compute-dependent innovations requiring substantial scale to develop and validate. Understanding this distinction is crucial for designing effective governance and forecasting AI progress — based on our experiments, we expect that the rate of progress will depend on how easily labs can find and validate compute-dependent algorithmic advances.
- ^
That is, we assume that e.g. LLaMA 3.2 is pretrained on ~the same data mix as LLaMA 3.1 — or at least the difference between the two’s data is negligible when you compare them to, say, BERT’s pretraining data.
- ^
p7.
- ^
p9.
- ^
Thank you to David Owen for flagging this. Also, Amodei cites Epoch work in the next sentence: “Because the value of having a more intelligent system is so high, this shifting of the curve typically causes companies to spend more, not less, on training models: the gains in cost efficiency end up entirely devoted to training smarter models, limited only by the company’s financial resources.”
- ^
Which, in case it isn’t clear, has 405B non-embedding parameters.
- ^
Thanks to Anson Ho for bringing this to my attention!
- ^
For further discussion on the tradeoffs between using compute for experiments, using compute for training, and using compute to run inference, see Jack Wiseman and Duncan Clements’s piece in Inference Magazine here.
The key question is whether you can find improvements which work at large scale using mostly small experiments, not whether the improvements work just as well at small scale.
The 3 largest algorithmic advances discussed here (Transformer, MoE, and MQA) were all originally found at tiny scale (~1 hr on an H100 or ~1e19 FLOP[1] which is ~7 orders of magnitude smaller than current frontier training runs).[2]
This paper looks at how improvements vary with scale, and finds the best improvements have returns which increase with scale. But, we care about predictability given careful analysis and scaling laws which aren’t really examined.
This is false: the largest 3 advances they identify were all first developed at tiny scale.
To be clear, the exact versions of these advances used in modern AIs are likely based on higher compute experiments. But, the returns from these more modern adaptations are unclear (and plausibly these adaptations could be found with small experiments using careful scaling analysis).
Separately, as far as I can tell, the experimental results in the paper shed no light on whether gains are compute-dependent (let alone predictable from small scale). Of the advances they experimentally test, only one (MQA) is identified as compute dependent.
They find that MQA doesn’t improve loss (at small scale). But, this isn’t how MQA is supposed to help, it is supposed to improve inference efficiency which they don’t test! So, these results only confirm that a bunch of innovations (RoPE, FA, LN) are in fact compute independent.
Ok, so does MQA improve inference at small scale?
The paper says:
“At the time of its introduction in 2019, MQA was tested primarily on small models where memory constraints were not a major concern. As a result, its benefits were not immediately apparent.”
Memory constraints not being a major concern at small scale doesn’t mean it didn’t help then (at the time, I think people didn’t care as much about inference efficiency, especially decoder inference efficiency).
Separately, the inference performance improvements at large scale are easily predictable with first principles analysis!
The post misses all of this by saying:
I think it’s actually unclear if there was minimal benefit at small scale—maybe people just didn’t care much about (decoder) inference efficiency at the time—and further, the inference efficiency gain at large scale is easily predictable as I noted!
The post says:
But, as I’ve noted, they only empirically tested MQA and those results are unclear! The transformer is well known to be a huge improvement even at very small scale. (I’m not sure about MoE.)
FAQ:
Q: Ok, but surely the fact that returns often vary with scale makes small scale experiments less useful?
A: Yes, returns varying with scale would reduce predictability (all else equal), but by how much? If returns improve in a predictable way that would be totally fine.
Careful science could (in principle) predict big gains at large scale despite minimal or negative gains at small scale.
Q: Ok, sure, but if you actually look at modern algorithmic secrets, they are probably much less predictable from small to large scale. (Of course, we don’t know that much with public knowledge.)
A: Seems quite plausible! In this case, we’re left with a quantitative question of how predictable things are, whether we can identify if something will be predictable, and if there are enough areas of progress which are predictable.
Everyone agrees compute is a key input; the question is just how far massively accelerated, much more capable, and vastly more prolific labor can push things.
This was also posted as a (poorly edited) tweet thread here.
While 1e19 FLOP is around the scale of the final runs they included in each of these papers, these advances are pretty likely to have been initially found at (slightly) smaller scale. Like maybe 5-100x lower FLOP. The larger runs were presumably helpful for verifying the improvement, though I don’t think they were clearly essential, probably you could have instead done a bunch of careful scaling analysis.
Also, it’s worth noting that Transformer, MoE, and MQA are selected for being large single advances, making them unrepresentative. Large individual advances are probably typically easier to identify, making them more likely to be found earlier (and at smaller scale). We’d also expect large single improvements to be more likely to exhibit returns over a large range of different scales. But I didn’t pick these examples, they were just the main examples used in the paper!
Although if most alg breakthroughs provide bigger benefits at higher compute scales, then that suggests that the pace of alg progress has only been sustained by moving to bigger compute scales.
Then when compute is held constant, we’ll face much steeper DMR.
Might be possible to estimate the size of this effect quantitatively by looking at how much smaller the gains are at lower compute scales, and how quickly we’ve scaled up compute.
This is an additional point on top of being bottlenecked on compute for experiments.
Yeah, good point. This does make me wonder if we’ve actually seen a steady rate of algorithmic progress or if the rate has been increasing over time.
I’m a bit confused by what’s going on with the paper claiming their empirical results support innovations being compute-dependent when they only test MQA (and IMO show unclear results in this case). It almost seems like they forgot to include results or didn’t realize they only tested MQA because (e.g.) they talk about what their empirical results generally found for compute-dependent advancements (a class of 1), and the section on Sparse Attention links the empirical results section despite Sparse Attention not actually appearing in the empirical results section as far as I can tell!
To be clear, I agree that reducing availability of compute will substantially slow algorithmic research. So, export controls which do a good job of reducing the available amount of compute would slow algorithmic research progress.
If we have a fixed quantity (and quality) of human researchers and reduce the amount of compute by 5x at current margins, I’d expect algorithm progress would go maybe 2x slower.[1]
If AI R&D is fully automated, then I’d expect 5x less compute would make algorithmic progress go maybe 3.5x slower as both additional parallel researchers and experiments require compute.[2][3]
This is based on assuming Cobb-Douglas for the marginal returns and using something like serial_labor_speed^0.5 · compute^0.5, for 5^0.5 ≈ 2.2. If you back out the numbers from AI 2027’s survey you get a compute exponent more like 0.43, which yields 5^0.43 ≈ 2.
Naively, cutting compute reduces experiment compute and also reduces the number of parallel AI researchers (because AI R&D is fully automated with AIs), but doesn’t alter the serial speed or quality of these AI researchers. But, there is a parallelization penalty because 10x more parallel workers is less good than 10x faster workers. We’ll say the marginal penalty is an exponent of around 0.5. So, with returns like parallel_labor^(0.5·0.5) · compute^0.5, you maybe get an overall slowdown of 5^(0.5·0.5+0.5) ≈ 3.34. If you back out the numbers from AI 2027’s survey you get a compute exponent of 0.43 and an overall returns-to-parallel-labor exponent of 0.32, for 5^(0.43+0.32) ≈ 3.34. My guess is both of these are slight underestimates of the slowdown as I expect the exponent for returns to compute for experiments to be higher in the fully automated AI R&D regime and having less compute would also somewhat hit speed (and maybe quality?) in some cases. So, I rounded up to 3.5.
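A quick check of the arithmetic above (just reproducing the stated exponents, which are assumptions):

```python
# Fixed human researchers: output ∝ serial_labor_speed^0.5 * compute^0.5,
# so 5x less compute slows progress by 5^0.5.
print(5 ** 0.5)                # ≈ 2.24
print(5 ** 0.43)               # ≈ 2.00, using the AI 2027 survey-derived compute exponent

# Fully automated AI R&D: 5x less compute also means 5x fewer parallel AI researchers,
# with a ~0.5 parallelization penalty on labor.
print(5 ** (0.5 * 0.5 + 0.5))  # ≈ 3.34
print(5 ** (0.43 + 0.32))      # ≈ 3.34, using the AI 2027-derived exponents
```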
If external data is a key limiting factor, you’d expect a smaller slowdown, but I’m skeptical this will make a big difference. Also, external data would still presumably come in faster if you have more inference compute both so you can serve more AIs and have more AIs gather data.
Some compute dependent advancements are easier to extrapolate from small scale than others. For instance, I strongly suspect that small scale experiments + naively extrapolating memory usage is sufficient to discover (and be confident in) GQA. Note that the gpt-4 paper predicted the performance of gpt-4 from 1000x scaled down experiments! The gpt-4 scaling law extrapolation, and similar scaling laws work, is proof that a lot of advances can be extrapolated from much smaller compute scale.
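As an illustration of the kind of naive memory extrapolation being described (the configuration numbers below are hypothetical, not from any particular model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """Decoding-time KV cache: keys and values for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Hypothetical large-model config: 80 layers, 64 attention heads, head dim 128, 8k context.
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, d_head=128, seq_len=8192)  # every head keeps K/V
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8,  d_head=128, seq_len=8192)  # 8 shared KV heads
print(f"MHA: {mha / 2**30:.1f} GiB vs GQA: {gqa / 2**30:.1f} GiB per sequence")  # ~20.0 vs ~2.5
```

The ratio of the saving is the same at any scale; what this kind of extrapolation tells you is how binding the memory becomes as models and contexts grow.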
Yeah but aren’t false positives also a problem here?
It seems to me that the sheer dominance of the transformer architecture among compute-efficiency improvements is a horrible sign.
A large number of iterative improvements should be somewhat easy to monitor; a small probability of a sudden massive leap in capabilities is a lot harder. Especially since, a priori, I see no reason for all future compute-dependent advances to happen at close to cutting-edge levels of compute.
It seems entirely possible to me that something like a future version of AlphaTensor could stumble upon one of these hidden gems of an optimisation, verify it with a level of compute it has unmonitored access to, and quickly dominate the competition in compute overnight.
Black swan events are scary.