The key question is whether you can find improvements which work at large scale using mostly small experiments, not whether the improvements work just as well at small scale.
The 3 largest algorithmic advances discussed here (Transformer, MoE, and MQA) were all originally found at tiny scale (~1 hr on an H100, or ~1e19 FLOP,[1] which is ~7 orders of magnitude smaller than current frontier training runs).[2]
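As a quick sanity check on those numbers (the H100 throughput and the frontier-run scale below are my assumptions, not figures from the paper):

```python
import math

# Assumed numbers (mine, not the paper's): an H100 peaks around 1e15 dense
# BF16 FLOP/s, and current frontier training runs are on the order of 1e26 FLOP.
h100_flops_per_sec = 1e15
one_hour_sec = 3600

experiment_flop = h100_flops_per_sec * one_hour_sec
print(f"~1 hr on an H100: {experiment_flop:.0e} FLOP")  # ~4e18, i.e. roughly 1e19

frontier_flop = 1e26
print(f"gap vs. 1e19: {math.log10(frontier_flop / 1e19):.0f} orders of magnitude")  # 7
```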
This paper looks at how improvements vary with scale and finds that the best improvements have returns which increase with scale. But what we care about is predictability given careful analysis and scaling laws, which the paper doesn’t really examine.
> We found that, historically, the largest algorithmic advances couldn’t just be scaled up from smaller versions. They needed to have large amounts of compute to develop and validate.
This is false: the largest 3 advances they identify were all first developed at tiny scale.
To be clear, the exact versions of these advances used in modern AIs are likely based on higher compute experiments. But, the returns from these more modern adaptations are unclear (and plausibly these adaptations could be found with small experiments using careful scaling analysis).
Separately, as far as I can tell, the experimental results in the paper shed no light on whether gains are compute-dependent (let alone predictable from small scale). Of the advances they experimentally test, only one (MQA) is identified as compute dependent.
They find that MQA doesn’t improve loss (at small scale). But this isn’t how MQA is supposed to help; it is supposed to improve inference efficiency, which they don’t test! So these results only confirm that a bunch of innovations (RoPE, FlashAttention, and LayerNorm) are in fact compute-independent.
Ok, so does MQA improve inference at small scale?
The paper says:
> At the time of its introduction in 2019, MQA was tested primarily on small models where memory constraints were not a major concern. As a result, its benefits were not immediately apparent. However, as model sizes grew, memory efficiency became increasingly important, making MQA a crucial optimization in modern LLMs.
Memory constraints not being a major concern at small scale doesn’t mean MQA didn’t help then (at the time, I think people just didn’t care as much about inference efficiency, especially decoder inference efficiency).
Separately, the inference performance improvements at large scale are easily predictable with first-principles analysis!
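For example, decode is typically memory-bandwidth bound on reading the KV cache, so MQA’s gain falls out of a back-of-the-envelope calculation like the one below (the model shape is an illustrative assumption, not from the paper or the original MQA work):

```python
# Back-of-the-envelope: KV-cache traffic per decoded token, MHA vs. MQA.
# Decode is usually memory-bandwidth bound on KV-cache reads, so shrinking
# the cache by n_heads/n_kv_heads directly predicts the attention speedup.

def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # 2x for K and V

# Illustrative large-model shape (assumed): 80 layers, 64 heads of dim 128.
n_layers, n_heads, head_dim = 80, 64, 128

mha = kv_cache_bytes_per_token(n_layers, n_heads, head_dim)  # every head keeps K/V
mqa = kv_cache_bytes_per_token(n_layers, 1, head_dim)        # one shared K/V head

print(f"MHA: {mha / 2**20:.2f} MiB of KV cache per token")   # ~2.50 MiB
print(f"MQA: {mqa / 2**20:.3f} MiB of KV cache per token")   # ~0.039 MiB
print(f"KV reads shrink by {mha // mqa}x at any given batch size and context")
```

And the same arithmetic says the win grows with batch size and context length, so "bigger benefit at larger scale" was predictable in advance rather than something you needed frontier-scale runs to discover.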
The post misses all of this by saying:
> MQA, then, by providing minimal benefit at small scale, but much larger benefit at larger scales—is a great example of the more-general class of a compute-dependent innovation.
I think it’s actually unclear if there was minimal benefit at small scale—maybe people just didn’t care much about (decoder) inference efficiency at the time—and further, the inference efficiency gain at large scale is easily predictable as I noted!
The post says:
> compute-dependent improvements showed minimal benefit or actually hurt performance.
But, as I’ve noted, their only empirical test of a compute-dependent advance is MQA, and those results are unclear! The transformer is well known to be a huge improvement even at very small scale. (I’m not sure about MoE.)
FAQ:
Q: Ok, but surely the fact that returns often vary with scale makes small-scale experiments less useful?
A: Yes, returns varying with scale would reduce predictability (all else equal), but by how much? If returns improve in a predictable way, that would be totally fine.

Careful science could (in principle) predict big gains at large scale despite minimal or negative gains at small scale; see the sketch after the FAQ.
Q: Ok, sure, but if you actually look at modern algorithmic secrets, they are probably much less predictable from small to large scale. (Of course, we don’t know that much with public knowledge.)
A: Seems quite plausible! In that case, we’re left with quantitative questions: how predictable things are, whether we can tell in advance which advances will be predictable, and whether there are enough areas of progress which are predictable.
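To make the "careful science" option concrete, here’s a minimal sketch of what scaling-law-based prediction could look like (the functional form and every constant below are made-up illustrative numbers, not data from the paper): fit a saturating power law per method at small compute, then extrapolate.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, a, b, e):
    # loss(C) = a * (C / 1e18)^(-b) + e, a saturating power law in compute
    return a * (c / 1e18) ** (-b) + e

compute = np.logspace(17, 19, 6)  # small-scale experiments only

# Made-up "measurements": the variant is worse at every small scale tested,
# but has a better exponent and a lower irreducible loss.
baseline = scaling_law(compute, 6.0, 0.15, 1.7)
variant = scaling_law(compute, 8.0, 0.17, 1.5)

p_base, _ = curve_fit(scaling_law, compute, baseline, p0=(1.0, 0.1, 1.0))
p_var, _ = curve_fit(scaling_law, compute, variant, p0=(1.0, 0.1, 1.0))

for c in (1e19, 1e23, 1e26):
    print(f"C={c:.0e}: baseline {scaling_law(c, *p_base):.2f}, "
          f"variant {scaling_law(c, *p_var):.2f}")
# The fitted curves cross: the variant loses at 1e19 but wins by 1e23 and 1e26.
```

Whether real innovations are this well behaved is exactly the open question, but nothing in the paper’s experiments tests it.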
Everyone agrees compute is a key input; the question is just how far massively accelerated, much more capable, and vastly more prolific labor can push things.

This was also posted as a (poorly edited) tweet thread here.

[1] While 1e19 FLOP is around the scale of the final runs they included in each of these papers, these advances were pretty likely initially found at (slightly) smaller scale, like maybe 5-100x lower FLOP. The larger runs were presumably helpful for verifying the improvement, though I don’t think they were clearly essential; probably you could have instead done a bunch of careful scaling analysis.
[2] Also, it’s worth noting that Transformer, MoE, and MQA are selected for being large single advances, making them unrepresentative. Large individual advances are probably typically easier to identify, making them more likely to be found earlier (and at smaller scale). We’d also expect large single improvements to be more likely to exhibit returns over a large range of different scales. But I didn’t pick these examples; they were just the main examples used in the paper!
Although if most algorithmic breakthroughs provide bigger benefits at higher compute scales, then that suggests that the pace of algorithmic progress has only been sustained by moving to bigger compute scales.
Then when compute is held constant, we’ll face much steeper diminishing marginal returns (DMR).
It might be possible to estimate the size of this effect quantitatively by looking at how much smaller the gains are at lower compute scales and how quickly we’ve scaled up compute.
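One toy way to set that estimate up (my formalization of the suggestion above, not something from the paper or the comment): write the effective gain from the algorithmic stock as $G(t, C)$ with scale-sensitivity $\gamma = \partial \log G / \partial \log C$. Then, by the chain rule:

$$\underbrace{\frac{d}{dt}\log G\bigl(t, C(t)\bigr)}_{\text{observed algorithmic progress}} \;=\; \underbrace{\frac{\partial \log G}{\partial t}}_{\text{progress at fixed compute}} \;+\; \gamma \cdot \underbrace{\frac{d\log C}{dt}}_{\text{compute growth rate}}$$

With frontier training compute historically growing roughly 4-5x per year, even a modest $\gamma$ would mean a substantial chunk of measured algorithmic progress disappears once compute is held fixed.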
This point is in addition to the one about being bottlenecked on compute for experiments.

Yeah, good point. This does make me wonder if we’ve actually seen a steady rate of algorithmic progress or if the rate has been increasing over time.
I’m a bit confused by what’s going on with the paper claiming their empirical results support innovations being compute-dependent when they only test MQA (and IMO show unclear results in this case). It almost seems like they forgot to include results, or didn’t realize they only tested MQA: for example, they talk about what their empirical results generally found for compute-dependent advancements (a class of one), and the section on Sparse Attention links to the empirical results section despite Sparse Attention not actually appearing in the empirical results section as far as I can tell!
To be clear, I agree that reducing availability of compute will substantially slow algorithmic research. So, export controls which do a good job of reducing the available amount of compute would slow algorithmic research progress.
If we have a fixed quantity (and quality) of human researchers and reduce the amount of compute by 5x at current margins, I’d expect algorithmic progress would go maybe 2x slower.[1]
If AI R&D is fully automated, then I’d expect 5x less compute would make algorithmic progress go maybe 3.5x slower as both additional parallel researchers and experiments require compute.[2][3]
[1] This is based on assuming Cobb-Douglas for the marginal returns and using something like serial_labor_speed^0.5 · compute^0.5, giving 5^0.5 ≈ 2.2. If you back out the numbers from AI 2027’s survey you get a compute exponent more like 0.43, which yields 5^0.43 ≈ 2.
[2] Naively, cutting compute reduces experiment compute and also reduces the number of parallel AI researchers (because AI R&D is fully automated with AIs), but doesn’t alter the serial speed or quality of these AI researchers. But there is a parallelization penalty: 10x more parallel workers is less good than 10x faster workers. We’ll say the marginal penalty is an exponent of around 0.5. So you maybe get an overall speedup of parallel_labor^(0.5 · 0.5) · compute^0.5, i.e. 5^(0.5 · 0.5 + 0.5) = 5^0.75 ≈ 3.34. If you back out the numbers from AI 2027’s survey you get a compute exponent of 0.43 and overall returns to parallel labor of 0.32, for 5^(0.43 + 0.32) ≈ 3.34. My guess is both of these are slight underestimates of the slowdown, as I expect the exponent for returns to compute for experiments to be higher in the fully automated AI R&D regime, and having less compute would also somewhat hit speed (and maybe quality?) in some cases. So, I rounded up to 3.5. (The arithmetic for both footnotes is checked in the snippet after these footnotes.)
[3] If external data is a key limiting factor, you’d expect a smaller slowdown, but I’m skeptical this will make a big difference. Also, external data would still presumably come in faster if you have more inference compute, both so you can serve more AIs and have more AIs gather data.
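For concreteness, here’s the arithmetic behind the 2x and 3.5x figures in footnotes [1] and [2] (just evaluating the exponents above):

```python
cut = 5  # compute reduced by 5x

# Human researchers fixed: progress ~ serial_labor_speed^0.5 * compute^0.5
print(cut ** 0.5)   # ~2.24 -> "maybe 2x slower"
print(cut ** 0.43)  # ~2.00 with the AI 2027 back-out exponent

# Fully automated R&D: parallel AI labor also scales with compute, with a
# ~0.5 parallelization penalty on the labor exponent.
print(cut ** (0.5 * 0.5 + 0.5))  # 5^0.75 ~ 3.34
print(cut ** (0.43 + 0.32))      # 5^0.75 ~ 3.34 -> rounded up to 3.5x
```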