There is something conceptually misleading about the usual ways of framing algorithmic progress. Imagine that in 2022 the number of apples produced on some farm increased 10x year-over-year, then in 2023 the number of oranges increased 10x, and then in 2024 the number of pears increased 10x. That doesn’t mean that the number of fruits is up 1000x in 3 years.
Price-performance of compute compounds over many years, but most algorithmic progress doesn’t: it only applies to the things that are relevant around the time the progress happens, and stops being applicable a few years later. So forecasting over multiple years in terms of effective compute, without accounting for this issue, would greatly overestimate progress. There are some pieces of algorithmic progress that do compound, and it would be useful to treat them as fundamentally different from the transient kind.
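To put that in compute terms, here is a toy calculation (all numbers made up for illustration) contrasting the naive accounting, where every year’s gain is multiplied into effective compute, with an accounting where only a small transferable piece of each year’s gain still applies at the later frontier:

```python
# Toy numbers only: each year shows a 10x gain on that year's relevant benchmark,
# but suppose only a hypothetical 1.5x slice of each year's gain still applies 3 years later.
headline_gain_per_year = 10.0
transferable_gain_per_year = 1.5
years = 3

naive_effective_compute = headline_gain_per_year ** years            # 1000x if everything compounds
compounding_effective_compute = transferable_gain_per_year ** years  # ~3.4x that actually carries over
print(naive_effective_compute, compounding_effective_compute)
```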
This is a reasonable point in principle, but I don’t know how important it is in practice. My sense is that most things identified as algorithmic improvements continue to be algorithmic improvements over the previously-done thing at higher scales? E.g. transformers beating LSTMs, Chinchilla scaling, GeLU over ReLU, probably RL to train reasoning, etc.
I think pretraining data pipeline improvements have this issue: they stop helping with larger models that want more data (or it becomes about midtraining). Similarly for the benchmark-placating better post-training data that enables ever less intelligent models to get good scores, which probably doesn’t add up to much (at least when it’s not pretraining-scale RLVR).
Things like MoE, GLU over LU, maybe DyT or Muon add up to a relatively modest compute multiplier over the original Transformer. For example, Transformer++ vs. Transformer in Figure 4 of the Mamba paper suggests a total compute multiplier of 5x, attained over the 6 years since the original Transformer (for dense models). This is emphatically not 3x-4x per year!
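For concreteness, a quick check of the annualized rate these numbers imply, assuming simple geometric compounding:

```python
# Annualized rate implied by a ~5x total multiplier over ~6 years (Figure 4 of the Mamba
# paper, Transformer++ vs. Transformer, dense models), versus a 3x-4x/year compounding rate.
total_multiplier = 5.0
years = 6.0

print(f"implied rate: {total_multiplier ** (1 / years):.2f}x/year")   # ~1.31x/year

for rate in (3.0, 4.0):
    # what 3x-4x/year would have compounded to over the same 6 years
    print(f"{rate:.0f}x/year over {years:.0f} years -> {rate ** years:.0f}x")  # 729x, 4096x
```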
Chinchilla scaling is more a matter of careful methodology around compute optimality than a specific algorithmic improvement, and even now most demonstrations of compute multipliers fail to take one of its lessons and cool down the models before measurement. This can lead to hilarious results such as Figure 11 of the OLMo 2 paper, where an apparent 2x compute multiplier vanishes to nothing after cooling (admittedly, nobody expected this to be a real compute multiplier, but in a more confusing case it could’ve been taken for one).
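For reference, a minimal sketch of how such a compute multiplier is usually extracted (a generic illustration, not the OLMo 2 or Epoch methodology): fit a power law loss(C) = a·C^(−b) to each recipe’s (compute, loss) points and ask how much less compute the new recipe needs to match the baseline’s loss. If the points are taken mid-training rather than after cooldown, the fitted curves, and hence the apparent multiplier, can shift substantially, which is the failure mode above.

```python
import numpy as np

# Generic sketch: fit loss(C) = a * C**(-b) per recipe, then ask how much less compute the
# new recipe needs to reach the baseline's loss at a reference budget. Points measured
# before cooldown can move these fits a lot, changing the apparent multiplier.

def fit_power_law(compute, loss):
    # log(loss) = log(a) - b * log(C); ordinary least squares in log-log space
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

def compute_multiplier(baseline_fit, new_fit, reference_compute):
    a0, b0 = baseline_fit
    a1, b1 = new_fit
    target_loss = a0 * reference_compute ** (-b0)    # baseline loss at the reference budget
    compute_needed = (a1 / target_loss) ** (1 / b1)  # compute the new recipe needs for that loss
    return reference_compute / compute_needed
```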
In this Epoch paper appendix https://arxiv.org/pdf/2403.05812#page=12.3 they report efficiency improvements across 1.5+ years of time:
(a) is faster than your Mamba paper example but still much slower than 3x-4x/year. (b) and (c) are at ~4x, though (c) spans not much more than a year. And these basically don’t take post-training efficiency gains into account, iiuc.
We’re not working with many data points, but these seem to provide an existence proof that gains can compound across at least 3 years.
Would love to see some updated data collection on this; I think we could get more evidence on your hypothesis.
The Mamba paper uses the relevant kind of methodology: it directly compares different algorithmic ingredients in the same setting, training on a fixed dataset and measuring perplexity (note it doesn’t try MoE, so the actual total improvement is greater). This is a way of directly comparing cumulative improvement over all that time. To impact future frontier capabilities, an algorithmic ingredient from the past needs to be both applicable to future frontier models and helpful on benchmarks relevant to those models, compared to the counterfactual where the frontier model doesn’t use that ingredient.
When an ingredient stops being applicable to the frontier model, or stops being relevant to what’s currently important about its capabilities, it’s no longer compounding towards frontier capabilities. It doesn’t matter that the same ingredient helps a contemporary non-frontier small model match a much older model with much less compute, or that it helps the frontier model do much better than an older model on a benchmark that used to matter then but doesn’t matter now.
So I’m skeptical of the Epoch paper’s overall framing and its willingness to compare everything against everything indirectly; that’s a lot of the point I’m making. You mostly can’t use methods from 2014 and frontier AI compute from 2025 to train something directly comparable to a lightweight version of a 2025 frontier model trained on less compute (but still compute optimally), compared in a way that matters in 2025. So what does it mean that there is such-and-such compute multiplier across all of this time? At least for Transformer recipes, there is a possibility of comparing them directly if training converges.
Also, if we are not even aiming to do Chinchilla optimal training runs, what are we even comparing? For older algorithmic ingredients, you still need to aim for compute optimality to extract a meaningful compute multiplier, even if at the time of those older methods people didn’t try to do that, or did it incorrectly. In terms of this comment’s framing, compute multipliers measured with good methodology for Chinchilla optimal training are a “benchmark” that’s currently relevant. So even if this benchmark wasn’t appreciated or known back then, it’s still the thing to use to estimate the cumulative impact of older algorithmic improvements in a way that’s relevant now, and so in a way that’s analogous to what would be relevant for forecasting future frontier capabilities.
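For what that “benchmark” concretely means, here is a rough sketch of the Chinchilla-style compute-optimal allocation, using the common C ≈ 6ND approximation and the rule-of-thumb ~20 tokens per parameter (the exact ratio is an assumption, not a law):

```python
# Rough sketch, assuming C ≈ 6 * N * D FLOPs and a tokens-per-parameter ratio of ~20
# at the compute-optimal point (a common rule of thumb rather than an exact constant).
def chinchilla_optimal_allocation(compute_flops, tokens_per_param=20.0):
    # From C = 6 * N * D and D = r * N:  N = sqrt(C / (6 * r)), D = r * N
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

params, tokens = chinchilla_optimal_allocation(1e24)
print(f"{params:.2e} params, {tokens:.2e} tokens")   # roughly 9e10 params, 1.8e12 tokens
```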
As another example, now that pretraining-scale RLVR might soon become important, it’s less clear that Chinchilla optimality will remain relevant going forward, and therefore that the algorithmic improvements which helped perplexity in Chinchilla optimal settings will keep contributing to future frontier capabilities. If most relevant capabilities end up being learned with RLVR “directly”, then it might become less important how well pretraining works, even if it remains necessary for bootstrapping the process. And the kinds of things that RLVR trains will likely fail to help with perplexity in any reasonable setting, so measurements of perplexity will fail to remain a relevant benchmark.