I believe open data pretty strongly contradicts your claims
System efficiency: ~2x, not ~20x
§1.2 estimates “up to 20x” from system optimizations (quantization, parallelization, FlashAttention, etc.). But Model FLOPs Utilization (MFU), the fraction of peak hardware FLOPs that large training runs actually achieve, has barely changed. Here are the most legible published training-run MFU figures (with a sketch of the standard MFU calculation after the list):
Megatron-LM 2019: 30% on a single V100
PaLM 2022: 46% on 6K TPU v4s
Llama 3.1 2024: 38-43% on 16K H100s
MegaScale 2024: 55% on 12K GPUs
Llama 4 2025: 39% BF16-equivalent on 32K H100s
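For reference, figures like the ones above are typically computed from the standard ~6·N FLOPs-per-token approximation for a dense transformer. A minimal sketch; the run parameters in the example are made up for illustration, not taken from any of the runs above:

```python
def model_flops_utilization(n_params, tokens_per_second, n_chips, peak_flops_per_chip):
    """Rough MFU estimate: achieved training FLOP/s divided by peak hardware FLOP/s.

    Uses the common ~6 * N FLOPs-per-token approximation for the forward +
    backward pass of a dense transformer (attention FLOPs ignored).
    """
    achieved_flops_per_s = 6 * n_params * tokens_per_second
    peak_flops_per_s = n_chips * peak_flops_per_chip
    return achieved_flops_per_s / peak_flops_per_s

# Illustrative, made-up run: a 400B-param dense model at 2.7M tokens/s
# on 16,384 accelerators with ~1e15 FLOP/s dense BF16 peak each.
print(model_flops_utilization(400e9, 2.7e6, 16_384, 1e15))  # ≈ 0.40, i.e. ~40% MFU
```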
Those figures span less than a 2x range over six years. Most optimization work at scale is treading water against communication overhead as clusters grow larger, not increasing training compute per chip. Also, individual optimizations don’t multiply end to end: FlashAttention’s ~3x attention speedup only applies to the attention share of the step, and many other optimizations only apply to inference, not training.
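To spell out that composition point, here’s a toy Amdahl-style calculation. The 30% attention share below is a made-up illustrative number; the real share depends heavily on model size and sequence length.

```python
def end_to_end_speedup(fraction_touched, kernel_speedup):
    """Amdahl-style composition: a kernel-level speedup only helps the
    fraction of step time that the kernel accounts for."""
    return 1.0 / ((1.0 - fraction_touched) + fraction_touched / kernel_speedup)

# If attention were 30% of training step time and FlashAttention made it 3x faster:
print(end_to_end_speedup(0.30, 3.0))  # ≈ 1.25x end-to-end, not 3x
```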
The Gundlach paper tests 6 innovations. The field has produced 40+.
§1.1 relies on Gundlach et al. (2025b), which ablated six things: SwiGLU, RoPE, cosine decay, pre-RMSNorm, pre-LayerNorm, and AdamW. The most algorithmically advanced models with published methodology are DeepSeek-V3, DeepSeek-R1, and Kimi K2; here’s a list of techniques they used that Gundlach didn’t test:
Architecture:
Fine-grained MoE: 256 routed + 1 shared expert, 8 active per token
Ultra-sparse MoE: scaling to 384 experts at fixed active params
No token-dropping in MoE (enabled by better balancing)
Multi-Head Latent Attention (MLA): low-rank KV compression
Sigmoid gating with top-K normalization replacing softmax
Auxiliary-loss-free load balancing via learned bias terms (see the router sketch after this list)
YaRN context extension (4K→128K in 2000 fine-tuning steps)
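To make the routing items above concrete, here’s a rough sketch of my reading of DeepSeek-V3-style routing: sigmoid affinity scores, a per-expert bias that only affects which experts are selected (the auxiliary-loss-free load balancing), and gate weights renormalized over the selected experts. Names and the bias update step size are placeholders, simplified from the paper.

```python
import torch

def route_tokens(x, w_gate, expert_bias, top_k=8):
    """Sketch of sigmoid gating with top-K normalization and bias-based,
    auxiliary-loss-free load balancing (simplified)."""
    scores = torch.sigmoid(x @ w_gate)                       # [tokens, n_experts]
    biased = scores + expert_bias                            # bias steers selection only
    topk_idx = biased.topk(top_k, dim=-1).indices            # which experts each token uses
    topk_scores = scores.gather(-1, topk_idx)                # gate from the unbiased scores
    gates = topk_scores / topk_scores.sum(-1, keepdim=True)  # top-K normalization
    return topk_idx, gates

def update_expert_bias(expert_bias, tokens_per_expert, step=1e-3):
    """Run outside backprop each step: push bias down for overloaded experts
    and up for underloaded ones, instead of adding an auxiliary balance loss."""
    load = tokens_per_expert.float()
    return expert_bias - step * torch.sign(load - load.mean())
```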
Training methodology:
Multi-token prediction with sequential causal chain
MuonClip optimizer: Muon + QK-Clip for stable training (zero loss spikes over 15.5T tokens)
Fill-in-Middle pretraining (10% of data in PSM order; see the sketch after this list)
LLM-based data rephrasing for knowledge and math (outperforms naive multi-epoch)
Multi-phase learning rate schedule with annealing stages
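For the Fill-in-Middle item above, here’s a minimal sketch of the standard PSM (prefix-suffix-middle) transform as I understand it; the sentinel token names are placeholders, not the models’ actual special tokens.

```python
import random

def fim_transform(tokens, prefix_tok, suffix_tok, middle_tok, p_fim=0.10):
    """With probability p_fim, split a document at two random points and emit it
    as prefix -> suffix -> middle (PSM order), so ordinary next-token prediction
    teaches the model to infill. Other documents stay in normal order."""
    if len(tokens) < 2 or random.random() > p_fim:
        return tokens
    i, j = sorted(random.sample(range(len(tokens)), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [prefix_tok] + prefix + [suffix_tok] + suffix + [middle_tok] + middle
```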
Post-training:
GRPO: RL without a critic model (see the advantage sketch after this list)
Four-stage pipeline: cold-start SFT → reasoning RL → rejection sampling SFT → all-scenario RL
Rule-based rewards avoiding neural reward model hacking
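The key critic-free piece of GRPO is the advantage computation: sample a group of completions per prompt, score them, and normalize each reward against the group’s mean and standard deviation, so group statistics stand in for a learned value model. A minimal sketch (the clipping/KL parts of the full objective are omitted):

```python
import torch

def grpo_advantages(group_rewards):
    """Per-completion advantages from within-group reward normalization."""
    r = torch.as_tensor(group_rewards, dtype=torch.float32)  # [group_size]
    return (r - r.mean()) / (r.std() + 1e-6)

# e.g. 8 completions for one prompt, scored by a rule-based reward:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
```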
Of the techniques above, Muon alone is a ~2x compute multiplier, which is on par with the gain of AdamW over SGD.
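For context on what Muon actually does, here’s a simplified sketch based on my reading of the public reference implementation (not DeepSeek’s or Moonshot’s production code, and without the QK-Clip part of MuonClip): keep SGD-style momentum, but approximately orthogonalize each 2D weight update with a Newton-Schulz iteration instead of rescaling per coordinate the way AdamW does.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    """Quintic Newton-Schulz iteration pushing a matrix toward (semi-)orthogonality;
    coefficients follow the public Muon reference implementation as I understand it."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(param, grad, momentum_buf, beta=0.95, lr=0.02):
    """One simplified Muon step on a 2D weight matrix (no weight decay,
    no Nesterov correction, no QK-Clip)."""
    momentum_buf.mul_(beta).add_(grad)
    param.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
    return param, momentum_buf
```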
HellaSwag: an example of a 10x+ compute multiplier, without distillation or synthetic data
Three model families, each trained with a fixed recipe across multiple sizes: Gopher (2021), Cerebras-GPT (2023, Chinchilla-optimal), and Pythia (2023). None of them used distillation, data targeted at the benchmarks, or data contractors. Gopher reaches the same HellaSwag score as Cerebras-GPT or Pythia with roughly 10x less compute, despite using Kaplan-style parameter scaling. There is pretty much no legible algorithm in the Gopher paper that obviously accounts for the difference.
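To be concrete about how a “10x compute multiplier” is read off a plot like this: pick a target HellaSwag score, interpolate each family’s compute-vs-score curve to find the compute it needs to reach that score, and take the ratio. A sketch with placeholder curves (the arrays below are illustrative, not the actual benchmark results):

```python
import numpy as np

def compute_multiplier(compute_a, score_a, compute_b, score_b, target_score):
    """How much less training compute family A needs than family B
    to reach target_score; interpolated in log10(compute) space."""
    log_need_a = np.interp(target_score, score_a, np.log10(compute_a))
    log_need_b = np.interp(target_score, score_b, np.log10(compute_b))
    return 10 ** (log_need_b - log_need_a)

# Placeholder (FLOPs, HellaSwag accuracy) curves -- illustrative only:
gopher_c, gopher_s = [1e21, 1e22, 1e23], [0.45, 0.60, 0.72]
pythia_c, pythia_s = [1e21, 1e22, 1e23], [0.33, 0.45, 0.60]
print(compute_multiplier(gopher_c, gopher_s, pythia_c, pythia_s, target_score=0.60))  # 10x in this toy example
```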
There’s a lot more evidence, and many models such as Qwen or R1 achieved significantly higher compute multipliers, but as you note, some fraction of that comes from distillation or data labelling.
Note: Claude 4.6 wrote most of this, with a huge number of revisions and fixes from me; it wasn’t really an uplift compared to only using Claude to make the list and the plot.
I just want to make it clear that both our paper and Epoch’s paper address innovations that occurred from 2012 to 2023 (and only the first half of 2023). We are aware of MLA, the Muon optimizer, long-context unlocks, and RL, and think they are important contributors. However, all of these innovations are explicitly outside the scope of our current paper, which seeks to account for Epoch’s estimates over that time period.
Thanks for your feedback; I incorporated some of it in my rewrite (it’s now version 2). In particular, I appreciate the data showing FLOP utilization staying (roughly) constant, and the idea that there’s a red-queen race against communication overhead etc. And I added some of those examples from DeepSeek & Kimi in the appropriate sections. Thanks!
…But I do want to push back on your suggestion that your HellaSwag plot implies what you think it implies.
The hypothesis that Gopher is better than the other two mainly because of better training data seems totally viable to me. For example, Gopher trained on 20× more books, presumably due to Google’s mountain of proprietary scanned book data. (Gopher trained on MassiveText, which has 4M books adding up to 2.1TB of text at a 27% sampling proportion, while the other two used The Pile, which has 100.96 GiB of book text.) Books are probably very important, but the datasets differ in other ways too. MassiveText has 2.7TB of news at a 10% sampling proportion, presumably from decades of Google News, whereas The Pile seems to have only whatever news articles showed up in the general web scrape. Etc.
The hypothesis that Gopher is better mainly because DeepMind has more secret sauce or better-tuned hyperparameters or whatever also seems totally viable, as far as I know.
So I don’t think this is very strong evidence either way, and indeed if anything I would suggest that it’s pushing a bit in the direction of data over algorithms, especially given that Gopher was earlier. Right? Sorry if I’m misunderstanding.
Thanks!! Quick question while I think over the rest:
What data are you plotting? Where exactly did you get it (i.e., what references)?
And why is the 2021 one better than the 2023 ones? Normally we would expect the other way around, right? Does DeepMind have so much secret sauce that it’s worth more than 2 years of public knowledge? Or are the other two groups making rookie mistakes? Or am I misunderstanding the plot?
[Plot: HellaSwag score vs. training compute for Gopher, Cerebras-GPT, and Pythia]
Why is Gopher better than Pythia or Cerebras? Mostly no comment, but I don’t think Pythia or Cerebras were making any super simple, obvious mistake; they were just behind 2021-era DeepMind.