I believe open data pretty strongly contradicts your claims
System efficiency: ~2x, not ~20x
§1.2 estimates “up to 20x” from system optimizations (quantization, parallelization, FlashAttention, etc.). But Model FLOPs Utilization (MFU), the fraction of peak hardware FLOPs that large training runs actually achieve, has barely changed. Here are the most legible published MFU figures for large training runs:
Megatron-LM 2019: 30% on a single V100
PaLM 2022: 46% on 6K TPU v4s
Llama 3.1 2024: 38-43% on 16K H100s
MegaScale 2024: 55% on 12K GPUs
Llama 4 2025: 39% BF16-equivalent on 32K H100s
That’s less than 2x variation over 6 years. Most optimization work at scale is treading water against communication overhead as clusters grow larger, not increasing the useful training compute delivered per chip. Also, individual optimizations like FlashAttention’s ~3x attention speedup only accelerate one component of the training step, so such gains end up closer to additive than multiplicative, and many other optimizations apply only to inference, not training.
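For concreteness, MFU is the ratio of achieved training FLOP/s to the cluster's theoretical peak. Here is a minimal sketch of that calculation, assuming the standard ~6·N FLOPs-per-token approximation for a dense transformer; the numbers in the example call are hypothetical, just in the same ballpark as the runs above.

```python
def model_flops_utilization(
    n_params: float,             # trainable parameters (dense-equivalent)
    tokens_per_second: float,    # observed cluster-wide training throughput
    n_chips: int,                # accelerators in the run
    peak_flops_per_chip: float,  # advertised peak, e.g. ~989e12 for H100 BF16
) -> float:
    """MFU = achieved training FLOP/s divided by theoretical peak FLOP/s.

    Uses the common ~6 * N FLOPs-per-token estimate for a dense transformer
    (forward + backward). MoE and activation recomputation change the
    numerator, which is why labs report slightly different
    'BF16-equivalent' or mask-adjusted figures.
    """
    achieved = 6 * n_params * tokens_per_second
    peak = n_chips * peak_flops_per_chip
    return achieved / peak

# Hypothetical example (not any of the runs above): a 400B-parameter dense
# model training at 2.6M tokens/s on 16,000 H100s.
print(model_flops_utilization(400e9, 2.6e6, 16_000, 989e12))  # ≈ 0.39
```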
The Gundlach paper tests 6 innovations. The field has produced 40+.
§1.1 relies on Gundlach et al. (2025b), which ablated six things: SwiGLU, RoPE, cosine decay, pre-RMSNorm, pre-LayerNorm, and AdamW. The most algorithmically advanced models with published methodology are DeepSeek-V3, DeepSeek-R1, and Kimi K2; here’s a list of algorithms they used that Gundlach didn’t test:
Architecture:
Fine-grained MoE: 256 routed + 1 shared expert, 8 active per token
Ultra-sparse MoE: scaling to 384 experts at fixed active params
No token-dropping in MoE (enabled by better balancing)
Multi-Head Latent Attention (MLA): low-rank KV compression
Sigmoid gating with top-K normalization replacing softmax
Auxiliary-loss-free load balancing via learned bias terms (sketched together with sigmoid gating just after this list)
YaRN context extension (4K→128K in 2000 fine-tuning steps)
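To make the sigmoid-gating and load-balancing items concrete, here is a minimal sketch in the spirit of DeepSeek-V3's router: sigmoid affinity scores, top-K selection using a per-expert bias that is nudged outside of backprop to balance load, and normalization over the selected gates. The class name, the sign-based bias update, and the hyperparameters are my simplifications, not the published implementation.

```python
import torch

class SigmoidTopKRouter(torch.nn.Module):
    """Simplified MoE router: sigmoid gating, top-K normalization, and
    auxiliary-loss-free load balancing via a per-expert bias.

    The bias only affects which experts get selected; the gate values
    (and hence gradients) come from the unbiased sigmoid scores. The bias
    is nudged up for underloaded experts and down for overloaded ones,
    outside of backprop, instead of adding an auxiliary balancing loss.
    """

    def __init__(self, d_model: int, n_experts: int, top_k: int, bias_step: float = 1e-3):
        super().__init__()
        self.w_gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k
        self.bias_step = bias_step
        # balancing bias is a buffer, not a gradient-trained parameter
        self.register_buffer("expert_bias", torch.zeros(n_experts))

    def forward(self, x: torch.Tensor):
        # x: [tokens, d_model]
        scores = torch.sigmoid(self.w_gate(x))                      # [tokens, n_experts]
        # selection uses biased scores; gate values use unbiased ones
        _, expert_idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, expert_idx)
        gates = gates / gates.sum(dim=-1, keepdim=True)             # top-K normalization

        if self.training:
            with torch.no_grad():
                # aux-loss-free balancing: push the bias toward uniform expert load
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, expert_idx.flatten(),
                                  torch.ones(expert_idx.numel(), device=x.device))
                target = expert_idx.numel() / load.numel()
                self.expert_bias -= self.bias_step * torch.sign(load - target)

        return expert_idx, gates  # which experts each token goes to, and their weights
```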
Training methodology:
Multi-token prediction with sequential causal chain
MuonClip optimizer: Muon + QK-Clip for stable training (zero loss spikes over 15.5T tokens)
Fill-in-Middle pretraining (10% of data in PSM order; see the sketch after this list)
LLM-based data rephrasing for knowledge and math (outperforms naive multi-epoch)
Multi-phase learning rate schedule with annealing stages
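The Fill-in-Middle item is easy to show concretely. Here is a rough sketch of the standard PSM (prefix-suffix-middle) document transform; the sentinel strings, the per-document 10% probability, and the character-level cut points are simplifying assumptions, not how any particular model implements it.

```python
import random

# Placeholder sentinels; real tokenizers reserve dedicated special tokens for these.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def maybe_fim_transform(doc: str, fim_rate: float = 0.10, rng=random) -> str:
    """With probability `fim_rate`, rewrite a document in PSM order:
    the prefix and suffix appear first, and the model learns to generate
    the middle span last. Otherwise return the document unchanged."""
    if rng.random() >= fim_rate or len(doc) < 3:
        return doc
    i, j = sorted(rng.sample(range(1, len(doc)), 2))   # two distinct cut points
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```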
Post-training:
GRPO: RL without a critic model (see the sketch after this list)
Four-stage pipeline: cold-start SFT → reasoning RL → rejection sampling SFT → all-scenario RL
Rule-based rewards avoiding neural reward model hacking
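On GRPO: the piece that removes the critic is group-relative advantage estimation. Sample several completions per prompt, score them (e.g. with the rule-based rewards above), and use each completion's within-group z-score as its advantage. A bare-bones sketch of just that step, with the PPO-style clipped policy loss around it omitted:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward against
    the other completions sampled for the same prompt. No value network needed.

    rewards: [n_prompts, group_size] scalar reward per sampled completion.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 sampled completions each, 0/1 rule-based correctness rewards
print(grpo_advantages(torch.tensor([[1., 0., 0., 1.],
                                    [0., 0., 0., 1.]])))
```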
Of these, Muon alone is worth roughly a 2x compute multiplier, which is on par with the gap between AdamW and SGD.
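To unpack what Muon actually does: keep SGD-style momentum for each 2-D weight matrix, then approximately orthogonalize that momentum with a few Newton-Schulz iterations before applying it, so the update has a flat singular-value spectrum. A stripped-down sketch; the quintic coefficients are the ones from the public reference implementation, but the shape-dependent scaling, weight decay, and the QK-Clip part of MuonClip are all omitted.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a matrix to the nearest (semi-)orthogonal matrix
    using a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)            # iteration needs norm <= 1 to converge
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a single 2-D weight matrix (in place)."""
    momentum_buf.mul_(beta).add_(grad)                     # plain SGD momentum
    update = newton_schulz_orthogonalize(momentum_buf)     # orthogonalized direction
    weight.add_(update, alpha=-lr)
```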
HellaSwag: an example of a 10x+ compute multiplier, without distillation or synthetic data
Three model families were each trained with a fixed recipe across multiple sizes: Gopher (2021), Cerebras-GPT (2023, Chinchilla-optimal), and Pythia (2023). None of them used distillation, data targeted at specific benchmarks, or data contractors. Gopher reaches the same HellaSwag score as Cerebras-GPT or Pythia with roughly 10x less compute, despite using Kaplan-style parameter scaling. There is pretty much no legible algorithm in the Gopher paper that obviously accounts for the difference.
There’s lots more evidence, and many models such as Qwen or R1 got significantly higher compute multipliers, but as you note, some fraction of that is from distillation or data labelling.
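(To be explicit about what "roughly 10x less compute" means operationally: fix a HellaSwag score, interpolate each family's (training compute, score) points in log-compute, and take the ratio of the two compute values. A small sketch of that comparison; the function and variable names are mine, and the data points would be whatever you read off the papers or the plot below, I'm not asserting specific values here.)

```python
import numpy as np

def compute_to_reach(score_target, flops, scores):
    """Interpolate, in log-compute, the training FLOPs a model family needs
    to reach `score_target`, given per-model (FLOPs, score) points sorted by
    increasing score."""
    return float(np.exp(np.interp(score_target, scores, np.log(flops))))

def compute_multiplier(score_target, family_a, family_b):
    """How many times less training compute family_a needs than family_b to
    hit the same benchmark score. Each family is a (flops_array, scores_array) pair."""
    return (compute_to_reach(score_target, *family_b)
            / compute_to_reach(score_target, *family_a))

# Usage: fill in (FLOPs, HellaSwag accuracy) points for Gopher and Pythia,
# then e.g. compute_multiplier(0.6, gopher_points, pythia_points)
# comes out around 10 on the curves plotted below.
```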
Note: Claude 4.6 wrote most of this, with a huge number of revisions and fixes from me; it wasn’t really an uplift compared to only using Claude to make the list and the plot.
[Plot: HellaSwag score vs. training compute for Gopher, Cerebras-GPT, and Pythia]
Why is Gopher better than Pythia or Cerebras-GPT? Mostly no comment, but I don’t think Pythia or Cerebras-GPT were making any single obvious mistake; they were just behind 2021-era DeepMind.