The nature of LLM algorithmic progress (v2)

Steven Byrnes, 5 Feb 2026 19:17 UTC


(Heavily revised on Feb. 9, 2026—see changelog at the bottom.)

There’s a lot of talk about “algorithmic progress” in LLMs, especially in the context of exponentially-improving algorithmic efficiency. For example:

- Epoch AI: “[training] compute required to reach a set performance threshold has halved approximately every 8 months”.
- Dario Amodei 2025: “I’d guess the number today is maybe ~4x/year”.
- Gundlach et al. 2025a “Price of Progress”: “Isolating out open models to control for competition effects and dividing by hardware price declines, we estimate that algorithmic efficiency progress is around 3× per year”.

It’s nice to see three independent sources reach almost exactly the same conclusion—halving times of 8 months, 6 months, and 7½ months respectively. Surely a sign that the conclusion is solid!
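For anyone who wants to check the conversion between "N× per year" and a halving time, here is a minimal sketch. The function names are my own, and the only inputs are the three figures quoted above.

```python
import math

def halving_time_months(multiplier_per_year: float) -> float:
    """Months for required compute to halve, given an efficiency gain of N x per year."""
    return 12 * math.log(2) / math.log(multiplier_per_year)

def multiplier_per_year(halving_months: float) -> float:
    """Yearly efficiency multiplier implied by a given halving time in months."""
    return 2 ** (12 / halving_months)

# The three quoted figures, put on a common footing.
print(f"~4x/year  -> halving every {halving_time_months(4.0):.1f} months")
print(f"~3x/year  -> halving every {halving_time_months(3.0):.1f} months")
print(f"8-month halving -> {multiplier_per_year(8.0):.2f}x/year")
```

So the Epoch figure corresponds to a bit under 3×/year, which is why the three numbers look so similar at first glance.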

…Haha, just kidding! I’ll argue that these three bullet points are hiding three totally different stories. The first two bullets are about training efficiency, and I’ll argue that both are pretty misleading (each for a different reason!). The third is about inference efficiency, which I think is right, and mostly explained by distillation of ever-better frontier models into their “mini” cousins.

Tl;dr / outline

§1 is my attempted big-picture take on what “algorithmic progress” has looked like in LLMs. I split it into five categories:
- §1.1 is stereotypical “learning algorithm efficiency improvements” related to the core learning algorithm itself (architectures, tokenizers, optimizers, etc.). I’ll argue that the idea of using a Transformer with optimized hyperparameters was an important idea, but apart from that, the field has produced probably <10× of training efficiency improvements in this category in the entire period from 2018 to today (≈30%/year). That’s something, but it’s very much less than the exponentials suggested at the top.
- §1.2 is “unlockers of long context windows”, including things like MLA and YaRN, which are probably very important, but I’m not sure how to quantify that.
- §1.3 is “optimizations”, specific to a particular setup (hardware configuration, model scale, etc.). These constitute a never-ending stream of work (because the setup keeps changing), but there’s always a ceiling—it’s not an exponential that can keep growing and growing.
- §1.4 is “data-related improvements”, including proprietary human expert data, and various types of model distillation, both of which have important effects.
- §1.5 is “algorithmic changes that are not really quantifiable as ‘efficiency’”, including RLHF, RLVR, multimodality, and so on. No question that these are important, and we shouldn’t forget that they exist, but they’re not directly related to the exponential-improvement claims at the top.
§2 is why I don’t believe either Epoch AI or Dario, in their claims of exponential training-efficiency improvements (see top).
§3 is a quick sanity-check, studying “nanochat”, which matches GPT-2 performance but costs 600× less to train.
§4 is an optional bonus section on why I care about this topic in the first place. (Not what you expect! Unlike everyone else reading this, I don’t particularly care about forecasting future LLM progress.)

Status of this post

I wrote this very quickly, on a topic where I am not remotely an expert. I’m hoping for feedback and opinions!

1. The big picture of LLM algorithmic progress, as I understand it right now

1.1 Stereotypical learning algorithm improvements: there’s the Transformer itself, plus another maybe 10× since 2018

I’m defining this category as changes related to the core learning algorithm itself—neural architecture changes, SGD vs AdamW, etc.—apart from a couple categories that get their own sections below.

Here’s my current impression (any uncited claims probably come from @Hans Gundlach et al. 2025b “On the origin of algorithmic progress in AI”):

The replacement of LSTMs with the Transformer in 2017-2018 made a huge difference.
- …And the more that you scale up, the bigger the LSTM-vs-Transformer difference becomes.
Using the right hyperparameters etc. for the Transformer is also very important, as usual in ML. But duh, everyone knows that! Therefore, the hyperparameters etc. have always been approximately optimal. …Except that there was one mistake which was corrected only in 2022, with the switch from “Kaplan scaling” to “Chinchilla-optimal scaling”.
- Why did the Chinchilla correction take so many years to appear? Probably some combination of: (1) it actually didn’t make that big a difference until the training runs got sufficiently huge, and then the correction was in fact discovered pretty soon after it became importantly helpful; and (2) the hyperparameter sweep required to surface this error was very expensive, since it involved running a bunch of massive training runs.
Mixture-of-Experts (MoE) was like a 2× efficiency improvement.
- …But a commenter suggested that the (highly-optimized) DeepSeek-v3 MoE implementation had a number of bells and whistles that Gundlach et al. 2025b didn’t talk about. So maybe bump it up to 3×?
Gundlach et al. 2025b looked into five other things in this category—SwiGLU, pre-RMSNorm, rotary encoding, cosine decay learning rate schedule, and improvements to the tokenizer^[1]—and found that they added up to very little, like a factor of less than 1.5× total.^[2]
There seem to be more tricks that Gundlach et al. 2025b ignored.^[3] Let’s give it another 2×?
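As a rough tally of the post-Transformer, post-Chinchilla items in the list above, here is a small sketch that multiplies the guesses together and annualizes the result. The factor values are the rough guesses from the list (taking the upper ends), and the 8-year span from 2018 to 2026 is my assumption about what "2018 to today" means.

```python
import math

# Rough multipliers from the list above (upper-end guesses, not measurements).
factors = {
    "MoE, bumped up for DeepSeek-v3-style extras": 3.0,
    "SwiGLU, pre-RMSNorm, rotary, cosine decay, tokenizer": 1.5,
    "other tricks not studied by Gundlach et al. 2025b": 2.0,
}

def annualized_rate(total_multiplier: float, years: float) -> float:
    """Constant yearly growth rate that compounds to total_multiplier."""
    return total_multiplier ** (1 / years) - 1

total = math.prod(factors.values())
years = 2026 - 2018  # assumed span for "2018 to today"

print(f"combined multiplier: {total:.1f}x")
print(f"annualized: {annualized_rate(total, years):.0%} per year")
```

With these guesses the product stays a little under 10×, which annualizes to roughly 30%/year.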

1.2 Innovations that unlock long context windows

If I understand correctly, the original Transformer could be run in principle with an arbitrarily long context window. But beyond a certain (low) length, the memory demands would catastrophically reduce FLOP utilization to ≈0. Then a 2019 advance (“MQA”) made longer context windows practical, but performed worse than the original (“MHA”) Transformer (holding context window fixed). Then further advances (MQA → GQA → MLA) clawed back most of that performance degradation, while keeping the low memory footprint.

Also in the “unlocking long context windows” category are things like YaRN, a trick to get good long-context performance out of mostly-short-context training data, by basically extrapolating a short-context trained attention layer into a good initialization for a longer-context trained attention layer. And then you need much less actual training, because the initial state is already so good.

Anyway, this category of innovations seems very important. Exactly how important? I don’t know! I can’t immediately find a reference that quantifies it in a legible-to-me way.

1.3 “Optimizations”: Lots of work, but there’s always a ceiling

This category is stuff that wouldn’t show up in a FLOP metric, but it’s equally important for cost. It includes everything specific to a particular hardware configuration and training setup: details about quantization, parallelization, FlashAttention, other CUDA black magic, and so on. I’ll also throw system-level optimizations like speculative decoding into this category.

My impression is: it takes a lot of work to keep FLOP utilization high as the chips grow ever faster and the parallelization grows ever more aggressive. (Per the Red Queen: “It takes all the running you can do, to keep in the same place.”) So if we’re trying to compute how “algorithmic efficiency” changes over time, I think we wouldn’t see much effect from this category. From the outside, it would look like: some company’s FLOP utilization (actual FLOP/s achieved compared to “peak FLOP/s” according to the GPU spec sheet) was X% last year, and it’s still around X% today, but they’re training bigger models. From that outside perspective, we would summarize the situation by saying that the company’s “algorithmic progress” is zero while their “compute growth” is high. But that summary would be hiding a lot of “optimization”-type innovation under the hood.

Alternatively, instead of comparing year-on-year as configurations change, we might hold fixed a hardware configuration and training setup, and ask how efficiency might change over time due to these kinds of “optimizations”. I think this multiplier could be quite big (say, 20×) if the baseline is a sufficiently early slapdash implementation. I think the multiplier would be smaller (but still substantial—say, 3×) even many months later, after some of the low-hanging fruit is gone, but a long tail of smaller “optimizations” still remains. For example, the original FlashAttention alone apparently sped up certain training setups by 3× wall-clock time (but other setups by less). Much later on, as a random example, the “nanoGPT speedrun” got a 9% wall-clock speed boost in 2024 by simply incrementing the PyTorch version.

What I don’t believe is that “optimizations” can contribute to an ever-growing exponential, where it’s 10× after two years, 100× after four years, 1000× after six years, etc. These kinds of optimizations have a ceiling, where you’re doing the best you can with the training approach and hardware configuration you’ve got. GPU utilization cannot exceed 100%. Quantization can’t go below 1 bit. Etc.
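To make the ceiling argument concrete, here is a toy calculation. The starting utilization (40%) and starting precision (16 bits) are hypothetical numbers I picked for illustration, and treating cost as proportional to bit width is a crude simplification.

```python
import math

def exponential_multiplier(years: float, factor_per_two_years: float = 10.0) -> float:
    """Multiplier after `years` if gains compound at 10x every two years."""
    return factor_per_two_years ** (years / 2)

# Hypothetical starting point (assumptions, not figures from this post).
start_utilization = 0.40  # fraction of peak FLOP/s actually achieved
start_bits = 16           # bits per weight before any quantization work

# Ceilings: utilization cannot exceed 100%, quantization cannot go below 1 bit.
utilization_ceiling = 1.0 / start_utilization
quantization_ceiling = start_bits / 1  # crude: assumes cost scales with bit width
combined_ceiling = utilization_ceiling * quantization_ceiling

# Years until the hypothetical exponential would need more than the ceiling allows.
years_to_exceed = 2 * math.log10(combined_ceiling)

print(f"combined ceiling: {combined_ceiling:.0f}x")
print(f"a 10x-per-two-years trend passes that after {years_to_exceed:.1f} years")
print(f"the same trend after six years: {exponential_multiplier(6):.0f}x")
```

Under these made-up numbers, even maxing out both knobs gets you a one-time gain that a sustained exponential would blow past within a few years.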

1.4 Data-related improvements

As discussed in “Most Algorithmic Progress is Data Progress” by Beren Millidge (@beren), a lot of LLM improvement has come from more and/or better training data,^[4] including:

Paying human experts to create high-quality proprietary training data;
Leveraging AI itself to create high-quality (synthetic) training data, especially by distillation from larger better models to smaller cheaper models, and/or distillation from more “thinking” time to less.
Maybe other things like filtering out bad training data, changing the order that the training data is presented, etc.^[5]

What are the impacts of these data-related improvements? Are they relevant to those exponentials at the top? My current take is:

Better data is almost definitely increasing the performance of the best models. (Otherwise companies would not be spending billions of dollars a year on expert human data!) Note that the “performance” in question here is probably benchmarks and applications, but not perplexity.
It follows that better data should also be increasing training efficiency (i.e., decreasing the training compute required to reach any given performance level), at least for the companies that have this kind of proprietary data on hand. But I don’t know quantitatively how big an effect that is.
If we switch topics from training efficiency to inference efficiency, then the point about synthetic data suddenly becomes extremely important: I propose that model distillation is the main explanation for the Gundlach et al. 2025a claim that inference compute has been dropping 3×/year, holding quality fixed. As the biggest and best models get ever bigger and better, the tiny distilled models get better too, thus surpassing quality thresholds that previously required a bigger model.

1.5 Algorithmic changes that are not really quantifiable as “efficiency”

If we put aside the “3×/year” and related quotes at the top, and take a broader view of what LLM algorithmic progress can look like, then of course we find many more items. These include:

RLHF (and DPO, Constitutional AI, etc.);
The rise of long-duration “reasoning” at inference time (and the modifications to training & inference that make this “reasoning” possible—most famously RLVR, but there are claims that certain non-RLVR approaches work equally well (1,2));
Multi-modality;
Tools and interfaces;
Etc.

2. Explaining away the two training-efficiency exponential claims

At the top, I cited Epoch AI and Dario Amodei as claiming that algorithmic improvements constitute a rapid exponential that’s been going on for years. I don’t currently believe either of them. Here’s why.

2.1 The Epoch “8-month halving time” claim seems to be mostly a weird artifact of their methodology

(The Epoch AI claim in question appears in their blog post and paper; my response here is entirely based on Gundlach et al. 2025b.)

Some algorithmic changes matter more and more as the model scale gets bigger. Specifically, there were two such changes: the switch from LSTMs to Transformers, and Chinchilla-optimal training.

For example, let’s suppose that the Transformer is N× more efficient than LSTMs with 2018-scale LLMs, and 10N× more efficient than LSTMs with 2025-scale LLMs.

Now let’s put aside everything else, and imagine a world where we switch from LSTMs to Transformers in 2018, and then scale up the transformers from 2018 to 2025 with no additional algorithmic change at all. In the funny Epoch methodology, they would say that we got an “extra” 10× algorithmic improvement (≈40%/year!) in the 2018-2025 period, because we’re kinda milking ever more advantage from the one-time LSTM-to-Transformer switch.

OK, but … that’s super confusing! Right? By assumption, the algorithms weren’t actually getting better during that 2018-2025 period, at all!
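Here is the toy example above as a few lines of arithmetic. The value of N is arbitrary (it cancels out), and the variable names are mine; this is a sketch of the hypothetical, not of Epoch’s actual fitting procedure.

```python
def annualized_rate(total_multiplier: float, years: float) -> float:
    """Constant yearly growth rate that compounds to total_multiplier."""
    return total_multiplier ** (1 / years) - 1

N = 5.0  # arbitrary placeholder: Transformer-vs-LSTM advantage at 2018 scale

advantage_2018 = N        # same algorithm, small scale
advantage_2025 = 10 * N   # same algorithm, large scale

# The "extra" improvement attributed to 2018-2025, with zero new algorithms.
apparent_extra = advantage_2025 / advantage_2018
rate = annualized_rate(apparent_extra, 2025 - 2018)

print(f"apparent extra multiplier: {apparent_extra:.0f}x")
print(f"apparent yearly improvement: {rate:.0%}")
```

The whole apparent trend comes from evaluating one fixed algorithmic change at two different scales.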

Anyway, the important point is: the actual Epoch analysis seems to be fully compatible with the claims I made in §1.1 above.

Figure: The Gundlach et al. 2025b reanalysis of the supposed exponential efficiency improvement claimed by Epoch AI (2024): it’s mostly just the Transformer itself, and Chinchilla.

(The only puzzle here is that Gundlach et al. 2025b claims to perfectly account for the Epoch “improvement” … but the things omitted by Gundlach et al. (especially the “unlockers of long context windows” in §1.2) should account for some of the Epoch “improvement” as well, right? So I figure that Gundlach et al. must have some minor inaccuracies that account for the balance.)

2.2 The Dario “4x/year” claim is I think largely confused double-counting?

I quoted Dario Amodei 2025 at the top. Here’s a longer version of that quote:

The field is constantly coming up with ideas, large and small, that make things more effective or efficient: it could be an improvement to the architecture of the model (a tweak to the basic Transformer architecture that all of today’s models use) or simply a way of running the model more efficiently on the underlying hardware. New generations of hardware also have the same effect. What this typically does is shift the curve: if the innovation is a 2x “compute multiplier” (CM), then it allows you to get 40% on a coding task for $5M instead of $10M; or 60% for $50M instead of $100M, etc. Every frontier AI company regularly discovers many of these CM’s: frequently small ones (~1.2x), sometimes medium-sized ones (~2x), and every once in a while very large ones (~10x). … In 2020, my team published a paper suggesting that the shift in the curve due to algorithmic progress is ~1.68x/year. That has probably sped up significantly since; it also doesn’t take efficiency and hardware into account. I’d guess the number today is maybe ~4x/year. Another estimate is here.

At first, I found this quote baffling. Dario has been working primarily on Transformer-based LLMs for at least 7 years. So I guess he’s saying that we’ve improved by $4^{7}$ ≈ 16,000× just from algorithms? But, c’mon, that’s nuts! Right?

(Some commenters suggested that maybe Dario’s belief is that it’s 4×/year today, but lower in the past. My response: maybe that’s true to some small extent, but in context, I think this is sanewashing, and that Dario’s belief has to be ≳3000× in the last seven years.^[6] And I still think that’s nuts.)
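For reference, here is how the yearly rates in the quote compound over seven years, and what constant yearly rate a 3000× total would imply. The seven-year window is the one used above; the variable names are mine.

```python
years = 7

# Compounding the quoted yearly rates over the whole period.
total_at_4x = 4.0 ** years      # the "~4x/year" guess
total_at_1_68x = 1.68 ** years  # the 2020 paper's ~1.68x/year figure

# Constant yearly rate that would compound to 3000x over the same period.
implied_rate_for_3000x = 3000 ** (1 / years)

print(f"4x/year for {years} years: {total_at_4x:,.0f}x")
print(f"1.68x/year for {years} years: {total_at_1_68x:.0f}x")
print(f"3000x over {years} years needs about {implied_rate_for_3000x:.2f}x/year")
```

So even the "haggled down" 3000× figure corresponds to a bit more than 3×/year sustained over the whole period, while the 2020 paper’s rate would compound to under 40×.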

So what the heck is Dario talking about??

…But I think I got it now. So here’s my current take—the only way I can get everything to fit together and make sense:

Maybe some of Dario’s “compute multipliers” are actually part of the category of “data-related improvements” (§1.4 above) (i.e., expert human data, different types of model distillation, etc.). I mean, it doesn’t really sound that way, from what he wrote, but sure, maybe.
Some of Dario’s “compute multipliers” are the ones listed in §1.1, such as MoE. But that’s not very much, if we’re ultimately trying to explain a total multiplier of ≳3000×.
Some of Dario’s “compute multipliers” are the long context window enablers in §1.2. Might Dario attribute a multiplier as high as, say, >100× to this category? And might he be correct to do so? I’m pretty skeptical! But I guess it’s possible. Not sure how to pin it down.
The rest of Dario’s “compute multipliers” are in the category of “optimizations” (§1.3 above).
- I think there’s probably some funny double-counting happening in this category.
- Example 1: Suppose Anthropic has 60% FLOP utilization on a training run. Then they buy more chips, which creates new interconnect issues, so their utilization drops to 40%. Then they do brilliant innovative CUDA magic to get it back up to 60%. I feel like maybe Dario might count that as a 1.5× compute multiplier from algorithmic progress. If so, I mean, yes he’s describing a real thing, but my strong preference would be to put that progress instead into the “more compute” bucket, in the sense that, at the end of the day, FLOP utilization stayed fixed while compute went up. If we categorize it as algorithmic progress, then we’re likely to double-count, because we’re probably also tallying Anthropic’s compute growth and multiplying those two numbers together.
- Example 2: Suppose Anthropic upgrades to a new chip that can efficiently support some more aggressive quantization scheme. Then Anthropic staff work on the learning algorithm and CUDA etc. to make that quantization scheme actually work without much performance loss. The result is 1.3× price reduction. Maybe Dario would count that as a 1.3× compute multiplier from “algorithmic progress”. If so, again, OK sure, you can say that. But from an outsider perspective, this would show up in the “hardware” not “algorithm” category, because at the end of the day, the change was that Anthropic switched to new chips with higher peak FLOP/$ (because they support fast lower-bit FLOP). Again, if we categorize this as algorithmic efficiency, then we’ll wind up double-counting.
- The upshot is: due to ceiling effects, I think that if §1.3 stuff is accounting for an appreciable fraction of Dario’s ≳3000×, then there must be a lot of this kind of double-counting going on.
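Example 1 can be written out as a small bookkeeping exercise. The chip counts (a doubling from 1,000 to 2,000) are hypothetical numbers I added for illustration; the 60% and 40% utilization figures are the ones from the example.

```python
def effective_flops(chips: int, peak_per_chip: float, utilization: float) -> float:
    """Useful FLOP/s delivered: chip count x peak FLOP/s per chip x utilization."""
    return chips * peak_per_chip * utilization

PEAK = 1.0  # peak FLOP/s per chip, in arbitrary units

before = effective_flops(1000, PEAK, 0.60)  # original cluster
after = effective_flops(2000, PEAK, 0.60)   # bigger cluster, utilization restored

true_growth = after / before        # what actually changed end to end
hardware_growth = 2000 / 1000       # the "more compute" bucket
claimed_algo_gain = 0.60 / 0.40     # CUDA work counted as a compute multiplier

# Multiplying the two buckets overstates the real end-to-end growth.
double_counted = hardware_growth * claimed_algo_gain

print(f"true growth in useful compute: {true_growth:.1f}x")
print(f"hardware x 'algorithmic' tally: {double_counted:.1f}x")
```

In this toy setup the real gain is 2×, but the naive tally reports 3×, because the utilization dip that the CUDA work repaired was itself caused by the hardware expansion.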

Overall, I remain confused, and I think Dario is probably doing a lot of this double-counting stuff, and/or describing things in a confusing way.

[Boy, I sure feel weird about lecturing Dario Amodei on the big picture of LLM training! He knows more about LLM training than almost anyone on Earth, and I have (checks notes) no LLM training experience whatsoever. So if anyone has a different proposal for how to make sense of Dario’s quote above, I’m all ears!]

3. Sanity-check: nanochat

There were seven years between GPT-2 (trained in early 2019) and the new nanochat, which matches GPT-2 performance on the “CORE” metric (“a diverse set of reasoning and knowledge tasks from the DCLM benchmark suite”).

Remarkably, Karpathy says here that nanochat training costs $73 (“3 hours on a single 8XH100 node”), whereas “GPT-2 was trained by OpenAI on 32 TPU v3 chips for 168 hours (7 days), with $8/hour/TPUv3 back then, for a total cost of approx. $43K”.
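Checking the arithmetic on those quoted figures (the only inputs are the numbers in the Karpathy quote, plus the seven-year gap mentioned above):

```python
import math

# GPT-2: 32 TPU v3 chips x 168 hours x $8 per chip-hour.
gpt2_cost = 32 * 168 * 8
nanochat_cost = 73  # dollars, per the quote

cost_ratio = gpt2_cost / nanochat_cost

# Halving time in months, if the cost drop were a steady exponential.
years = 7
halving_months = 12 * years / math.log2(cost_ratio)

print(f"GPT-2 cost: ${gpt2_cost:,}")
print(f"cost ratio: {cost_ratio:.0f}x")
print(f"halving time: {halving_months:.1f} months")
```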

So that would be a factor of 600 in 7 years, i.e. a halving time of 9 months. Is that consistent with my story in §1? I think so! For example, it might be something like:

6× from lower hardware costs (Epoch says FLOP/$ has increased 30%/year, and $1.3^{7}$ ≈ 6);
5× from “learning algorithm improvements” in §1.1;
2.5× from “optimizations” in §1.3;
- (Note also that there have been GPU hardware changes since 2019 that presumably make them better tailored to Transformers, not all of which would necessarily be reflected in peak-FLOP/$.)
8× from better data (§1.4)

For that last one: GPT-2 used “webtext”, which was generated by scraping URLs linked from Reddit. By contrast, nanochat trains on “fineweb-EDU”, a dataset of educational materials crafted and curated with incomparably more effort and care. Remember, we’re comparing nanochat to GPT-2 on “reasoning and knowledge tasks”, not perplexity; I would be shocked if this better data was not playing a major role.
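Here is the guessed decomposition above multiplied out, along with each factor’s share of the total on a log scale. The four factors are the guesses from the list; the log-share breakdown is just one way I chose to compare them.

```python
import math

# Guessed contributions to the ~600x cost reduction (from the list above).
factors = {
    "lower hardware costs": 6.0,
    "learning algorithm improvements (§1.1)": 5.0,
    "optimizations (§1.3)": 2.5,
    "better data (§1.4)": 8.0,
}

total = math.prod(factors.values())
hardware_check = 1.3 ** 7  # 30%/year compounded over 7 years

print(f"product of factors: {total:.0f}x")
print(f"1.3^7 = {hardware_check:.2f}")

# Share of each factor in log terms (shares sum to 100%).
for name, value in factors.items():
    share = math.log(value) / math.log(total)
    print(f"{name}: {value}x ({share:.0%} of the total in log terms)")
```

The guesses do multiply out to 600×, with the data factor as the single largest contributor in log terms.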

So anyway, my take in §1 seems at least plausibly consistent with the nanochat thing, AFAICT at a glance. To be clear, I didn’t check things in detail or scrutinize it very hard. If anyone wants to really check, you could just download nanochat and have at it!

4. Optional bonus section: why does this matter?

Well for 99% of the people reading this post, this topic matters to you because you’re trying to forecast future LLM progress. But that’s not my interest, so I won’t talk about it. I’ll leave that to others!

I’m actually interested in a rather different debate, related to arguments about takeoff speeds for a hypothetical future AI paradigm—see Foom & Doom 1: “Brain in a box in a basement”, e.g. comment on that post by @ryan_greenblatt. Here’s the debate:

One school of thought (that I vaguely associate with Paul Christiano^[7]) says: When people are trying to do something in ML, they very rapidly get to near the ceiling of how efficiently they can do that thing, given the available data and hardware situation (but perhaps leaving aside paradigm shifts, which are not about doing the same thing more efficiently, but rather about trying to do something different instead).

A different school of thought says: No, that’s wrong, instead when people are trying to do something in ML, there will be a very large collection of algorithmic improvements that could make it work more efficiently, and these improvements will take many thousands of person-years to discover, and they will collectively amount to orders of magnitude of efficiency difference.

I’m generally in the first school of thought, which of course goes along with my belief that a future AI paradigm shift could lead to a remarkably sudden emergence of AGI and ASI.

…OK, if that’s the debate, then what lesson do we take away from this LLM case-study? My answer: If I’m right (a big “if”!), then (I would argue) the history of LLMs seems to support the first school of thought more than the second.

To be clear, I don’t think this kind of analogizing is terribly strong evidence either way; and note also that there are other case-studies like this ImageNet analysis that might or might not paint a different picture (I didn’t check).

In fact, there are two key disanalogies between LLMs and the future AI paradigm I’m expecting (see Foom & Doom 1), and they make my case for next-paradigm sudden takeoff even stronger: the future paradigm I’m expecting would (1) not rely on training data for its capabilities (unlike LLMs), making §1.4 basically moot; and (2) require very little compute to get from random initialization to AGI (if efficiently implemented), which would allow for much more rapid iteration and testing than we’re used to from LLMs.

Thanks Hans Gundlach, Seth Herd, plex, algon, and ishaan for critical comments on earlier drafts.

Changelog

Feb. 9, 2026: I pulled “unlockers of long context windows” into their own category (new section §1.2); I increased my guess of the §1.1 impact from “3-5× beyond Transformer + Chinchilla” to “10×”. I heavily reframed the discussion of “optimizations” (now §1.3), to clarify what’s getting compared to what. I added a couple more examples to the “data” category (now §1.4). Then I made various related changes to my analysis of the Dario quote, and nanochat. Thanks to the commenters for ideas and pushback!

^[1]
Gundlach links to this paper on tokenizers, and describes it as a claim that “inefficient tokenization” can cause up to 68% performance hit. But I think that 68% number comes from comparing best practices to the extraordinarily dumb idea of building a tokenizer using English-only text and then running inference on a multilingual corpus. As for real tokenizer improvements, I think everyone’s been using BPE since before the Transformer, and different flavors of BPE seem quite similar, if I’m reading the paper right. As an example, this page benchmarks the tokenizer of the recently-released nanochat (§3) against the tokenizer used by GPT-2 in 2019, and finds 0-15% compression difference, depending on the data type. The difference probably comes from using better training data to set up the tokenizer.
^[2]
FYI: They note that if you revert one of these things at a time, it has a bigger deleterious impact than if you revert the whole package at once. In other words, the “Retro Transformer” and the “Modern Transformer” were each a package of components that worked particularly well together.
^[3]
For example, Karpathy here mentions the “muon optimizer … residual pathways and skip connections gated by learnable scalars, and value embeddings”. And a commenter noted that the DeepSeek-v3 paper includes: “Sigmoid gating with top-K normalization replacing softmax”, “MuonClip optimizer”, and “Multi-phase learning rate schedule with annealing stages”. None of these seem to have been studied by Gundlach et al. 2025b, unless I missed it.
^[4]
This of course fits perfectly with my belief that we should think of LLMs as getting their impressive powers almost entirely via imitation learning from their training corpus.
^[5]
A commenter mentions a couple more examples: “fill-in-the-middle pretraining” (used in DeepSeek v3) and LLM-based data augmentation / rephrasing (used in Kimi K2).
^[6]
For example, we can just multiply Dario’s “CMs”. E.g. if “every once in a while” means “three times ever”, then we would have 1000× just from the “very large” CMs, before we even start on the “medium” and “small” ones! Anyway, I’m willing to haggle over a factor of 2 or 5 or whatever, but I think it’s sanewashing to interpret Dario’s quote as claiming less than, say, 3000× over 7 years.
^[7]
For example, Paul Christiano 2018: “A more precise version [of] my claim: if you gave smart grad students from 1990 access to all of the non-AI technology of 2017 (esp. software tools + hardware + data) and a big budget, it would not take them long to reach nearly state of the art performance on supervised learning and RL. For example, I think it’s pretty plausible that 20 good grad students could do it in 3 years if they were motivated and reasonably well managed.” (Actually, Paul is much further in the direction of the first school of thought than I am, because I defined it with a carve-out for possible hard-to-discover paradigm shifts, and he’s not even conceding that.)