I think most things mentioned in 1.4 (“Algorithmic changes that are not really quantifiable as efficiency”) belong in 1.1 (algorithmic efficiency progress), because they actually can be quantified as efficiency improvements, namely SFT, RLHF, and RLVR. These have strongly increased capabilities, as measured by benchmarks, compared to GPT-3-style prompt engineering of the underlying base model. So a much smaller model with these improvements can match the performance of a larger base model that lacks them.
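One way to make “quantified as efficiency improvements” concrete is as a compute multiplier: the training compute a plain base model would need to match the post-trained model’s benchmark scores, divided by the compute the post-trained model actually used. A minimal sketch, with made-up numbers:

```python
# Toy illustration of a compute-equivalent multiplier. All numbers are made up
# for the example; they are not measurements of any real model.

def compute_multiplier(base_equivalent_flop: float, actual_flop: float) -> float:
    """How much training compute the post-training recipe 'saves' at matched benchmark score."""
    return base_equivalent_flop / actual_flop

# Hypothetical: a post-trained (SFT + RLHF + RLVR) model trained with 1e24 FLOP
# matches the benchmark scores of a plain base model that would need 2e25 FLOP.
print(compute_multiplier(2e25, 1e24))  # -> 20.0, i.e. a ~20x efficiency gain
```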
In particular, the invention and subsequent improvement of RLVR has made things possible (advanced math, programming, agentic tool use, answering any question that requires a non-trivial amount of reasoning) that were far out of reach for older frontier models like GPT-3/GPT-4, which had no robust reasoning ability apart from the very brittle, hallucination-prone “think step by step” trick.
I would also include improvements from synthetic training data as an algorithmic improvement, not a data-related improvement, because better synthetic training data is created by better algorithms. E.g., AlphaGo Zero would clearly count as pure algorithmic improvement, because the synthetic data generated during self-play is itself the output of an improved algorithm. By the way, more recent forms of RLVR also include self-play, which hasn’t been appreciated enough in my opinion. (Self-play can be classified as a weak form of RSI, distinct from classical RSI in the sense of an AI doing AI research.)
Even in forms of RLVR that rely on human-written training tasks without self-play, data-related improvement is not independent of algorithmic progress from RLVR: the data would be useless without first inventing RLVR. So it is difficult to say (as Gundlach apparently tries to) which of the two caused “more” of the progress they produced in combination.
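To see why the two are hard to separate, here is a minimal sketch of what an RLVR-style training step looks like. It is entirely schematic: the task format, verifier, and update function are placeholder assumptions, not any lab’s actual pipeline.

```python
# Minimal sketch of RL with verifiable rewards (RLVR). Everything here is a toy
# stand-in: `model_sample` and `policy_gradient_update` are placeholders for a
# real LLM and a real RL update (e.g. a PPO/GRPO-style step), not any actual API.

import random

# Human-written training tasks with programmatically checkable answers.
tasks = [
    {"prompt": "What is 17 * 24?", "answer": "408"},
    {"prompt": "What is 91 + 13?", "answer": "104"},
]

def model_sample(prompt: str) -> str:
    """Placeholder: sample a completion (reasoning + final answer) from the policy."""
    return random.choice(["408", "104", "I don't know"])

def verify(completion: str, answer: str) -> float:
    """The 'verifiable' part: the reward comes from checking, not from human ratings."""
    return 1.0 if completion.strip() == answer else 0.0

def policy_gradient_update(prompt: str, completion: str, reward: float) -> None:
    """Placeholder: reinforce completions in proportion to their reward."""
    pass

for task in tasks:
    completion = model_sample(task["prompt"])
    reward = verify(completion, task["answer"])
    policy_gradient_update(task["prompt"], completion, reward)
```

Drop the verify-and-reinforce loop and the same task data produces no training signal at all; drop the data and the loop has nothing to run on. That is why attributing the combined progress to one factor or the other is tricky.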
Regarding distillation:
> I propose that model distillation is the main explanation for the Gundlach et al. 2025a claim that inference compute has been dropping 3×/year, holding quality fixed. As the biggest and best models get ever bigger and better, the tiny distilled models get better too, thus surpassing quality thresholds that previously required a bigger model.
I think model distillation would not cause such a large and ongoing improvement in inference efficiency. When model distillation is first invented (a purely algorithmic type of progress), there is indeed a large one-time efficiency improvement, because a small model created by distillation is suddenly much better than a small model created without it. But after that, there is no further improvement of this kind: the relative difference between Gemini 2.5 Flash and Gemini 3 Flash (distilled models) presumably matches approximately the relative difference between Gemini 2.5 Pro and Gemini 3 Pro. So distillation contributes no further efficiency gain, because the improvement in the smaller models simply tracks the improvement in the larger ones. Unless the distillation method itself improves, but that would again count as progress from an algorithmic efficiency improvement.
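To make the one-time-gain claim concrete, here is a toy illustration with invented benchmark scores (not real numbers for any Gemini model):

```python
# Invented benchmark scores, purely to illustrate the argument (not real data).
scores = {
    "generation 2.5": {"Pro": 70.0, "Flash (distilled)": 62.0},
    "generation 3":   {"Pro": 80.0, "Flash (distilled)": 72.0},
}

for gen, s in scores.items():
    print(gen, "Pro minus Flash gap:", s["Pro"] - s["Flash (distilled)"])
# Both gaps come out the same (8 points): the small distilled model improves in
# lockstep with the big model it is distilled from, so distillation itself adds
# no generation-over-generation efficiency gain beyond its original one-time boost.
```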
Thanks for the feedback!

> I would also include improvements from synthetic training data as an algorithmic improvement, not a data-related improvement, because better synthetic training data is created by better algorithms…
I have now changed the text in a few places to better clarify how I am defining the scope of §1.1.
I feel like you’re maybe reading in some subtext, where you think I’m trying to downplay the things outside §1.1, and suggest that they don’t really count, or something? If so, that’s not what I meant to suggest, and it’s not how I feel in my heart. I’m open to rewording more, if you have suggestions.
> I think most things mentioned in 1.4 (“Algorithmic changes that are not really quantifiable as efficiency”) belong in 1.1 (algorithmic efficiency progress), because they actually can be quantified as efficiency improvements, namely SFT, RLHF, and RLVR…
In the context of this post, I’m mainly interested in: (1) are the things in §1.4 relevant to the Epoch claim of exponential algorithmic improvements? and (2) are the things in §1.4 relevant to the Dario claim of exponential algorithmic improvements? It seems to me that the answer in both cases is “no”.
(1) In the Epoch case, I believe they quantified performance by perplexity, not benchmarks.
(2) In the Dario case, I mean, I keep reading and re-reading the exact wording of the excerpt where he talks about “compute multipliers”. And it just really doesn’t sound to me like he is referring to SFT, RLHF, or RLVR in that excerpt (nor anything else in §1.4). Admittedly, his wording is a bit vague and confusing (to me). I’m open to discussion.
> I think model distillation would not cause such a large and ongoing improvement in inference efficiency.
Pick a fixed model size, let’s say N = 50B parameters. My current belief is that if you straightforwardly distill Claude Opus 3.5 into an N-parameter model, you wind up with a worse model than if you straightforwardly distill Claude Opus 4.5 into an N-parameter model.
Are you disagreeing with that?
If you agree, then it would follow that (for example) maybe:
- Alice can distill Opus 3.5 into a 100B-parameter model,
- Bob can distill Opus 4.5 into a 40B-parameter model,
- …and the two models may have the same benchmarks (because Bob’s better starting point is making up for his more aggressive distillation).

Thus we would see ever-falling inference costs at any given level of benchmarks. See what I mean?
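A back-of-the-envelope version of that point, using the parameter counts from the example above. The assumptions here (inference cost roughly proportional to parameter count, one new teacher generation per year) are mine, just for illustration:

```python
# Parameter counts from the Alice/Bob example above. The cost model (inference
# cost roughly proportional to parameter count) is a simplifying assumption.
alice_params_b = 100  # Alice distills Opus 3.5 into a 100B-parameter model
bob_params_b = 40     # Bob distills Opus 4.5 into a 40B-parameter model

cost_drop_per_teacher_generation = alice_params_b / bob_params_b
print(cost_drop_per_teacher_generation)  # 2.5

# If frontier "teacher" generations arrive roughly yearly, distilling each new
# teacher alone gives a ~2.5x/year drop in inference cost at fixed benchmark
# level, which is in the same ballpark as the ~3x/year figure quoted from
# Gundlach et al., without any change to the distillation method itself.
```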