Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3 (and Grok 4 when you include it). Those models are widely believed to be cases where a ton of compute was poured in to make up for poor algorithmic efficiency. If you remove those, I expect your methodology would produce similar results to prior work (which is usually trying to estimate progress at the frontier of algorithmic efficiency, rather than efficiency progress at the frontier of capabilities).
I could imagine a reply that says “well, it’s a real fact that when you start with a model like Grok 3, the next models to reach a similar capability level will be much more efficient”. And this is true! But if you care about that fact, I think you should instead have two stylized facts, one about what happens when you are catching up to Grok or Llama, and one about what happens when you are catching up to GPT, Claude, or Gemini, rather than trying to combine these into a single estimate that doesn’t describe either case.
Your detailed results are also screaming at you that your method is not reliable. It is really not a good sign when an analysis that by construction has to give numbers in [1,∞) produces results that on the low end include 1.154, 2.112, and 3.201, and on the high end include 19,399.837 and even (if you include Grok 4) 2.13e9 and 2.65e16 (!!).
I find this reasoning unconvincing because their appendix analysis (like that in this blog post) is based on more AI models than their primary analysis!
The primary evidence that the method is unreliable is not that the dataset is too small, it’s that the results span such a wide interval, and it seems very sensitive to choices that shouldn’t matter much.
Your results are primarily driven by the inclusion of Llama 3.1-405B and Grok 3
I’m fairly sure this is not the case. In this appendix, when I systematically drop one frontier model at a time and recalculate the slope for each bucket, Llama 3.1-405B isn’t even the most influential model for the >=25 bucket (the only bucket where it’s on the frontier)! And looking at the graph, that’s not surprising: it looks right on trend. Grok 3 also looks surprisingly on trend, and looking at that leave-one-out analysis, it is pretty influential, but even without it, the slope for that capability bucket is −3.5 orders of magnitude per year. Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don’t include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).
For thoroughness, I also reran the analysis with these data points excluded entirely, and the results are basically the same: for the confident and likely compute estimates (the main result in the post), we get a weighted log10 mean of 1.64 (44×) and a median of 1.21 (16×). I consider these quite in line with the main results (1.76, 1.21).
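For concreteness, here is a minimal sketch of the kind of calculation involved, with made-up release dates, compute values, and bucket weights standing in for the actual dataset and compute estimates:

```python
import numpy as np

# Minimal sketch of the per-bucket analysis described above, using made-up numbers.
# Each capability bucket has a set of compute-frontier models; the fitted slope of
# log10(training FLOP) vs. release year gives the rate of compute reduction at that
# capability level.

def log10_slope(years, flop):
    """OLS slope of log10(training FLOP) against release year (negative when compute falls)."""
    return np.polyfit(years, np.log10(flop), 1)[0]

# Hypothetical frontier models for one bucket: (release year, estimated training FLOP).
years = np.array([2024.2, 2024.8, 2025.3, 2025.7])
flop = np.array([4e25, 6e24, 9e23, 3e23])

full_slope = log10_slope(years, flop)

# Leave-one-out check: drop each frontier model in turn and refit, to see whether
# any single model is driving the bucket's slope.
loo_slopes = [log10_slope(np.delete(years, i), np.delete(flop, i)) for i in range(len(years))]

# Summary across buckets: aggregate the per-bucket slopes (here with hypothetical
# weights), then convert to a multiplicative compute-reduction factor per year.
bucket_slopes = np.array([-1.22, -1.41, -1.22])  # example per-bucket log10 slopes
weights = np.array([5, 4, 6])                    # e.g. number of frontier models per bucket
weighted_mean = np.average(bucket_slopes, weights=weights)
median = np.median(bucket_slopes)

print(f"bucket slope {full_slope:.2f}, leave-one-out {np.round(loo_slopes, 2)}")
print(f"weighted log10 mean {weighted_mean:.2f} -> {10 ** -weighted_mean:.0f}x per year")
print(f"log10 median {median:.2f} -> {10 ** -median:.0f}x per year")
```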
There’s a related point, which is maybe what you’re getting at, which is that these results suffer from the exclusion of proprietary models for which we don’t have good compute estimates. For example, o1 would have been the first model in Grok 3’s performance tier and plausibly used less compute—if we had a good compute estimate for it and it was lower than Grok 3’s, Grok 3 wouldn’t have made the frontier, and by definition the slope for that capability bucket would be less steep. I thought about trying to make my own compute estimates for such models but decided not to for the sake of project scope.
Thanks, I hadn’t looked at the leave-one-out results carefully enough. I agree this (and your follow-up analysis rerun) means my claim is incorrect. Looking more closely at the graphs, in the case of Llama 3.1, I should have noticed that EXAONE 4.0 (1.2B) was also a pretty key data point for that line. No idea what’s going on with that model.
(That said, I do think going from 1.76 to 1.64 after dropping just two data points is a pretty significant change; I also assume that this is really just attributable to Grok 3, so it’s really more like one data point. Of course the median won’t change, and I do prefer the median estimate because it is more robust to these outliers.)
There’s a related point, which is maybe what you’re getting at, which is that these results suffer from the exclusion of proprietary models for which we don’t have good compute estimates.
I agree this is a weakness but I don’t care about it too much (except inasmuch as it causes us to estimate algorithmic progress by starting with models like Grok). I’d usually expect it to cause estimates to be biased downwards (that is, the true number is higher than estimated).
Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don’t include these models, such as 30, 35, 40 (log10 slopes of 1.22, 1.41, 1.22).
This corresponds to a 16–26× drop in cost per year? Those estimates seem reasonable (maybe slightly high) given you’re measuring the drop in cost to achieve given benchmark scores.
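Spelling out the arithmetic behind that range: a log10 slope of $s$ corresponds to a $10^{s}\times$ compute reduction per year, so the quoted slopes give

$$10^{1.22} \approx 16.6\times \qquad \text{and} \qquad 10^{1.41} \approx 25.7\times.$$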
I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time
Later models are more likely to have reasoning training
None of these apply to the pretraining-based analysis, though of course it is biased in the other direction (if you care about catch-up algorithmic progress) by not taking into account distillation or post-training.
I do think 3× is too low as an estimate for catch-up algorithmic progress; inasmuch as your main claim is “it’s a lot bigger than 3×”, I’m on board with that.
I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time
These are important limitations, thanks for bringing them up!
Later models are more likely to have reasoning training
Can you say more about why this is a limitation / issue? Is this different from a 2008-2015 analysis saying “later models are more likely to use the transformer architecture,” where my response is “that’s algorithmic progress for ya”? One reason it may be different is that inference-time compute might be trading off against training compute in a way that we think makes the comparison between low- and high-inference-compute models improper.
Can you say more about why this is a limitation / issue? Is this different from a 2008-2015 analysis saying “later models are more likely to use the transformer architecture,” where my response is “that’s algorithmic progress for ya”? One reason it may be different is that inference-time compute might be trading off against training compute in a way that we think makes the comparison between low- and high-inference-compute models improper.
Yeah it’s just the reason you give, though I’d frame it slightly differently. I’d say that the point of “catch-up algorithmic progress” was to look at costs paid to get a certain level of benefit, and while historically “training compute” was a good proxy for cost, reasoning models change that since inference compute becomes decoupled from training compute.
I reread the section you linked. I agree that the tasks that models do today have a very small absolute cost such that, if they were catastrophically risky, it wouldn’t really matter how much inference compute they used. However, models are far enough from that point that I think you are better off focusing on the frontier of currently-economically-useful tasks. In those cases, assuming you are using a good scaffold, my sense is that the absolute costs do in fact matter.
Your detailed results are also screaming at you that your method is not reliable
It seems to me that they are screaming that we can’t be confident in the particular numbers output by these methods. And I’m not. I tried to be clear in this post that what I take to be the results from this method (16×–60× per year) is not my all-things-considered view (20×, with an 80% CI from 2× to 200×).
Speaking colloquially, I might say “these results indicate to me that catch-up algorithmic progress is on the order of one or 1.5 orders of magnitude per year rather than half an order of magnitude per year as I used to think”. And again, my previous belief of 3× per year was a belief that I should have known was incorrect because it’s based only on pre-training.
The primary evidence that the method is unreliable is not that the dataset is too small, it’s that the results span such a wide interval, and it seems very sensitive to choices that shouldn’t matter much.
This was a helpful clarification, thanks. In the present analysis, the results span a wide interval, but the lower end of that interval is still generally higher than my prior!
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I’m updating too much based on unreliable methods? Okay come take my money.
Speaking colloquially, I might say “these results indicate to me that catch-up algorithmic progress is on the order of one or 1.5 orders of magnitude per year rather than half an order of magnitude per year as I used to think”. And again, my previous belief of 3× per year was a belief that I should have known was incorrect because it’s based only on pre-training.
Okay fair enough, I agree with that.
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I’m updating too much based on unreliable methods? Okay come take my money.
I find this attitude weird. It takes a lot of time to actually make and settle a bet. (E.g. I don’t pay attention to Artificial Analysis and would want to know something about how they compute their numbers.) I value my time quite highly; I think one of us would have to be betting seven figures, maybe six figures if the disagreement were big enough, before it looked good even in expectation (i.e. no risk aversion) as a way for me to turn time into money.
I think it’s more reasonable as a matter of group rationality to ask that an interlocutor say what they believe, so in that spirit here’s my version of your prediction, where I’ll take your data at face value without checking:
[DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier.] I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 3e23 FLOP, with the 80% CI covering 6e22–1e24 FLOP.
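In terms of the implied one-year compute reduction relative to DeepSeek-V3.2-Exp’s 3.8e24 FLOP, that central estimate and the CI endpoints work out to roughly

$$\frac{3.8\times 10^{24}}{3\times 10^{23}} \approx 12.7\times, \qquad \frac{3.8\times 10^{24}}{10^{24}} = 3.8\times \;\;\text{to}\;\; \frac{3.8\times 10^{24}}{6\times 10^{22}} \approx 63\times.$$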
Note that I’m implicitly doing a bunch of deference to you here (e.g. that this is a reasonable model to choose, that AAII will behave reasonably regularly and predictably over the next year), though to be clear I’m also using other not-in-post heuristics (e.g. expecting that DeepSeek models will be more compute-efficient than most). So I wouldn’t exactly consider this equivalent to a bet, but I do think it’s something people can and should use to judge track records.
I think it’s more reasonable as a matter of group rationality to ask that an interlocutor say what they believe
Super fair. I probably should have just asked what you anticipate observing that might differ from my expectation. I appreciate you writing your own version of the prediction; that’s basically what I wanted. And it sounds like I don’t even have enough money to make a bet you would consider worth your time!
As to our actual predictions, they seem quite similar to me, which is clarifying. I was under the impression you expected slower catch-up progress. A central prediction of 3e23 FLOP implies a 1/(3e23/3.8e24) ≈ 12.7× reduction in FLOP over a year, which I also consider quite likely!
I was under the impression you expected slower catch-up progress.
Note that I think the target we’re making quantitative forecasts about will tend to overestimate that-which-I-consider-to-be “catch-up algorithmic progress”, so I do expect slower catch-up progress than the naive inference from my forecast (of course, maybe you already factored that in).