Your detailed results are also screaming at you that your method is not reliable
It seems to me that they are screaming that we can’t be confident in the particular number output by these methods. And I’m not. I tried to be clear in this post that the results from this method (16×–60× per year) are not my all-things-considered view (20×, with an 80% CI from 2×–200×).
Speaking colloquially, I might say “these results indicate to me that catch-up algorithmic progress is on the order of one to 1.5 orders of magnitude per year, rather than half an order of magnitude per year as I used to think”. And again, my previous belief of 3× per year was one I should have known was incorrect, because it was based only on pre-training.
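To make the colloquial version concrete, here’s the conversion from annual speedup factors to orders of magnitude per year, as a quick Python sketch (the factors are just the ones quoted above; `ooms_per_year` is a hypothetical helper name, not anything from the post):

```python
import math

def ooms_per_year(factor: float) -> float:
    """Convert an annual compute-reduction factor into orders of magnitude per year."""
    return math.log10(factor)

# Results from the method in the post: 16x-60x per year.
print(f"16x/year -> {ooms_per_year(16):.2f} OOM/year")  # ~1.20
print(f"60x/year -> {ooms_per_year(60):.2f} OOM/year")  # ~1.78

# My previous belief: 3x per year.
print(f"3x/year  -> {ooms_per_year(3):.2f} OOM/year")   # ~0.48, i.e. about half an OOM
```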
The primary evidence that the method is unreliable is not that the dataset is too small; it’s that the results span such a wide interval and seem very sensitive to choices that shouldn’t matter much.
This was helpful clarification, thanks. In the present analysis, the results span a wide interval, but the lower end of that interval is still generally higher than my prior!
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I’m updating too much based on unreliable methods? Okay come take my money.
Speaking colloquially, I might say “these results indicate to me that catch-up algorithmic progress is on the order of one to 1.5 orders of magnitude per year, rather than half an order of magnitude per year as I used to think”. And again, my previous belief of 3× per year was one I should have known was incorrect, because it was based only on pre-training.
Okay fair enough, I agree with that.
As I said in footnote 9, I am willing to make bets about my all-things-considered beliefs. You think I’m updating too much based on unreliable methods? Okay come take my money.
I find this attitude weird. It takes a lot of time to actually make and settle a bet. (E.g., I don’t pay attention to Artificial Analysis and would want to know something about how they compute their numbers.) I value my time quite highly; I think one of us would have to be betting seven figures, maybe six figures if the disagreement was big enough, before it looked good even in expectation (i.e., no risk aversion) as a way for me to turn time into money.
I think it’s more reasonable as a matter of group rationality to ask that an interlocutor say what they believe, so in that spirit here’s my version of your prediction, where I’ll take your data at face value without checking:
[DeepSeek-V3.2-Exp is estimated by Epoch to be trained with 3.8e24 FLOP. It reached an AAII index score of 65.9 and was released on September 29, 2025. It is on the compute-efficiency frontier.] I predict that by September 29, 2026, the least-compute-used-to-train model that reaches a score of 65 will be trained with around 3e23 FLOP, with the 80% CI covering 6e22–1e24 FLOP.
Note that I’m implicitly doing a bunch of deference to you here (e.g. that this is a reasonable model to choose, that AAII will behave reasonably regularly and predictably over the next year), though tbc I’m also using other not-in-post heuristics (e.g. expecting that DeepSeek models will be more compute efficient than most). So, I wouldn’t exactly consider this equivalent to a bet, but I do think it’s something where people can and should use it to judge track records.
I think it’s more reasonable as a matter of group rationality to ask that an interlocutor say what they believe
Super fair. I probably should have just asked what you anticipate observing that might differ from my expectation. I appreciate you writing your own version of the prediction, that’s basically what I wanted. And it sounds like I don’t even have enough money to make a bet you would consider worth your time!
As to our actual predictions, they seem quite similar to me, which is clarifying. I was under the impression you expected slower catch-up progress. A central prediction of 3e23 FLOP implies a 3.8e24 / 3e23 ≈ 12.7× reduction in FLOP over a year, which I also consider quite likely!
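Spelling that arithmetic out (a minimal check in Python, using only the FLOP figures from your prediction above):

```python
# One-year compute-reduction factors implied by the prediction:
# baseline model trained with 3.8e24 FLOP; the forecast is that matching
# its score a year later takes ~3e23 FLOP (80% CI: 6e22 to 1e24).
baseline = 3.8e24

for label, flop in [("point estimate (3e23)", 3e23),
                    ("CI low end, 6e22 (faster progress)", 6e22),
                    ("CI high end, 1e24 (slower progress)", 1e24)]:
    print(f"{label}: {baseline / flop:.1f}x per year")

# point estimate (3e23): 12.7x per year
# CI low end, 6e22 (faster progress): 63.3x per year
# CI high end, 1e24 (slower progress): 3.8x per year
```

Even the slow end of that interval (3.8× per year) sits slightly above my old 3× per year prior.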
I was under the impression you expected slower catch-up progress.
Note that I think the target we’re making quantitative forecasts about will tend to overestimate that-which-I-consider-to-be “catch-up algorithmic progress” so I do expect slower catch-up progress than the naive inference from my forecast (ofc maybe you already factored that in).
Thanks for your engagement!