Thanks, I hadn’t looked at the leave-one-out results carefully enough. I agree this (and your follow-up analysis rerun) means my claim is incorrect. Looking more closely at the graphs, in the case of Llama 3.1, I should have noticed that EXAONE 4.0 (1.2B) was also a pretty key data point for that line. No idea what’s going on with that model.
(That said, I do think going from 1.76 to 1.64 after dropping just two data points is a pretty significant change; I also assume this is really just attributable to Grok 3, so it’s really more like one data point. Of course the median won’t change, and I do prefer the median estimate because it is more robust to these outliers.)
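To make the robustness point concrete, here is a minimal sketch of the kind of leave-one-out check being discussed. This is not the original analysis code, and the data points are made up; it just shows how much a single model can move a fitted slope, and why a median over many such estimates is less sensitive to that.

```python
# Minimal sketch of a leave-one-out slope check (illustrative data, not the real analysis).
import numpy as np

def fit_slope(dates, log10_costs):
    """Least-squares slope of log10(cost to reach a score) vs. release date (in years)."""
    return np.polyfit(dates, log10_costs, 1)[0]

# Hypothetical (release date, log10 cost) points for one capability bucket.
dates = np.array([2023.2, 2023.8, 2024.1, 2024.6, 2025.0])
log10_costs = np.array([6.1, 5.3, 4.6, 3.9, 3.2])

full = fit_slope(dates, log10_costs)
loo = [fit_slope(np.delete(dates, i), np.delete(log10_costs, i))
       for i in range(len(dates))]

# How much can any single model (e.g. an outlier like Grok 3) move the slope?
print(f"full fit: {full:.2f}")
print(f"leave-one-out range: {min(loo):.2f} to {max(loo):.2f}")
# A median taken across many such per-bucket estimates moves much less than the
# mean when one or two of them are dragged around by a single outlier point.
```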
There’s a related point, which is maybe what you’re getting at: these results suffer from the exclusion of proprietary models for which we don’t have good compute estimates.
I agree this is a weakness but I don’t care about it too much (except inasmuch as it causes us to estimate algorithmic progress by starting with models like Grok). I’d usually expect it to cause estimates to be biased downwards (that is, the true number is higher than estimated).
Another reason to think these models are not the main driver of the results is that there are high slopes in capability buckets that don’t include these models, such as the 30, 35, and 40 buckets (log10 slopes of 1.22, 1.41, and 1.22).
This corresponds to a 16-26x drop in cost per year? Those estimates seem reasonable (maybe slightly high) given that you’re measuring the drop in cost to achieve benchmark scores.
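(Quick check of that conversion, since it’s just arithmetic: a log10 slope of s per year corresponds to a 10^s-fold drop in cost per year.)

```python
# A log10 slope of s per year means costs fall by a factor of 10**s each year.
for s in (1.22, 1.41):
    print(f"log10 slope {s} -> {10**s:.1f}x drop in cost per year")
# 1.22 -> ~16.6x and 1.41 -> ~25.7x, i.e. the 16-26x range mentioned above.
```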
I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
Later models are more likely to be benchmaxxed
(Probably not a big factor, but who knows) Benchmarks get more contaminated over time
Later models are more likely to have reasoning training
None of these apply to the pretraining-based analysis, though of course it is biased in the other direction (if you care about catch-up algorithmic progress) by not taking into account distillation or post-training.
I do think 3x is too low as an estimate for catch-up algorithmic progress; inasmuch as your main claim is “it’s a lot bigger than 3x”, I’m on board with that.
> I do think that this is an overestimate of catch-up algorithmic progress for a variety of reasons:
> Later models are more likely to be benchmaxxed
> (Probably not a big factor, but who knows) Benchmarks get more contaminated over time
These are important limitations, thanks for bringing them up!
> Later models are more likely to have reasoning training
Can you say more about why this is a limitation / issue? Is this different from a 2008-2015 analysis saying “later models are more likely to use the transformer architecture”, where my response is “that’s algorithmic progress for ya”? One reason it may be different is that inference-time compute might be trading off against training compute in a way that we think makes the comparison improper between low and high inference-compute models.
Yeah it’s just the reason you give, though I’d frame it slightly differently. I’d say that the point of “catch-up algorithmic progress” was to look at costs paid to get a certain level of benefit, and while historically “training compute” was a good proxy for cost, reasoning models change that since inference compute becomes decoupled from training compute.
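To spell out the decoupling, here is a toy cost model (my own illustrative numbers, not anything from the analysis): for a fixed level of benefit, the cost you actually pay is amortized training compute plus the inference compute spent on the task, and reasoning models shift spend into the second term, so training compute alone stops being a good proxy.

```python
# Toy cost model, illustrative numbers only: cost paid per task at a fixed capability
# level = amortized training compute + inference compute spent on the task.
def cost_per_task(train_flop, tasks_amortized_over, inference_flop_per_task):
    return train_flop / tasks_amortized_over + inference_flop_per_task

# A big pretrained model vs. a smaller model that reasons (spends far more at inference).
big_pretrained = cost_per_task(1e25, 1e9, 1e14)
small_reasoner = cost_per_task(1e24, 1e9, 1e16)

# Ranking by training compute alone would call the second model ~10x cheaper, but the
# cost actually paid per task is dominated by its inference-time reasoning.
print(f"{big_pretrained:.2e} vs {small_reasoner:.2e}")
```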
I reread the section you linked. I agree that the tasks that models do today have a very small absolute cost, such that, if they were catastrophically risky, it wouldn’t really matter how much inference compute they used. However, models are far enough from that point that I think you are better off focusing on the frontier of currently-economically-useful tasks. In those cases, assuming you are using a good scaffold, my sense is that the absolute costs do in fact matter.