How much progress in ML depends on algorithmic progress, scaling compute, or scaling relevant datasets is relatively poorly understood. In our paper, we make progress on this question by investigating algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision.
Using a dataset of a hundred computer vision models, we estimate a model, informed by neural scaling laws, that enables us to analyse the rate and nature of algorithmic advances. We use Shapley values to decompose the various drivers of progress in computer vision and estimate the relative importance of algorithms, compute, and data.
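To make the Shapley decomposition concrete, here is a minimal sketch in Python. It is not the paper's code: the growth factors in `GROWTH`, the value function `toy_error_reduction`, and its exponent are all made up for illustration. The only part that is standard is the Shapley rule itself, which averages each factor's marginal contribution over every order in which the factors could be "switched on".

```python
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over every ordering."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            grown = coalition | {p}
            totals[p] += value(grown) - value(coalition)
            coalition = grown
    return {p: total / len(orderings) for p, total in totals.items()}

# Hypothetical growth factors over some period (NOT estimates from the paper).
GROWTH = {"compute": 1000.0, "algorithms": 1000.0, "data": 4.0}

def toy_error_reduction(active):
    """Toy value function: error falls as a power law in the product of the
    factors that are switched on. The exponent 0.1 is made up."""
    effective = 1.0
    for name in active:
        effective *= GROWTH[name]
    return 1.0 - effective ** -0.1

phi = shapley_values(list(GROWTH), toy_error_reduction)
total = sum(phi.values())
for name, v in phi.items():
    print(f"{name}: {100 * v / total:.1f}% of the toy improvement")
```

By construction the shares sum to the full improvement, and two factors with identical growth receive identical credit, which is what makes the decomposition a natural way to report percentages.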
Our main results include:
- Every nine months, the introduction of better algorithms contributes the equivalent of a doubling of compute budgets. This is much faster than the gains from Moore’s law, though the estimate is uncertain (our 95% CI spans 4 to 25 months).
- Roughly, progress in image classification has been ~45% due to the scaling of compute, ~45% due to better algorithms, and ~10% due to scaling data.
- The majority (>75%) of algorithmic progress is compute-augmenting (i.e. enabling researchers to use compute more effectively), while a minority is data-augmenting.
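As a quick aid for reading the doubling times above, the sketch below converts a doubling time in months into the implied yearly multiplier on effective compute. The helper name `annual_multiplier` is ours, and the conversion simply assumes steady exponential growth.

```python
def annual_multiplier(doubling_months: float) -> float:
    """Effective-compute multiplier per year implied by a doubling time in months,
    assuming steady exponential growth."""
    return 2.0 ** (12.0 / doubling_months)

for label, months in [("point estimate", 9), ("fast end of CI", 4), ("slow end of CI", 25)]:
    print(f"{label}: doubling every {months} months -> x{annual_multiplier(months):.2f} per year")
```

The width of the interval matters: a 4-month doubling time implies an eightfold gain per year, while a 25-month doubling time implies well under a doubling per year.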
In our work, we revisit a question previously investigated by Hernandez and Brown (2020), which had been discussed on LessWrong by Gwern and Rohin Shah. Hernandez and Brown (2020) re-implement 15 popular open-source models and find a 44-fold reduction in the compute required to reach the same level of performance as AlexNet, indicating that algorithmic progress outpaces the original Moore’s law rate of improvement in hardware efficiency, doubling effective compute every 16 months.
A problem with their approach is that it is sensitive to the exact benchmark and threshold pair that one chooses. Choosing easier-to-achieve thresholds makes algorithmic improvements look less significant, as the scaling of compute easily brings early models within reach of such a threshold. By contrast, selecting harder-to-achieve thresholds makes it so that algorithmic improvements explain almost all of the performance gain. This is because early models might need arbitrary amounts of compute to achieve the performance of today’s state-of-the-art models. We show that the estimates of the pace of algorithmic progress with this approach might vary by around a factor of ten, depending on whether an easy or difficult threshold is chosen. [1]
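The threshold sensitivity can be illustrated with a toy model. This is a sketch under our own assumptions, not the analysis from the paper: we assume each algorithm's error follows `floor + k * C**(-alpha)`, where the old algorithm has a higher error floor than the new one, and all parameter values are invented. The measured "efficiency gain" is the ratio of compute the two algorithms need to hit a given error threshold.

```python
import math

def compute_needed(target_error, floor, k=1.0, alpha=0.5):
    """Compute at which error(C) = floor + k * C**(-alpha) reaches target_error."""
    if target_error <= floor:
        return math.inf  # this algorithm can never reach the threshold
    return (k / (target_error - floor)) ** (1.0 / alpha)

def efficiency_gain(target_error, floor_old=0.20, floor_new=0.05):
    """Ratio of compute the old vs. the new algorithm needs to reach the threshold.
    The floors are hypothetical; target_error must exceed floor_new."""
    return compute_needed(target_error, floor_old) / compute_needed(target_error, floor_new)

# Easier thresholds first, harder thresholds last.
for threshold in (0.50, 0.30, 0.25, 0.21, 0.20):
    print(f"error threshold {threshold:.2f}: measured gain x{efficiency_gain(threshold):.1f}")
```

In this toy setup the measured gain grows without bound as the threshold approaches the old algorithm's floor, which mirrors the point that early models might need arbitrary amounts of compute to match a hard threshold.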
Our work sheds new light on how algorithmic progress occurs: it primarily operates through relaxing compute bottlenecks rather than data bottlenecks. It further offers insight into how to use observational (rather than experimental) data to advance our understanding of algorithmic progress in ML.
- ^
That said, our estimate is consistent with Hernandez and Brown (2020)’s estimate that algorithmic progress doubles the amount of effective compute every 16 months, as our 95% confidence interval ranges from 4 to 25 months.
Thanks for this!
Question: Do you have a sense of how strongly compute and algorithms are complements vs substitutes in this dataset?
(E.g. if you compare compute X in 2022, compute (k^2)X in 2020, and kX in 2021: if there’s a k such that the last one is better than both the former two, that would suggest complementarity)
I think this question is interesting but difficult to answer based on the data we have, because the dataset is so poor when it comes to unusual examples that would really allow us to answer this question with confidence. Our model assumes that they are substitutes, but that’s not based on anything we infer from the data.
Our model is certainly not exactly correct, in the sense that there should be some complementarity between compute and algorithms, but the complementarity probably only becomes noticeable for extreme ratios between the two contributions. One way to think about this is that we can approximate a CES production function
$$Y(C, A) = \left(\alpha C^{\rho} + (1 - \alpha) A^{\rho}\right)^{1/\rho}$$
in training compute C and algorithmic efficiency A when C/A≈1 by writing it as
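$$Y(C, A) = A \, Y\!\left(e^{\log(C/A)}, 1\right) = A \left(\alpha e^{\rho \log(C/A)} + (1 - \alpha)\right)^{1/\rho} \approx A \left(1 + \alpha \rho \log(C/A)\right)^{1/\rho} \approx C^{\alpha} A^{1 - \alpha}$$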
which means the first-order behavior of the function around C/A≈1 doesn’t depend on ρ, which is the parameter that controls complementarity versus substitutability. Since people empirically seem to train models in the regime where C/A is close to 1, this makes it difficult to identify ρ from the data we have, and approximating by a Cobb-Douglas (which is what we do) does about as well as anything else. For this reason, I would caution against using our model to predict the performance of models that have an unusual combination of dataset size, training compute, and algorithmic efficiency.
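A small numerical check of this approximation, as a sketch: the code compares the CES form against the Cobb-Douglas form for a few values of ρ, first near C/A = 1 and then far from it. The share parameter `ALPHA = 0.5` and the chosen ratios are illustrative, not fitted values.

```python
def ces(C, A, alpha, rho):
    """CES aggregate of compute C and algorithmic efficiency A (rho != 0)."""
    return (alpha * C**rho + (1 - alpha) * A**rho) ** (1 / rho)

def cobb_douglas(C, A, alpha):
    """The rho -> 0 limit of the CES form."""
    return C**alpha * A ** (1 - alpha)

def relative_gap(C, A, alpha, rho):
    """Fractional difference between CES and its Cobb-Douglas approximation."""
    return ces(C, A, alpha, rho) / cobb_douglas(C, A, alpha) - 1

ALPHA = 0.5  # illustrative share parameter, not a fitted value
for ratio in (1.1, 2.0, 100.0):
    gaps = ", ".join(
        f"rho={rho:+.1f}: {relative_gap(ratio, 1.0, ALPHA, rho):+.3f}"
        for rho in (-1.0, 0.5, 1.0)
    )
    print(f"C/A = {ratio:>5}: {gaps}")
```

Near C/A = 1 the gap is a fraction of a percent for every ρ tried, so data from that regime cannot pin ρ down. The forms only diverge at extreme ratios, which is exactly where the dataset has few examples.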
In general, a more diverse dataset containing models trained with unusual values of compute and data for the year that they were trained in would help our analysis substantially. There are some problems with doing this experiment ourselves: for instance, techniques used to train larger models often perform worse than older methods if we try to scale them down. So there isn’t much drive to make algorithms run really well with small compute and data budgets, and that’s going to bias us towards thinking we’re more bottlenecked by compute and data than we actually are.
Interesting, thanks! To check my understanding:
In general, as time passes, all the researchers increase their compute usage at a similar rate. This makes it hard to distinguish between improvements caused by compute and algorithmic progress.
If the correlation between year and compute was perfect, we wouldn’t be able to do this at all.
But there is some variance in how much compute is used in different papers, each year. This variance is large enough that we can estimate the first-order effects of algorithmic progress and compute usage.
But complementarity is a second-order effect, and the data doesn’t contain enough variation/data-points to give a good estimate of second-order effects.
This looks correct to me—this is indeed how the model is able to disentangle algorithmic progress from scaling of training compute budgets.
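The identification argument can be sketched with a standard collinearity diagnostic. This is not how the paper fits its model: it only uses the textbook result that, with two regressors whose correlation is r, the variance of either coefficient estimate is inflated by 1 / (1 - r^2). The year and compute values below are invented for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def variance_inflation(xs, ys):
    """1 / (1 - r^2): how much collinearity between two regressors inflates
    the variance of either coefficient estimate, relative to r = 0."""
    r = pearson_r(xs, ys)
    if math.isclose(abs(r), 1.0):
        return math.inf  # perfectly collinear: the two effects cannot be separated
    return 1.0 / (1.0 - r * r)

# Hypothetical data, NOT from the paper: log10 compute grows by 0.5 per year.
years = list(range(2012, 2020))
trend = [0.5 * (y - 2012) for y in years]
scatter = [0.4, -0.3, 0.2, -0.5, 0.3, -0.2, 0.5, -0.4]  # made-up per-paper spread
log_compute = [t + s for t, s in zip(trend, scatter)]

print("no spread:  ", variance_inflation(years, trend))
print("some spread:", variance_inflation(years, log_compute))
```

With no spread around the trend the inflation factor is infinite, matching the point that a perfect year-compute correlation would make the decomposition impossible. Some spread brings it down to a finite value, at the cost of wide uncertainty.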
The problems you mention are even more extreme with dataset size because plenty of the models in our analysis were only trained on ImageNet-1k, which has around 1M images. So more than half of the models in our dataset actually just use the exact same training set, which makes our model highly uncertain about the dataset side of things.
In addition, the way people typically incorporate extra data is by pretraining on bigger, more diverse datasets and then fine-tuning on ImageNet-1k. This is obviously different from sampling more images from the training distribution of ImageNet-1k, though bigger datasets such as ImageNet-21k are constructed on purpose to be similar in distribution to ImageNet-1k. We actually tried to take this into account explicitly by introducing some kind of transfer exponent between different datasets, but this didn’t really work better than our existing model.
One final wrinkle is the irreducible loss of ImageNet. I tried to get some handle on this by reading the literature, and I think I would estimate a lower bound of maybe 1-2% for top 1 accuracy, as this seems to be the fraction of images that have incorrect labels. There’s a bigger fraction of images that could plausibly fit multiple categories at once, but models seem to be able to do substantially better than chance on these examples, and it’s not clear when we can expect this progress to cap out.
Our model specification assumes that in the infinite compute and infinite data limit you reach 100% accuracy. This is probably not exactly right because of irreducible loss, but because models are currently at over 90% top-1 accuracy I think it’s probably not too big of a problem for within-distribution inference, e.g. answering questions such as “how much software progress did we see over the past decade”. Out-of-distribution inference is a totally different game and I would not trust our model with this for a variety of reasons. The biggest reason is really the lack of diversity and the limited size of the dataset, which doesn’t have much to do with our choice of model.
To be honest, I think ImageNet-1k is just a bad benchmark for evaluating computer vision models. The reason we have to use it here is that all the better benchmarks that correlate better with real-world use cases have been developed recently and we have no data on how past models perform on these benchmarks. When we were starting this investigation we had to make a tradeoff between benchmark quality and the size & diversity of our dataset, and we ended up going for ImageNet-1k top 1 accuracy for this reason. With better data on superior benchmarks, we would not have made this choice.
Any speculations on the implications for the rate of algorithmic progress on AGI/TAI/etc. (where algorithmic progress here means how fast the necessary training compute decreases over time), given that AGI is a different kind of “task,” and it’s a “task” that hasn’t yet been “solved,” and the ways of making progress are more diverse?
I would guess that making progress on AGI would be slower. Here are two reasons I think are particularly important:
ImageNet accuracy is a metric that can in many ways be gamed; so you can make progress on ImageNet that is not transferable to more general image classification tasks. As an example of this, in this paper the authors conduct experiments which confirm that adversarially robust training on ImageNet degrades ImageNet test or validation accuracy, but robustly trained models generalize better to classification tasks on more diverse datasets when fine-tuned on them.
This indicates that a lot of the progress on ImageNet is actually “overlearning”: it doesn’t generalize in a useful way to tasks we actually care about in the real world. There’s good reason to believe that part of overlearning would show up as algorithmic progress in our framework, as people can adapt their models better to ImageNet even without extra compute or data.
Researchers have stronger feedback loops on ImageNet: they can try something directly on the benchmark they care about, see the results and immediately update on their findings. This allows them to iterate much faster and iteration is a crucial component of progress in any engineering problem. In contrast, our iteration loops towards AGI operate at considerably lower frequencies. This point is also made by Ajeya Cotra in her biological anchors report, and it’s why she chooses to cut the software progress speed estimates from Hernandez and Brown (2020) in half when computing her AGI timelines.
Such an adjustment seems warranted here, but I think the way Cotra does it is not very principled and certainly doesn’t do justice to the importance of the question of software progress.
Overall I agree with your point that training AGI is a different kind of task. I would be more optimistic about progress in a very broad domain such as computer vision or natural language processing translating to progress towards AGI, but I suspect the conversion will still be significantly less favorable than any explicit performance metric would suggest. I would not recommend using point estimates of software progress on the order of a doubling of compute efficiency per year for forecasting timelines.
How can I convert “percents” of progress into multipliers? That is, progress = a*b, but percents assume a+b.
For example, if progress is 23 times, and 65 percent of it is a, how many times is a?
You would do it in log space (or geometrically). For your example, the answer would be 23^0.65 ≈ 7.68.
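The log-space conversion can be sketched in a few lines. The helper name `factor_multiplier` is ours: a factor credited with a given share of total progress gets the total multiplier raised to that share, so the shares add in log space and the multipliers multiply back to the total.

```python
import math

def factor_multiplier(total_multiplier: float, share: float) -> float:
    """Multiplier attributable to a factor that accounts for `share`
    (between 0 and 1) of progress, with shares measured in log space."""
    return total_multiplier ** share

total = 23.0
a = factor_multiplier(total, 0.65)  # the factor with 65% of the progress
b = factor_multiplier(total, 0.35)  # everything else

print(f"a = x{a:.2f}, b = x{b:.2f}, a * b = x{a * b:.2f}")
print(f"log-share of a: {math.log(a) / math.log(total):.2f}")
```

The same rule extends to any number of factors whose shares sum to one.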