The paper you link is pretty interesting, but I don’t think it supports the idea that general capabilities improvement is more about data than algorithms. What they actually show good evidence for is that a small part of algorithmic progress has consisted of finding algorithms that give a flat bump to performance, while most of it has come from finding algorithms that scale better with the available compute.
They run ablations on a really tiny transformer (3M parameters) and show a 3.something× improvement from adding modern algorithmic improvements. Meanwhile, the nanoGPT project has added basically the same improvements to GPT-2 (124M parameters) and gotten a 10.something× improvement.
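Here’s a toy sketch of the distinction I mean, under an assumed power-law loss curve (the coefficients and exponents below are invented for illustration, not taken from the paper): a change that only lowers the coefficient gives a compute-equivalent gain that is flat in compute, while a change that improves the exponent gives a gain that keeps growing with compute.

```python
# Toy illustration (not from the paper): "flat bump" vs. scale-dependent gains
# under an assumed power-law loss curve L(C) = a * C**(-alpha).
def loss(compute, a, alpha):
    return a * compute ** (-alpha)

def compute_equivalent_gain(compute, baseline, improved):
    """How much extra compute the baseline needs to match the improved loss at `compute`."""
    a_b, alpha_b = baseline
    a_i, alpha_i = improved
    target = loss(compute, a_i, alpha_i)
    c_match = (a_b / target) ** (1.0 / alpha_b)  # solve a_b * C**(-alpha_b) = target
    return c_match / compute

baseline = (10.0, 0.050)
flat_bump = (9.5, 0.050)          # better coefficient, same exponent
better_exponent = (10.0, 0.055)   # same coefficient, better exponent

for c in (1e15, 1e18, 1e21):  # pretraining FLOP, from tiny to large runs
    print(f"C={c:.0e}  flat={compute_equivalent_gain(c, baseline, flat_bump):.1f}x  "
          f"scale-dependent={compute_equivalent_gain(c, baseline, better_exponent):.0f}x")
```

The flat-bump gain comes out the same at every scale, while the better-exponent gain grows with compute, which is the pattern the paper’s decomposition (and the nanoGPT comparison above) suggests.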
EDIT: That said, the “eyeball test” estimation of AI smarts might well be improving more because of better data and data-use than because of slightly lower loss on the Pile, I agree with that.
I don’t think it helps support the idea that it’s data and not algorithms
Agreed. Gundlach et al. find and categorize specific algorithmic advances (non-data) that they claim explain 6,930× of gains, out of a total gain of 22,000× estimated (“naively extrapolating”) from Ho et al. That is, they explain all but a remaining factor of roughly 3 with algorithms. Quoting from the paper:
Though our experiments do not claim to be exhaustive, we compare our findings with estimates from the literature. Namely, between 2012 to 2023, Ho et al. [2024] found a doubling time of 8 months, or 2.83× per year, for a total efficiency gain of 22,000×. In contrast, the growth rate of our CEG multiplier is approximately 2.23× annually, for a total of 6,930×, of which 2,700× (89%) is due to scale-dependent changes. This leaves a gap of 3.18× from our estimates, which could be from data selection, tokenizer advancements, or a long tail of innovations not captured in our analysis.
First off, accounting for all but a factor of ~3 is very good (frankly, I think it’s too good and should be taken with a grain of salt). Second, a naive reading would imply that data has contributed at most a factor of 3 over 11 years.
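For concreteness, here’s the back-of-the-envelope arithmetic I’m relying on (my own sketch of the quoted numbers, not the paper’s code; the 11-year span is my reading of the 2012–2023 window):

```python
import math

total_ho = 22_000        # Ho et al.'s extrapolated efficiency gain
total_ceg = 6_930        # gains Gundlach et al. attribute to specific (non-data) advances
scale_dependent = 2_700  # portion of total_ceg from scale-dependent changes

# The unexplained gap -- what a "data did the rest" reading would have to cover.
print(total_ho / total_ceg)                              # ~3.2, the quoted ~3.18x

# 2.23x per year compounded over ~11 years roughly reproduces the CEG total.
print(2.23 ** 11)                                        # ~6,800, close to 6,930

# Share of the CEG total that is scale-dependent, taken in log space
# (gains are multiplicative, so shares are ratios of logs).
print(math.log(scale_dependent) / math.log(total_ceg))   # ~0.89, the quoted 89%
```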
But I think the experiments in this paper use validation loss on a pretraining dataset, whereas performance on downstream tasks seems especially likely to be affected by better data (i.e., the 22,000× they are trying to account for might not be influenced much by better data in the first place, since it too is based on loss).
(This comment is not meant to take a stand on the overall question of how much data vs. non-data algorithmic innovation has contributed, just the bearing of Gundlach et al. on this question.)
Coauthor of the paper here. It’s helpful to hear how people have interpreted our work, though I want to try to clear up a few things quickly.
Firstly, we don’t claim that data must make up the remaining gap, and we certainly are not claiming it explains capabilities improvements more broadly. We try to be pretty precise about the definition of algorithmic progress under the CEG framework, which is specifically computed using loss rather than downstream performance metrics (I agree, the relationship between these is not straightforward or well understood). We say explicitly that the CEG framework cannot capture a lot of important innovations for performance and efficiency (instruction fine-tuning, constitutional AI, parallelism, etc.). We also show that the CEG framework has highly counterintuitive, undesirable properties which make interpretation quite difficult.
Speaking for myself and not my coauthors here: It’s flattering you think our results are too good, though I’m not sure that was the intent of your comment :) I think the results would be interesting even if we over/undershot existing estimates by 10×. We find two important results, which are true regardless of our estimates’ alignment with the literature: scale seems to be necessary for much of the perceived algorithmic gains, and scale-dependence makes the framework for measuring these gains behave strangely. Those two facts are worth considering in their own right, and are true for even modest differences in scaling exponents across architectures. It is reaffirming that we recover so much of the other estimates, but strictly speaking, we could have found way more or way less, and our results would still raise important points.
Lastly, I don’t think our results point to “better data has made all the difference.” They really point to “a lot of new, specific things lead to a lot of new, specific differences in capabilities, and it’s hard to count those up using the same units.” I think that’s a really important (and difficult) direction for further research, which may shed light on how important data has been!
Thanks, I’ll edit the post to note I misinterpreted the paper.