So this post, brought to you by Beren today, is about how a lot of claims about within-paradigm algorithmic progress are actually mostly about just getting better data, leading to a Flynn effect. The reason I’m mentioning this is that once we have to actually build new fabs and we run out of data in 2028-2031, progress will be slower than people expect (assuming we haven’t reached AGI by then).
Edit: I incorrectly interpreted what the paper’s results actually were for the question of data vs non-data inputs to algorithmic progress.
When forecasting AI progress, forecasters and modellers often break it down into two components: increased compute, and ‘algorithmic progress’. My argument here is that the term ‘algorithmic progress’ for ‘the remainder after compute’ is misleading, and that we should really think about and model AI progress as three terms – compute, algorithms, and data. My claim is that a large fraction (but certainly not all) of the AI progress currently conceived of as ‘algorithmic progress’ is actually ‘data progress’, and that the term ‘algorithmic’ gives a false impression of the key forces and improvements that have driven AI progress over the past three years or so.
From experience in the field, there have not been that many truly ‘algorithmic’ improvements with massive impact. The primary ones, of course, are the switch to RLVR and figuring out how to do mid-training (although both of these are vitally dependent upon the datasets). Other minor ones include things like qk-norm, finegrained experts and improvements to expert balancing, and perhaps the Muon optimizer. The impact of most of these is utterly dwarfed by ‘better’ data, however, and this is something that pure scaling and flop-based analyses miss.
Models today are certainly trained using vastly more flops than previously, but they are also trained on significantly ‘higher quality’ data, where ‘high quality’ means aligned with the specific tasks we care about the models being able to perform (cynically: the evals). The models are not getting so good by scale alone. A GPT4-scale model trained on the GPT3 dataset would be substantially worse across all benchmarks, even if we somehow replicated the GPT3 dataset up to the scale of GPT4’s dataset. However, this model was never released (and probably never trained), so improvements in data are easily hidden and misattributed to scale or other progress. An easy way to see this is to look at model improvements for a fixed flop count and model size. These improvements have been substantial, as projects like the Phi series show.
It is very noticeable that, e.g., Qwen3 uses an architecture and training setup practically identical to Llama2’s and yet achieves vastly greater performance, which would require many more OOMs of flops even if you could train on an infinite Llama2 dataset. This is almost entirely because the Qwen3 datasets are not only bigger but, crucially, much more closely aligned with the capabilities we care about the models having – e.g. the capabilities that we measure and benchmark.
My opinion here is that we have essentially been seeing a very strong Flynn effect for the models, which explains a large proportion of recent gains as we switch from almost totally uncurated web data to highly specialized synthetic data which perfectly (and exhaustively) targets the tasks we want the models to learn. It’s like the difference between giving an exam to some kid who wandered in from the jungle vs one who has been obsessively tiger-parented from birth to do well at this exam. Clearly the tiger-parented one will do vastly better with the same innate aptitude, because their entire existence has been constructed to make them good at answering things similar to the exam questions, even if they have never seen the exact exam questions themselves before. Conversely, the jungle kid probably destroys the tiger-parented kid at various miscellaneous jungle-related skills, but nobody measures or cares about these because they are irrelevant for the vast, vast majority of tasks people want the jungle kid to do.

Translating this metaphor back to LLM-land: Qwen3 has seen vast amounts of synthetic math and code and knowledge-based multiple-choice questions, all designed to make it as good as possible on benchmarks, while Llama2 has seen mostly random web pages which occasionally, incidentally contain some math and code but with very little quality filtering. Llama2 probably destroys Qwen3 at knowing about obscure internet forum posts from 2008, at precisely understanding the distribution of internet spam at different points throughout history, and at knowing all the ways in which poor common-crawl parsing can create broken-seeming documents, but nobody (quite rightly) thinks these skills are important, worth measuring, or relevant for AGI.
One way to track this is the sheer amount spent on data-labelling companies by the big labs. ScaleAI’s and SurgeAI’s revenues each sit around $1B, and most of this, as far as I can tell, comes from data labelling for big AI labs. This spend is significantly less than compute spend, it is true, but it must nevertheless contribute a significant fraction of a lab’s total spending. I don’t have enough data to claim this, but it seems at least plausible that this spend is increasing at a similar rate to compute spend (e.g. 3-4x per year), albeit from a much lower base.
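As a rough illustration of what “similar growth rate from a much lower base” implies (all numbers here are made up for the sketch, not figures from the post): if data-labelling spend grows at the same yearly multiplier as compute spend, its share of total spend stays constant even as absolute spend explodes.

```python
# Illustrative only: an assumed $1B data-spend base vs an assumed $10B
# compute base, both growing 3.5x/year (a rate within the post's 3-4x range).
compute0, data0, rate = 10.0, 1.0, 3.5  # $B, $B, yearly multiplier

shares = []
for year in range(4):
    compute = compute0 * rate ** year
    data = data0 * rate ** year
    shares.append(data / (compute + data))

# Equal multiplicative growth keeps data's share of total spend constant
# at 1/11, i.e. roughly 9.1%, in every year.
print([round(s, 3) for s in shares])  # → [0.091, 0.091, 0.091, 0.091]
```

So under this (assumed) equal-growth scenario, data spend never overtakes compute spend, but it also never becomes negligible.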
When we see frontier models improving at various benchmarks, we should think not just of increased scale and clever ML research ideas but of billions of dollars spent paying PhDs, MDs, and other experts to write questions and provide example answers and reasoning targeting these precise capabilities. With the advent of outcome-based RL and the move towards more ‘agentic’ use-cases, this data also includes custom RL environments, which are often pixel-perfect replications of commonly used environments – specific websites like Airbnb or Amazon, browsers, terminals, computer file-systems, and so on – alongside large amounts of human trajectories exhaustively covering the most common use-cases of these systems.
In a way, this is like a large-scale reprise of the expert-systems era: instead of paying experts to directly program their thinking as code, we pay them to provide numerous examples of their reasoning and process, formalized and tracked, and then distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines, since the fact that frontier systems need such effortfully designed, extremely high-quality human trajectories and environments implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill, and hand-coding (albeit with AI assistance) every single possible task into an RL-gym, seems likely to be inordinately expensive, to take a very long time, and to be unlikely to suddenly bootstrap to superintelligence.
There is some intriguing evidence that actual algorithmic progress is beginning to contribute more than in the past few years. Clearly there have been algorithmic breakthroughs enabling RL to start working (although this is also substantially a data breakthrough, in that the default policies of LLMs became good enough that there is no longer an exploration problem in RL training: the default policy already gets nontrivial reward). We have also started to see big labs embrace bigger changes to architecture than previously, such as Deepseek’s MLA and Google’s recent Gemma3n release. Finally, Muon is starting to gain traction as an optimizer to displace AdamW. There have also been improvements in mid-training recipes, although again these are heavily entangled with the data. This is in contrast to the 2022-2024 era, which was largely about scaling up model size and data size and increasing data quality, while the actual core training methods and architectures remained essentially unchanged. If so, it is possible that the trend lines will continue and we will simply move towards greater actual algorithmic progress as the cheap improvements from data progress slow.
One way this could be quantified relatively straightforwardly is to run fixed-compute ablation experiments: train a 2022 or a 2025 frontier architecture and training recipe on either 2022 data (the Pile?) or 2025 data (the Qwen3 training set?) and see where the gains in fact come from. My money would be very substantially on the datasets, but I could be wrong here and could be missing some key factors.
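A minimal sketch of what that fixed-compute ablation could look like as a 2×2 grid. Everything here – the placeholder architecture/dataset names, the dummy scores, and the crude main-effects attribution – is my own illustration of the experimental design, not a claim about how any lab runs such a study.

```python
from itertools import product

# Hypothetical fixed-compute ablation grid: 2022 vs 2025 architecture/recipe
# crossed with 2022 vs 2025 data. All names are placeholders.
ARCHS = {"2022": "llama2-style", "2025": "qwen3-style"}
DATA = {"2022": "the-pile", "2025": "qwen3-style-mix"}

def run_grid(train_and_eval):
    """train_and_eval(arch, dataset) -> benchmark score at a fixed flop budget."""
    return {(ay, dy): train_and_eval(arch, data)
            for (ay, arch), (dy, data) in product(ARCHS.items(), DATA.items())}

def attribute_gains(r):
    # Crude additive attribution: average the effect of changing one axis
    # while holding the other fixed (a 2x2 "main effects" decomposition).
    data_gain = ((r[("2022", "2025")] - r[("2022", "2022")]) +
                 (r[("2025", "2025")] - r[("2025", "2022")])) / 2
    algo_gain = ((r[("2025", "2022")] - r[("2022", "2022")]) +
                 (r[("2025", "2025")] - r[("2022", "2025")])) / 2
    return {"data": data_gain, "algorithms": algo_gain}

# Dummy benchmark scores standing in for real training runs (pure fiction):
fake_scores = {("llama2-style", "the-pile"): 30.0,
               ("llama2-style", "qwen3-style-mix"): 55.0,
               ("qwen3-style", "the-pile"): 35.0,
               ("qwen3-style", "qwen3-style-mix"): 65.0}
grid = run_grid(lambda arch, data: fake_scores[(arch, data)])
print(attribute_gains(grid))  # with these fake numbers, data dominates
```

In practice the interesting case is when the two axes interact (an algorithm that only pays off on the better data), which is exactly what a simple additive split like this would fail to capture – so a real study would report all four cells, not just the two averages.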
Is running out of data still a problem? It sounds like we’ve already moved to a paradigm of creating new higher-quality data and not relying on random data from the internet. In some sense we ran out of data a while ago and progress hasn’t stopped because we’re just making more of it now.
I think I agree with this? “Most algo progress is data progress” “Yep. Still counts though.”
I think this is a reasonable take. Here’s the opposite hypothesis:
“What are you talking about? These companies are giant juggernauts that are building a huge pipeline that does the following: (1) Identify economically valuable skills that AIs are missing (2) Collect/construct training environments / data to train those skills (3) Include those environments in the next big training runs, so that future AIs are no longer missing those skills. Already this seems like the sort of economic engine that could just keep churning until basically the whole world has been transformed. Is it AGI? No, it’s still massively less efficient than the human brain. But it might nevertheless automate most jobs within a decade or so, and then continue churning along, automating new jobs as they come up. AND that’s not taking into account three additional important factors: (4) The AIs are already generalizing to unseen skills/tasks to some extent, e.g. Claude is getting better at Pokemon despite not having been trained on Pokemon. Thus there might be a sort of ‘escape velocity’ effect where, after the models get big enough and have been trained on enough diverse important tasks, they become able to do additional new tasks with less and less additional training, and eventually can just few-shot-learn them like humans. If this happens then they really are AGI in the relevant sense, while still being less data-efficient than humans in some sense. (5) The AIs are already accelerating coding to some extent. The aforementioned pipeline that does steps 1-3 repeatedly to gradually automate the economy? That pipeline itself is in the process of getting automated as we speak. If you like you can think of the resulting giant automated corporation as itself an AGI that learns pretty fast, perhaps even faster than humans (albeit still less efficiently than humans in some sense). (Faster than humans? 
Well yeah; consider how fast AIs have improved at math over the last two years as companies turned their efforts towards training math skills; then consider what’s happening to agentic coding; compare to individual human mathematicians and programmers, who take several times as long to cross the same skill range during school.) (6) Even if those previous two claims are wrong and the current paradigm just won’t count as AGI, period, if AI R&D gets accelerated significantly then the new paradigms that are necessary should be a few years away rather than decades away. And it seems that the current paradigm might suffice to accelerate R&D significantly, even if it can’t automate it completely.
Which of these two competing hypotheses is less wrong? I don’t know, but I still have substantial weight on the second.
I wish there were some quantitative analysis attempting to distinguish the two. Questions I’d love to see quantitative answers to: How much would it cost to give every major job in the economy the treatment math and coding are currently getting? How much will that cost go down as AIs partially automate the pipeline? How much are AIs generalizing already? (This one is hard to answer because the companies are quiet about their training data.) Is the generalization radius increasing as models get smarter and are trained on more diverse stuff, or does it seem to be plateauing, or entirely a function of e.g. pretraining loss?
...
Huh, I wonder if this helps explain some of the failures of the agents in the AI Village. Maybe a bunch of these custom RL environments are buggy, or at least more buggy than the actual environments they are replicating, and so maybe the agents have learned to have a high prior that if you try to click on something and it doesn’t work, it’s a bug rather than user error. (Probably not though. Just an idea.)
I think this is less likely than I did a year ago, and a lot of this is informed by Steve Newman’s blog post on a project not being a bundle of tasks.
My median expectation is that by 2030 we get a 1-3 year task horizon at 50% reliability, and a 1-3 month task horizon at 80% reliability. Under this view that is not enough to automate away managers and, depending on how much benchmarks diverge from reality, may not even be enough to automate away most regular workers. My biggest probable divergence from you is that I don’t expect super-exponential progress to come soon enough to bend these curves up, because trend breaks lead me to put much less weight than you do on superexponential progress in the next 5 years.
Here’s the link for a project is not a bundle of tasks.
I have nothing to say on the rest of your comment.
Your link to “a project is not a bundle of tasks” is broken. Presumably it should be this.
Correct on that.
The paper you link is pretty interesting, but I don’t think it supports the idea that general capabilities improvement is more data than algorithms. Instead, what they show good evidence for is that a little algorithmic progress has consisted of finding algorithms that give a flat bump to performance, while most of it has come from finding algorithms that scale better with the available compute.
They do an ablation on a really tiny transformer (3M parameters) and show a 3.something-times improvement from adding modern algorithmic improvements. Meanwhile, the nanoGPT project has added basically the same improvements to GPT-2 (124M parameters) and gotten a 10.something-times improvement.
EDIT: That said, the “eyeball test” estimation of AI smarts might well be improving more because of better data and data-use than because of slightly lower loss on the Pile, I agree with that.
Agreed. Gundlach et al. are able to find and categorize specific algorithmic advances (non-data) that they claim explain 6,930× of gains, out of a total amount of gains estimated (“naively extrapolating”) by Ho et al. of 22,000×. That is, they explain all but another factor of 3 with algorithms. Quoting from the paper:
First off, making up for all but 3× is very good (frankly, I think too good and should be taken with a grain of salt). Second, naively reading this would imply data has contributed at most a factor of 3 over 11 years.
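Spelling out the arithmetic behind “all but a factor of 3”, using only the two figures quoted above:

```python
total_gain = 22_000   # Ho et al.'s naively extrapolated total efficiency gain
explained = 6_930     # gains Gundlach et al. attribute to specific algorithms
residual = total_gain / explained
print(round(residual, 2))  # the unexplained remainder: a factor of ~3.17
```

That residual factor of ~3.17 over 11 years is the most that data (or anything else unaccounted for) could be contributing under a naive reading of these two estimates.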
But I think the experiments in this paper use validation loss on a pretraining dataset, whereas performance on downstream tasks seems especially likely to be affected by better data (i.e., the 22,000× they are trying to account for might not even be influenced much by better data, as it too is based on loss).
(This comment is not meant to take a stand on the overall question of how much data vs. non-data algorithmic innovation has contributed, just the bearing of Gundlach et al., on this question.)
Coauthor of the paper here. It’s helpful to hear how people have interpreted our work, though I want to try to clear up a few things quickly.
Firstly, we don’t claim that data must make up the remaining gap, and we are certainly not claiming it explains capabilities improvements more broadly. We try to be pretty precise about the definition of algorithmic progress under the CEG framework, which is specifically computed using loss rather than downstream performance metrics (I agree, the relationship between these is not straightforward or well understood). We say explicitly that the CEG framework cannot capture a lot of innovations that are important for performance and efficiency (instruction fine-tuning, constitutional AI, parallelism, etc.). We also show that the CEG framework has highly counterintuitive, undesirable properties which make interpretation quite difficult.
Speaking for myself and not my coauthors here: It’s flattering you think our results are too good, though I’m not sure that was the intent of your comment :) I think the results would be interesting even if we over/undershot existing estimates by 10x. We find two important results, which are true regardless of our estimates’ alignment with the literature: scale seems to be necessary for much of the perceived algorithmic gains, and scale-dependence makes the framework for measuring these gains behave strangely. Those two facts are worth considering in their own right, and are true for even modest differences in scaling exponents across architectures. It is reaffirming that we recover so much of the other estimates, but strictly speaking, we could have found way more or way less, and our results would still raise important points.
Lastly, I think our results don’t point to “better data has made all the difference.” They really point to “a lot of new, specific things lead to a lot of new, specific differences in capabilities, and it’s hard to count those up using the same units.” I think that’s a really important (and difficult) direction for further research, which may shed light on how important data has been!
Thanks, I’ll edit the post to note I misinterpreted the paper.