However, I’m quite skeptical of this type of consideration making a big difference, because the ML industry has already varied the compute input massively, with over 7 OOMs of compute difference between research now (in 2025) vs. at the time of AlexNet 13 years ago (invalidating the view that there is some relatively narrow range of inputs in which neither input is bottlenecking).
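As a rough sense of scale for the “over 7 OOMs” figure quoted above, here is a minimal back-of-the-envelope sketch. It uses frontier training runs as a proxy for the research compute scale, and the specific FLOP numbers (AlexNet at very roughly 5e17 FLOP, a 2025 frontier run at very roughly 5e25 FLOP) are my own illustrative assumptions, not figures from the comment.

```python
import math

# Illustrative assumptions only (not figures from the comment above):
alexnet_flop = 5e17          # very rough estimate for AlexNet's 2012 training run
frontier_2025_flop = 5e25    # very rough stand-in for a 2025 frontier training run

oom_gap = math.log10(frontier_2025_flop / alexnet_flop)
print(f"Compute gap: roughly {oom_gap:.0f} OOMs")  # ~8 OOMs, i.e. "over 7 OOMs"
```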
Seems like this is a strawman of the bottlenecks view, which would say that the number of near-frontier experiments, not compute, is the bottleneck, and that this quantity didn’t scale up over that time.
ETA: for example, if the compute scale-up had happened but no one had been allowed to run experiments with more compute than AlexNet, it seems a lot more plausible that the compute would have stopped helping, because there just wouldn’t have been enough people to plan the experiments.
Plus the claim that algorithmic progress might have been actively enabled by access to new hardware scales.
Seems like this is a strawman of the bottlenecks view, which would say that the number of near-frontier experiments, not compute, is the bottleneck, and that this quantity didn’t scale up over that time.
Hmm, I mostly feel like I don’t understand this view well enough to address it. Maybe I’ll try to understand it better in the future.
(Also, I think I haven’t seen anyone articulate this view other than you in a comment responding to me earlier, so I didn’t think this exact perspective was that important to address. Edit: maybe we talked about this view in person at some point? Not sure.)
My current low confidence takes:
This view would imply that experiments at substantially smaller (but absolutely large) scale don’t generalize up to higher scale, or at least very quickly hit diminishing returns in generalizing up to higher scale, which seems a bit implausible to me.
An alternative option is to just reduce the frontier scale with AIs: you decide on what training run scale you’re going to run, and you optimize such that you can run many experiments near that scale. Presumably it will still be strictly better to scale up the compute to the extent you can, but maybe you wouldn’t be seeing the full returns of this compute because you optimized at a smaller scale. So, the view would also have to be that the returns diminish fast enough that optimizing at a smaller scale doesn’t resolve this issue. (Concretely, the AI researchers in AutomatedCorp could target a roughly 10^25 FLOP training run, which would mean they’d be giving up maybe 3 OOMs of training FLOP, supposing timelines in the next 5 years or so. This is a bit over 4 years of algorithmic progress they’d be giving up, which doesn’t seem that bad? See the rough arithmetic sketch below.)
I wonder what biology says about this. I’d naively guess that brain improvements on rats generalized pretty well to humans, though we did eventually saturate on these improvements? Obviously very unsure, but maybe someone knows.
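Here is the arithmetic behind the parenthetical in the second take above, written out as a minimal sketch. The assumed frontier scale (~10^28 FLOP in ~5 years) and the assumed rate of algorithmic progress (~0.7 OOMs of effective compute per year) are chosen to match the “3 OOMs ≈ a bit over 4 years” framing, not independent estimates.

```python
import math

# Assumptions chosen to match the parenthetical above, not independent estimates.
target_run_flop = 1e25        # scale the automated researchers optimize around
frontier_run_flop = 1e28      # assumed frontier scale if timelines are ~5 years out

ooms_given_up = math.log10(frontier_run_flop / target_run_flop)  # 3 OOMs

# Assumed rate at which algorithmic progress adds effective compute.
alg_progress_ooms_per_year = 0.7

years_given_up = ooms_given_up / alg_progress_ooms_per_year
print(f"Giving up {ooms_given_up:.0f} OOMs ≈ {years_given_up:.1f} years of algorithmic progress")
```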
This view would imply that experiments at substantially smaller (but absolutely large) scale don’t generalize up to higher scale, or at least very quickly hit diminishing returns in generalizing up to higher scale, which seems a bit implausible to me.
Agree this is an implication. (It’s an implication of any view where compute can be a hard bottleneck—past a certain point you learn 10X less info by running an experiment at a 10X smaller scale.)
But why implausible? Could we have developed RLHF, prompting, tool-use, and reasoning models via loads of experiments on GPT-2-scale models? It does make sense to me that those models just aren’t smart enough to learn any of this, so your experiments would have had zero signal.
An alternative option is to just reduce the frontier scale with AIs
Yeah, I think this is a plausible strategy. If you can make 100X faster progress at the 10^26 scale than at the 10^27 scale, why not do it?
Also, I think I haven’t seen anyone articulate this view other than you in a comment responding to me earlier, so I didn’t think this exact perspective was that important to address.
Well, unfortunately the people actively defending the view that compute will be a bottleneck haven’t been specific about what they think the functional form is. They’ve just said vague things like “compute for experiments is a bottleneck”. In that post I initially gave the simplest model for concretising that claim, and you followed suit in this post when talking about “7 OOMs”, but I don’t think anyone’s said that model represents their view better than the ‘near-frontier experiments’ model.
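To make the contrast concrete, here is a toy sketch of the two candidate functional forms: a simple model where total experiment compute and labour combine (so either can bottleneck), versus a “near-frontier experiments” model where what matters is how many experiments you can run close to the frontier scale. Both functions, and all their parameters, are my own illustrative stand-ins rather than models anyone in this thread has endorsed.

```python
def progress_compute_model(labour: float, experiment_compute: float,
                           rho: float = -2.0) -> float:
    """Simple CES-style model: research output depends on labour and total
    experiment compute; rho < 0 makes the inputs poor substitutes, so either
    one can act as a hard bottleneck."""
    return (0.5 * labour ** rho + 0.5 * experiment_compute ** rho) ** (1.0 / rho)


def progress_near_frontier_model(labour: float, experiment_compute: float,
                                 frontier_run_flop: float,
                                 rho: float = -2.0) -> float:
    """'Near-frontier experiments' model: the compute-side input is the number
    of experiments you can afford within ~2 OOMs of the frontier scale, not
    total compute."""
    near_frontier_experiments = experiment_compute / (frontier_run_flop / 100)
    return (0.5 * labour ** rho + 0.5 * near_frontier_experiments ** rho) ** (1.0 / rho)


# Key difference: in the first model, scaling experiment compute 1000x (labour fixed)
# keeps helping until labour bottlenecks. In the second, scaling experiment compute
# and the frontier scale together by 1000x leaves the number of near-frontier
# experiments unchanged, so the historical compute scale-up needn't have relaxed
# the bottleneck at all.
```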
ETA: for example, if the compute scale-up had happened but no one had been allowed to run experiments with more compute than AlexNet, it seems a lot more plausible that the compute would have stopped helping, because there just wouldn’t have been enough people to plan the experiments.
Hmm, I’m not sure I buy the analogy here. Can’t people just run parametric experiments at smaller scale? E.g., search over a really big space, do evolution-style stuff, etc.?
At a more basic level, I think the relevant “frontier scale” wasn’t varying over the 7 OOMs of compute difference, as algorithmic progress keeps multiplying through the relevant scales and AI companies are ultimately trying to build AGI at whatever scale it takes, right? Like, I think the view would have to be that “frontier scale” varied along with the 7 OOMs of compute difference, but I’m not sure I buy this.
Hmm, I’m not sure I buy the analogy here. Can’t people just run parametric experiments at smaller scale? E.g., search over a really big space, do evolution-style stuff, etc.?
Yeah, agree parametric/evolution stuff changes things.
But if you couldn’t do that stuff, do you agree cognitive labour would plausibly have been a hard bottleneck?
If so, that does seem analogous to if we scale up cognitive labour by 3 OOMs. After all, I’m not sure what the analogue of “parametric experiments” is when you have abundant cognitive labour and limited compute.
Wait, why not? I’d expect that the compute required for frontier-relevant experimentation has scaled with larger frontier training runs.
What is frontier scale, and why is it a property that varies over time? Like, I care about algorithmic improvement relevant to milestones like automated AI R&D and beyond, so I don’t see why the current amount of compute people use for training is especially relevant beyond its closeness to the ultimate level of compute.
Researchers have had (and even published!) tons of ideas that looked promising for smaller tasks and smaller budgets but then failed to provide gains, or hurt more than they helped, at larger scales when combined with their existing stuff. That’s why frontier AI developers “prove out” new stuff in settings that are close to the one they actually care about. [1]
Here’s an excerpt from Dwarkesh’s interview with Sholto and Trenton, where they allude to this:
Sholto Douglas 00:40:32
So concretely, what does a day look like? I think the most important part to illustrate is this cycle of coming up with an idea, proving it out at different points in scale, and interpreting and understanding what goes wrong. I think most people would be surprised to learn just how much goes into interpreting and understanding what goes wrong.
People have long lists of ideas that they want to try. Not every idea that you think should work, will work. Trying to understand why that is is quite difficult and working out what exactly you need to do to interrogate it. So a lot of it is introspection about what’s going on. It’s not pumping out thousands and thousands and thousands of lines of code. It’s not the difficulty in coming up with ideas. Many people have a long list of ideas that they want to try, but paring that down and shot calling, under very imperfect information, what are the right ideas to explore further is really hard.
Dwarkesh Patel 00:41:32
What do you mean by imperfect information? Are these early experiments? What is the information?
Sholto Douglas 00:41:40
Demis mentioned this in his podcast. It’s like the GPT-4 paper where you have scaling law increments. You can see in the GPT-4 paper, they have a bunch of dots, right?
They say we can estimate the performance of our final model using all of these dots and there’s a nice curve that flows through them. And Demis mentioned that we do this process of scaling up.
Concretely, why is that imperfect information? It’s because you never actually know if the trend will hold. For certain architectures the trend has held really well. And for certain changes, it’s held really well. But that isn’t always the case. And things which can help at smaller scales can actually hurt at larger scales. You have to make guesses based on what the trend lines look like and based on your intuitive feeling of what’s actually something that’s going to matter, particularly for those which help with the small scale.
Dwarkesh Patel 00:42:35
That’s interesting to consider. For every chart you see in a release paper or technical report that shows that smooth curve, there’s a graveyard of first few runs and then it’s flat.
Sholto Douglas 00:42:45
Yeah. There’s all these other lines that go in different directions. You just tail off.
[…]
Sholto Douglas 00:51:13
So one of the strategic decisions that every pre-training team has to make is exactly what amount of compute do you allocate to different training runs, to your research program versus scaling the last best thing that you landed on. They’re all trying to arrive at an optimal point here. One of the reasons why you need to still keep training big models is that you get information there that you don’t get otherwise. So scale has all these emergent properties which you want to understand better.
Remember what I said before about not being sure what’s going to fall off the curve. If you keep doing research in this regime and keep on getting more and more compute efficient, you may have actually gone off the path to actually eventually scale. So you need to constantly be investing in doing big runs too, at the frontier of what you sort of expect to work.
[1] Unfortunately, not being a frontier AI company employee, I lack first-hand evidence and concrete numbers for this. But my guess would be that new algorithms used in training are typically proved out within 2 OOM of the final compute scale.
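As a concrete illustration of the “scaling law increments” workflow Sholto describes, here is a minimal sketch: fit a power law to a few small proxy runs and extrapolate to the target scale. The synthetic compute/loss numbers and the simple power-law form are assumptions for illustration only.

```python
import numpy as np

# Synthetic compute/loss points for a few small proxy runs (illustrative numbers only).
compute = np.array([1e20, 1e21, 1e22, 1e23])   # training FLOP
loss = np.array([3.10, 2.75, 2.45, 2.20])      # measured eval loss

# Fit log10(loss) ~ alpha + beta * log10(compute), i.e. loss ~ a * compute**beta.
beta, alpha = np.polyfit(np.log10(compute), np.log10(loss), deg=1)

target_compute = 1e26  # the run you actually care about
predicted_loss = 10 ** (alpha + beta * np.log10(target_compute))
print(f"Predicted loss at 1e26 FLOP: {predicted_loss:.2f}")

# The "imperfect information" point: this prediction assumes the trend holds for
# three more OOMs; interventions that help at small scale can bend or break the
# fitted line at large scale.
```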
Sure, but worth noting that a strong version of this view also implies that all algorithmic progress to date has no relevance to powerful AI (at least if powerful AI is trained with 1-2 OOMs more compute than current frontier models).
Like, this view must implicitly hold that there is a different good being produced over time, rather than that there is a single good, “algorithmic progress”, which takes in the inputs “frontier-scale experiments” and “labor” (because frontier scale isn’t a property that exists in isolation).
This is at least somewhat true as algorithmic progress often doesn’t transfer (as you note), but presumably isn’t totally true as people still use batch norm, MoE, transformers, etc.
Yes, I think that what it takes to advance the AI capability frontier has changed significantly over time, and I expect this to continue. That said, I don’t think that existing algorithmic progress is irrelevant to powerful AI. The gains accumulate, even though we need increasing resources to keep them coming.
AFAICT, it is not unusual for productivity models to account for stuff like this. Jones (1995) includes it in his semi-endogenous growth model, where, as useful innovations accumulate, the rate at which each unit of R&D effort produces further innovations diminishes. That paper claims that this was already known in the literature as a “fishing out” effect.
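For reference, the knowledge production function in Jones (1995) is usually written along these lines (my paraphrase of the standard form; the notation is mine):

```latex
% A: stock of ideas to date;  L_A: R&D labour;  \delta, \lambda, \phi: constants.
\[
  \dot{A} \;=\; \delta \, L_A^{\lambda} \, A^{\phi}, \qquad 0 < \lambda \le 1,\; \phi < 1 .
\]
% \phi < 1 weakens the knowledge spillover relative to fully endogenous models;
% \phi < 0 is the "fishing out" case, where accumulated discoveries make each
% unit of R&D effort less productive at finding new ones.
```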