Seems like this is a strawman of the bottlenecks view, which would say that the number of near-frontier experiments, not compute, is the bottleneck, and that this quantity didn't scale up over that time.
Hmm, I mostly feel like I don’t understand this view well enough to address it. Maybe I’ll try to understand it better in the future.
(Also, I think I haven’t seen anyone articulate this view other than you in a comment responding to me earlier, so I didn’t think this exact perspective was that important to address. Edit: maybe we talked about this view in person at some point? Not sure.)
My current low confidence takes:
This view would imply that experiments at substantially smaller (but absolutely large) scale don't generalize up to a higher scale, or at least very quickly hit diminishing returns in generalizing up to higher scale, which seems a bit implausible to me.
An alternative option is to just reduce the frontier scale with AIs: you decide what training run scale you're going to run and optimize such that you can run many experiments near that scale. Presumably it will still be strictly better to scale up the compute to the extent you can, but maybe you wouldn't be seeing the full returns from this compute because you optimized at a smaller scale. So, the view would also have to be that the returns diminish fast enough that optimizing at a smaller scale doesn't resolve this issue. (Concretely, the AI researchers in AutomatedCorp could target a roughly 10^25 FLOP training run, which would mean they'd be giving up maybe 3 OOMs of training FLOP, supposing timelines in the next 5 years or so. This is a bit over 4 years of algorithmic progress they'd be giving up, which doesn't seem that bad?)
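(As a rough sanity check on the arithmetic in that parenthetical, here is a minimal sketch. It assumes frontier runs of ~10^28 FLOP on these timelines and ~0.7 OOMs/year of effective-compute gains from algorithmic progress; both numbers are my assumptions chosen to reproduce the figures above, not claims from the discussion.)

```python
import math

# Rough sanity check of the tradeoff above. Assumptions (mine): frontier
# training runs of ~1e28 FLOP on ~5-year timelines, and algorithmic progress
# worth ~0.7 OOMs of effective training compute per year.
frontier_flop = 1e28
target_flop = 1e25
algo_ooms_per_year = 0.7

ooms_given_up = math.log10(frontier_flop / target_flop)
years_given_up = ooms_given_up / algo_ooms_per_year
print(f"{ooms_given_up:.0f} OOMs of training FLOP ≈ "
      f"{years_given_up:.1f} years of algorithmic progress")
# -> 3 OOMs of training FLOP ≈ 4.3 years of algorithmic progress
```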
I wonder what biology says about this. I’d naively guess that brain improvements on rats generalized pretty well to humans, though we did eventually saturate on these improvements? Obviously very unsure, but maybe someone knows.
This view would imply that experiments at substantially smaller (but absolutely large) scale don't generalize up to a higher scale, or at least very quickly hit diminishing returns in generalizing up to higher scale, which seems a bit implausible to me.
Agree this is an implication. (It’s an implication of any view where compute can be a hard bottleneck—past a certain point you learn 10X less info by running an experiment at a 10X smaller scale.)
But why implausible? Could we have developed RLHF, prompting, tool-use, and reasoning models via loads of experiments on GPT-2-scale models? It does make sense to me that those models just aren't smart enough to learn any of this, so your experiments would have 0 signal.
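To make that functional form a bit more concrete, here is a toy sketch of what a hard per-experiment compute bottleneck could look like; the cutoff and decay rate are illustrative assumptions of mine, not anyone's stated model.

```python
import math

def info_per_experiment(scale_flop, frontier_flop, cutoff_ooms=2.0):
    """Toy model of the relative information from one experiment.

    Within `cutoff_ooms` of the frontier, assume you get roughly full signal;
    further below, assume each additional 10x reduction in scale gives 10x
    less information (the hard-bottleneck regime described above).
    """
    gap_ooms = math.log10(frontier_flop / scale_flop)
    if gap_ooms <= cutoff_ooms:
        return 1.0
    return 10 ** -(gap_ooms - cutoff_ooms)

frontier = 1e27
for scale in (1e27, 1e25, 1e23, 1e21):
    print(f"{scale:.0e}: {info_per_experiment(scale, frontier):.4f}")
# 1e+27: 1.0000, 1e+25: 1.0000, 1e+23: 0.0100, 1e+21: 0.0001
```

On this picture, GPT-2-scale experiments sit many OOMs below the cutoff and contribute essentially nothing, however many of them you run.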
An alternative option is to just reduce the frontier scale with AIs
Yeah, I think this is a plausible strategy. If you can make 100X faster progress at the 10^26 scale than at the 10^27 scale, why not do it?
Also, I think I haven’t seen anyone articulate this view other than you in a comment responding to me earlier, so I didn’t think this exact perspective was that important to address.
Well, unfortunately the people actively defending the view that compute will be a bottleneck haven't been specific about what they think the functional form is. They've just said vague things like "compute for experiments is a bottleneck". In that post I initially gave the simplest model for concretising that claim, and you followed suit in this post when talking about "7 OOMs", but I don't think anyone's said that model represents their view any better than the 'near frontier experiments' model.
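For what it's worth, here is a minimal sketch of how the two concretisations differ; the specific cutoff and numbers are illustrative assumptions of mine, not a functional form anyone in this thread has committed to.

```python
# Two toy concretisations of "compute for experiments is a bottleneck".
# The functional forms and numbers here are illustrative assumptions only.

def useful_experiments_near_frontier(budget_flop, experiment_flop,
                                     frontier_flop, max_gap_ooms=1.0):
    """'Near-frontier experiments' model: only experiments run within
    ~max_gap_ooms of frontier scale count; smaller ones contribute ~nothing."""
    if experiment_flop < frontier_flop / 10 ** max_gap_ooms:
        return 0
    return int(budget_flop / experiment_flop)

budget, frontier = 1e28, 1e27   # assumed experiment budget and frontier scale

for exp_scale in (1e26, 1e24):
    n = useful_experiments_near_frontier(budget, exp_scale, frontier)
    print(f"per-experiment scale {exp_scale:.0e}: {n} near-frontier experiments")

# The 'simplest' model (the '7 OOMs'-style framing) would score both rows
# identically, since total experiment FLOP is 1e28 either way; the
# near-frontier model scores 100 vs 0.
```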