This view would imply that experiments at a substantially smaller (but still absolutely large) scale don’t generalize up to higher scales, or at least very quickly hit diminishing returns when generalizing to higher scales, which seems a bit implausible to me.
Agree this is an implication. (It’s an implication of any view where compute can be a hard bottleneck—past a certain point you learn 10X less info by running an experiment at a 10X smaller scale.)
But why implausible? Could we have developed RLHF, prompting, tool-use, and reasoning models via loads of experiments on GPT-2 scale models? It does make sense to me that those models just aren’t smart enough to learn any of this, so your experiments have zero signal.
An alternative option is to just reduce the frontier scale with AIs
Yeah, I think this is a plausible strategy. If you can make 100X faster progress at the 10^26 scale than at the 10^27 scale, why not do it?
Also, I think I haven’t seen anyone articulate this view other than you in a comment responding to me earlier, so I didn’t think this exact perspective was that important to address.
Well, unfortunately the people actively defending the view that compute will be a bottleneck haven’t been specific about what they think the functional form is. They’ve just said vague things like “compute for experiments is a bottleneck”. In that post I initially gave the simplest model for concretising that claim, and you followed suit in this post when talking about “7 OOMs”, but I don’t think anyone’s said that model represents their view rather than the ‘near frontier experiments’ model.
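To make the distinction between the two functional forms concrete, here's a minimal sketch. All function names, parameter values, and exact forms are my illustrative assumptions, not anyone's stated model: the "simplest" model has signal fall off smoothly as a power of the scale ratio, while the "near frontier experiments" model has signal collapse to zero more than about an OOM below frontier.

```python
def signal_power_law(s, F, alpha=1.0):
    """'Simplest' model: signal decays smoothly as a power of the scale
    ratio. With alpha = 1, an experiment 10X below frontier yields
    10X less information."""
    return (s / F) ** alpha

def signal_near_frontier(s, F, threshold=0.1):
    """'Near-frontier experiments' model (illustrative step function):
    experiments more than ~1 OOM below frontier carry essentially zero
    signal; experiments near frontier count roughly in full."""
    return 1.0 if s >= threshold * F else 0.0

# Illustrative comparison at a hypothetical 1e27-FLOP frontier:
F = 1e27
for s in [1e27, 2e26, 1e24]:
    print(f"s={s:.0e}: power-law={signal_power_law(s, F):.1e}, "
          f"near-frontier={signal_near_frontier(s, F):.0f}")
```

Under the power-law form, dropping 3 OOMs of scale costs you 3 OOMs of signal but never zeroes it out; under the near-frontier form, the same drop leaves nothing to learn from, which is why the two views have very different implications for running experiments well below frontier.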