It mostly only means that training them compute-optimally will require much more data, and it doesn’t rule out OpenAI-style mostly-parameter scaling at all. Data scaling can be necessary to minimise loss and get optimal estimates of certain entropic variables, while still being unnecessary for general intelligence. Large undertrained models still learn faster. This new paper mostly makes parameter and data scaling both significantly more efficient, but data scaling to a larger degree, such that it’s now optimal to trade off these losses 1:1.
Below the fold is musing and analysis around this question. It is not a direct answer to it though.
We can take a look at the loss function, defined in terms of the irreducible loss $E$ (aka. the unmodelable entropy of language), the number of parameters $N$, and the number of data tokens $D$:

$$L(N, D) = 1.69 + \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}}$$
If we put in the parameters for Chinchilla ($N = 70 \times 10^{9}$, $D = 1.4 \times 10^{12}$), we see $406.4/N^{0.34} \approx 0.083$ and $410.7/D^{0.28} \approx 0.163$. Although these equations have been locally tuned and are not valid in the infinite limit of a single variable, they do roughly say that just scaling parameter counts without training for longer will only tackle about a third of the remaining reducible loss.
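As a rough check of that arithmetic, here is a minimal Python sketch. It assumes the parametric fit $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ with the constants published in the Chinchilla paper ($E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$), and it takes Chinchilla to be 70 billion parameters trained on 1.4 trillion tokens. The function and variable names are mine, chosen for illustration.

```python
# Constants of the parametric loss fit, as published in the Chinchilla paper.
# Treat them as assumptions: they are locally tuned, not exact laws.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_terms(n_params, n_tokens):
    """Return the two reducible loss terms (parameter term, data term)."""
    param_term = A / n_params ** ALPHA
    data_term = B / n_tokens ** BETA
    return param_term, data_term


def chinchilla_loss(n_params, n_tokens):
    """Total predicted loss: irreducible part plus both reducible terms."""
    param_term, data_term = chinchilla_terms(n_params, n_tokens)
    return E + param_term + data_term


# Chinchilla itself: 70e9 parameters, 1.4e12 training tokens (assumed figures).
param_term, data_term = chinchilla_terms(70e9, 1.4e12)

# Share of the reducible loss that parameter scaling alone could remove.
param_share = param_term / (param_term + data_term)

print(f"parameter term: {param_term:.3f}")
print(f"data term:      {data_term:.3f}")
print(f"parameter share of reducible loss: {param_share:.2f}")
```

The parameter share comes out at roughly a third, which is where the "about a third" figure comes from: even with unlimited parameters, the data term would remain untouched.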
Note the implicit assumption that we are working in the infinite data limit, where we never intentionally train on the same tokens twice. If you run out of data, that doesn’t mean you can no longer train your models for longer as you scale; it only means that you will have to make more use of the data you already have, which can mean as little as multiple epochs or as much as sophisticated bootstrapping methods.
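The original scaling laws did not decompose so easily. I present them in simplified form:

$$L(N, D) = \left(\frac{1.54 \times 10^{10}}{N^{0.738}} + \frac{1.8 \times 10^{13}}{D}\right)^{0.103}$$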
(Note that the dataset was different so the exact losses shouldn’t be centered identically.)
This has major issues: there is no irreducible loss, and the values aren’t disentangled. We can still put in the parameters for GPT-3 ($N = 175 \times 10^{9}$, $D = 300 \times 10^{9}$): $1.54 \times 10^{10}/N^{0.738} \approx 78$ and $1.8 \times 10^{13}/D = 60$; or in the limits, $L(N, \infty) \approx 1.57$ and $L(\infty, D) \approx 1.52$. It isn’t clear what this means about the necessary amount of data scaling, as in what fraction of the loss it captures, especially because there is no entropy term, but it does mean that there are still about 1:1 contributions from both losses at the efficient point, at least if you ignore the fact that the equation is wrong. That you have to scale both in tandem to make maximal progress remains true in this older equation; it’s just more convoluted and has different factors.
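To make the comparison for the older law concrete, here is a hedged Python sketch. It assumes the combined fit from the original scaling-laws paper, $L(N, D) = \left[(N_c/N)^{\alpha_N/\alpha_D} + D_c/D\right]^{\alpha_D}$, with $N_c = 6.4 \times 10^{13}$, $D_c = 1.8 \times 10^{13}$, $\alpha_N = 0.076$ and $\alpha_D = 0.103$, and it takes GPT-3 to be 175 billion parameters trained on 300 billion tokens. That paper counts non-embedding parameters, so using the headline 175 billion is itself an approximation. The names are mine.

```python
# Constants of the combined L(N, D) fit from the original scaling-laws paper.
# Assumed values; the fit was made on a different dataset than Chinchilla's.
N_C, D_C = 6.4e13, 1.8e13
ALPHA_N, ALPHA_D = 0.076, 0.103


def kaplan_terms(n_params, n_tokens):
    """Return the two terms inside the bracket (parameter term, data term)."""
    param_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return param_term, data_term


def kaplan_loss(n_params, n_tokens):
    """Predicted loss; note there is no additive irreducible term."""
    param_term, data_term = kaplan_terms(n_params, n_tokens)
    return (param_term + data_term) ** ALPHA_D


INF = float("inf")

# GPT-3: 175e9 parameters, 300e9 training tokens (assumed figures).
gpt3_param_term, gpt3_data_term = kaplan_terms(175e9, 300e9)

# Single-variable limits: infinite data, then infinite parameters.
loss_infinite_data = kaplan_loss(175e9, INF)
loss_infinite_params = kaplan_loss(INF, 300e9)

print(f"parameter term: {gpt3_param_term:.1f}")
print(f"data term:      {gpt3_data_term:.1f}")
print(f"L(N, inf) = {loss_infinite_data:.2f}")
print(f"L(inf, D) = {loss_infinite_params:.2f}")
```

The two bracketed terms are of similar size, which is the sense in which both contribute roughly 1:1. Because they sit inside a shared exponent, neither can be read off as a fraction of the loss the way the Chinchilla terms can.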
I’m not sure how to put this, but while this post is framed as a response to AI risk concerns, those concerns are almost entirely ignored in favor of looking at how plausible it is for near-term human research to achieve it, and only at the end is it connected back to AI risk via a brief aside whose crux is basically that you don’t think Yudkowsky-style ASI will exist.
I like a lot of the discussion if I frame it in my head to be about what it is actually arguing for. Taking it as given, it instead seems broadly a non sequitur, as the evidence given basically doesn’t relate to resolving the disagreement.