On sample efficiency at least: going by the Chinchilla paper’s scaling laws for LLMs, even pushing the number of parameters towards infinity only gets you about 1 OOM less data to reach the same loss, when there’s a 3-6 OOM gap to be explained. And even if we do believe that human sample efficiency is mostly just the prior, the marginal sample efficiency of models is also a lot worse, and prior differences don’t help to explain that.
All quotes taken from the Dwarkesh article on sample efficiency here.
First, the quote about the Chinchilla scaling laws implying that we only gain about 1 OOM of sample efficiency even if we scaled neural nets to an infinite number of parameters:
The way the scaling law equations work is that parameter and data terms are added to the loss independently. If you have a model that is trained compute optimally, and suppose you ask, well what if I just wanna maximize sample efficiency and use less data—and I’ll throw in as many parameters as it takes to make that happen. With the constants from the Chinchilla scaling laws paper (and the nature of the result wouldn’t change even with different constants), even if you increased the number of parameters by infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. Scaling of current models simply can’t make up for that discrepancy. This really does suggest that humans are on a different scaling curve altogether.
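To sanity check the ~10x figure in the quote, here is a rough sketch in Python. It assumes the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta and the fitted constants reported in the Chinchilla paper (E = 1.69, A = 406.4, B = 410.7, alpha = 0.34, beta = 0.28). Neither the formula nor the constants are spelled out in the quoted article, and the function names are mine, so treat this as an illustration rather than a reproduction of Dwarkesh’s calculation.

```python
# Assumed constants: the parametric fit reported in the Chinchilla paper.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params, n_tokens):
    """Parametric loss: irreducible term + parameter term + data term."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def compute_optimal_tokens(n_params):
    """Tokens D that pair with n_params at the compute-optimal point.

    Minimizing the loss at fixed compute (taken proportional to N * D)
    gives the condition ALPHA * A / N**ALPHA == BETA * B / D**BETA.
    """
    data_term = (ALPHA / BETA) * A / n_params**ALPHA  # equals B / D**BETA
    return (B / data_term) ** (1 / BETA)


def tokens_needed_with_infinite_params(target_loss):
    """Tokens needed to hit target_loss when the parameter term is zero."""
    return (B / (target_loss - E)) ** (1 / BETA)


def data_savings_factor(n_params):
    """How many times less data an infinitely large model would need
    to match a compute-optimal model with n_params parameters."""
    d_opt = compute_optimal_tokens(n_params)
    target = loss(n_params, d_opt)
    return d_opt / tokens_needed_with_infinite_params(target)


if __name__ == "__main__":
    for n in (1e9, 70e9, 1e12):
        print(f"N = {n:.0e}: data savings factor = {data_savings_factor(n):.2f}")
```

Under these assumptions the factor comes out to roughly 8.5x regardless of model size. At the compute-optimal point the parameter term is always beta/alpha times the data term, so the factor reduces to (1 + beta/alpha)^(1/beta). That matches the "factor of ~10" in the quote, and different fitted constants would move the number without changing the nature of the result.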
And the quote about sample efficiency for marginal capabilities also being worse in a way that we can’t explain via prior differences:
Even if it were the case that we can explain away the trillions of tokens required to pretrain a base model as catching up to evolution, it doesn’t explain why the marginal capabilities take so much data—once you have been educated, you don’t need 100 different professors to learn a new programming language, but the AIs (even once pretrained) do.
So your proposal, if it worked, would have to have a much more favorable sample efficiency curve with increasing parameters than Chinchilla’s scaling laws.
I’m just noting how big a deal it would be if the catapulted NN idea actually worked. Right now the scaling curves for LLM sample efficiency are terrible even if we add more parameters to the NNs: an infinite number of parameters only cuts the data required for the same loss by about 1 OOM, compared to the 3-6 OOMs of sample efficiency difference between LLMs and humans.
Another useful puzzle is why, even after pre-training ends and the priors should have been baked in, AIs still require 100 different professors to learn a new programming language, while humans often need 1-3 professors at most. That implies a roughly 1.5-2 OOM sample efficiency advantage for humans even when the priors baked in by pre-training are taken into account.
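The orders-of-magnitude bookkeeping here can be written out as a quick sketch. The inputs are just the rough figures quoted in this comment (a 3-6 OOM human/LLM gap, a ~10x gain from unlimited parameters, 100 professors versus 1-3), and the variable names are made up for illustration.

```python
import math


def ooms(ratio):
    """Express a ratio as orders of magnitude (base-10 log)."""
    return math.log10(ratio)


# Rough figures taken from the text above, not measured values.
human_gap_ooms = (3, 6)          # human vs LLM sample efficiency gap
infinite_param_gain = ooms(10)   # ~10x less data with unlimited parameters

# Gap still unexplained after granting the model infinite parameters.
remaining_gap = tuple(g - infinite_param_gain for g in human_gap_ooms)

# Post-pretraining gap: 100 professors for an AI vs 1-3 for a human.
professor_gap = (ooms(100 / 3), ooms(100 / 1))

if __name__ == "__main__":
    print("remaining gap (OOMs):", remaining_gap)
    print("professor gap (OOMs):", professor_gap)
```

So scaling parameters alone would still leave about 2-5 OOMs unexplained, and the professor comparison works out to somewhere between 1.5 and 2 OOMs depending on whether the human needs three professors or one.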
One final point about prior differences, or lack thereof:
Many billions of years of evolution is our pre-training, so it’s unfair to compare how little data we see simply within our lifetime to what these cold-started LLMs have to learn from.
Our genome is 3GB, about 1-2% protein coding. That is just not enough space to store the model parameters that are supposedly pretrained (frontier models are terabytes sized). The closer analogy is probably that evolution has found the right hyperparameters and loss functions (Sidenote: I had an interesting podcast with Adam Marblestone where he argued that the loss functions were the more significant find from evolution), but that the equivalent of parameter training is still happening within lifetime, and is encoded in the map of neural connections in the brain built up over a lifetime.
I generally agree with this, though I will make some comments:
That, to be fair, is probably what gets targeted because back in the day (and to a lesser extent even now) the amount of data was clearly much larger than the amount of compute, so sample inefficiency was not really a problem. There’s still a reasonable chance that it doesn’t actually matter for LLMs being transformative in the way we want.
And it could very well be that at least part of the answer to the puzzle is that you cannot train a model that is both compute-optimal and data-optimal, and you have to choose one or the other to target.
To be fair, companies are very conservative with architecture changes, and again that’s because right now they don’t need to make them, and trying would carry serious downside risk for their profitability. That said, it’s definitely a lot less true for research.
Note that Dwarkesh does not claim that it’s impossible, only that it requires (at least) a non-trivial amount of research to solve the issue. If there is an error, it’s that he didn’t realize there was already research that had made progress on the issue.
That said, the reason for this is partly ignorance of the research and partly that companies are understandably conservative about trying new things. They have pretty good reason to be: AI is now in the era where you can actually make real profits, since AIs are good enough to do real economic work, which means your products have to be reliable, and new tech is often unreliable.