It’s a good metaphor, but I think one important aspect this misses is overfitting: when you have a lot of parameters, the NN can literally memorise a small training set until it gets 100% on the training set and 0% on the test set, whereas a smaller model is forced to learn the underlying structure and so generalises better.
Hence larger models need a much larger training set even to match smaller models, which is another disadvantage of larger models (besides the higher per-token training and inference costs).
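To make the overfitting point concrete, here is a minimal toy sketch (my own illustration, not from the comment above), using polynomial degree as a stand-in for model capacity: the high-capacity fit nearly interpolates a tiny noisy training set but typically does worse on held-out data than the low-capacity fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Underlying structure plus a little label noise.
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(10)    # deliberately tiny training set
x_test, y_test = make_data(1000)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a polynomial of the given capacity
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# Typically the degree-9 fit drives train MSE to (near) zero while its test MSE is
# much higher than the degree-3 fit's, which captures the trend and generalises better.
```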
larger models need a much larger training set even to match smaller models
This is empirically false: perplexity on a held-out test set goes down as model size increases, even for a fixed dataset. See, for example, Figure 2 in the Llama 3 report: larger models do better at, say, 1e10 tokens on that plot.
Larger models could be said to want a larger dataset, in the sense that if you are training compute-optimally, then with more compute you want both the model size and the dataset size to increase, so the two grow together. But even on a dataset of the same size they still do better, at least while reasonably close to the compute-optimal number of tokens.
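A rough way to see the shape of this claim is a Chinchilla-style parametric loss, L(N, D) = E + A/N^alpha + B/D^beta, with N the parameter count and D the training tokens. This is a sketch under assumptions: the constants below are roughly the values reported by Hoffmann et al. (2022), used here only for illustration, and this functional form does not model overfitting at all, which is part of why it only describes the regime reasonably near compute-optimal.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are approximately the Hoffmann et al. (2022) fit; treat them as illustrative.
def loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

fixed_tokens = 1e10  # dataset size held constant, as in the Llama 3 plot discussed above
for n_params in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n_params:.0e}: predicted loss {loss(n_params, fixed_tokens):.3f}")

# The predicted loss keeps falling as N grows even though D is fixed, consistent with
# larger models doing better on the same dataset. Note the form is monotone in N, so it
# says nothing about the far-from-compute-optimal regime.
```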
This was the classical intuition, but it turned out to be untrue in the regime of large NNs.
The modern view is double descent (https://en.wikipedia.org/wiki/Double_descent): smaller models generalize better until the number of parameters exceeds the number of training examples, after which larger models generalize better with the same amount of data.
[Figure from Scaling Laws for Neural Language Models, https://arxiv.org/pdf/2001.08361]
also see the grokking literature: https://en.wikipedia.org/wiki/Grokking_(machine_learning)
Previous discussion:
https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent
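The model-wise double descent curve described above is easy to reproduce in a toy setting. Below is a minimal sketch of my own construction (random ReLU features with a minimum-norm least-squares fit; it is not code from any of the linked sources): test error typically rises as the number of features approaches the number of training examples and falls again past that threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_fit(n_features, x_train, y_train, x_test, y_test, d=5):
    # Fixed random projection followed by ReLU features.
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    phi = lambda x: np.maximum(x @ W, 0.0)
    # Minimum-norm least squares; interpolates the training set once n_features >= n_train.
    beta = np.linalg.pinv(phi(x_train)) @ y_train
    return np.mean((phi(x_test) @ beta - y_test) ** 2)

n_train, d = 40, 5
x_train = rng.normal(size=(n_train, d))
y_train = np.sin(x_train @ np.ones(d)) + 0.1 * rng.normal(size=n_train)  # noisy labels
x_test = rng.normal(size=(2000, d))
y_test = np.sin(x_test @ np.ones(d))

for p in (5, 20, 38, 40, 42, 80, 400):
    mse = random_feature_fit(p, x_train, y_train, x_test, y_test, d)
    print(f"{p:4d} features: test MSE {mse:.3f}")

# Test error typically spikes when the feature count is near the number of training
# examples (the interpolation threshold) and comes back down as the model keeps growing,
# which is the "second descent".
```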