i don’t think this metaphor makes things less confusing. i think the best way to understand this is just to directly understand the scaling laws. the chinchilla paper is an excellent reference (its main contribution is essentially to fix a bug in the kaplan paper; people have known since time immemorial that the amount of data obviously matters in addition to model size).
(N is parameter count, D is dataset size, L is loss)
for a fixed model size, there is a best achievable loss even with infinite data: past that point the model has no capacity left to absorb more training. likewise, for a fixed dataset size, there is a best achievable loss even with an infinite model: you have extracted all the signal the data contains. for a given training compute budget, there is an optimal tradeoff between dataset size and model size that achieves the best loss.
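a minimal sketch of this, assuming the parametric form the paper fits, L(N, D) = E + A/N^alpha + B/D^beta, with constants roughly approximating the paper's fitted values (from memory; the exact numbers vary by fitting approach):

```python
import math

# rough constants approximating the chinchilla paper's fitted values
# (illustrative only; exact fits differ across the paper's approaches)
E, A, B = 1.69, 406.4, 410.7   # irreducible loss and scale coefficients
alpha, beta = 0.34, 0.28       # exponents for parameters (N) and data (D)

def loss(N, D):
    """Parametric loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / N**alpha + B / D**beta

# the two limits described above:
print(loss(70e9, math.inf))    # best loss for a 70B model with infinite data
print(loss(math.inf, 1.4e12))  # best loss for 1.4T tokens with an infinite model

# compute-optimal allocation under the usual C ~= 6 * N * D approximation:
# setting dL/dN = 0 along the constraint gives
# N_opt proportional to C**(beta / (alpha + beta)).
def optimal_split(C):
    k = (alpha * A / (beta * B * 6**beta)) ** (1 / (alpha + beta))
    N_opt = k * C ** (beta / (alpha + beta))
    return N_opt, C / (6 * N_opt)   # (parameters, tokens)

N_opt, D_opt = optimal_split(5.76e23)  # roughly gopher-scale compute
print(f"N ~ {N_opt:.2e} params, D ~ {D_opt:.2e} tokens")
```

with these constants the optimal split scales roughly as N ∝ C^0.45, i.e. parameters and tokens should grow together as compute grows, which is the headline result.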
I think Chinchilla provides a useful perspective for thinking about neural networks (it certainly turned my understanding on its head when it was published), but it is not the be-all and end-all of understanding neural network scaling.
The Chinchilla scaling laws are fairly specific to the supervised/self-supervised learning setup. As you mentioned, the key insight is that with a finite dataset, there’s a point where adding more parameters doesn’t help because you’ve extracted all the learnable signal from the data (and vice versa for a fixed model size).
However, RL breaks that fixed-dataset assumption. For example, on-policy methods have a constantly shifting data distribution, so the concept of “dataset size” doesn’t really apply.
There certainly are scaling laws for RL; they just aren’t the ones presented in the Chinchilla paper. The intuition that compute allocation matters and that different resources can bottleneck each other carries over, but the specifics can differ quite significantly.
And then there are evolutionary methods.
Personally, I find that the “parameters as pixels” analogy captures a more general intuition.