Thanks for producing these summaries of such interesting works!
About Chinchilla, you said:
This paper performs a more systematic study than the original paper and finds that existing models are significantly overtrained.
I think this is supposed to say “undertrained”, right?
Also, I’ve personally shortened my AI timelines in light of the Chinchilla paper because their results strongly imply that human brains are also undertrained (on text, at least). I’d already suspected as much from previous scaling laws, but the Chinchilla paper confirmed it for me. Thus, I think an AI with substantially lower parameter counts than the brain will be able to match human performance if it’s trained in a compute-optimal manner.
Conversely, it suggests we may be able to increase human intelligence by “training” on substantially more text data. I’ve speculated on some ways to do that.
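To make the shape of that argument concrete, here is a rough back-of-the-envelope sketch. Everything in it is an illustrative assumption on my part rather than a figure from the thread or the papers: synapse count stands in loosely for parameter count, the lifetime text exposure is a very rough guess, and the ~20 tokens-per-parameter ratio is just the commonly cited Chinchilla rule of thumb.

```python
# Back-of-the-envelope sketch of the "brains are undertrained on text" argument.
# All numbers below are order-of-magnitude placeholders, not claims from the
# thread or from the Chinchilla paper itself.

TOKENS_PER_PARAM = 20        # commonly cited compute-optimal ratio from Chinchilla
BRAIN_PARAMS = 1e14          # assumption: synapse count as a loose "parameter" stand-in
LIFETIME_TEXT_TOKENS = 1e9   # assumption: very rough lifetime exposure to text

compute_optimal_tokens = TOKENS_PER_PARAM * BRAIN_PARAMS

print(f"compute-optimal text for a brain-sized model: {compute_optimal_tokens:.0e} tokens")
print(f"rough lifetime human text exposure:           {LIFETIME_TEXT_TOKENS:.0e} tokens")
print(f"implied undertraining factor (on text):       "
      f"{compute_optimal_tokens / LIFETIME_TEXT_TOKENS:.0e}x")

# If the gap is anything like this large, a model with far fewer parameters than
# the brain, trained compute-optimally on abundant text, could plausibly reach
# the same loss, which is the anchoring move described above.
```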
Yeah I think that’s another reasonable way to update on timelines. Here you are anchoring biological scaling laws on artificial scaling laws, rather than anchoring artificial parameters on biological parameters and leaving the scaling laws as a free variable (as done by the existing model).
One major counterargument would be “biological learning algorithms are better than artificial ones and can learn faster and so have better scaling laws”.
Separately, you can get some a priori support for “the human brain is undertrained relative to the compute-optimal scaling law” if you think that, for evolution, scaling up data by 2x is a higher cost than scaling up brain size by 2x. (For neural networks the two are equally costly, if you look only at training compute.) This seems pretty plausible: having twice as long a childhood makes it way more likely that you die before you ever reproduce, while having twice the brain size imposes higher metabolic costs, and plausibly the former is a lot more costly on the margin.
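In symbols, the parenthetical is leaning on the standard approximation for dense-model training compute (an assumption I'm adding here, not something stated in the thread):

```latex
% Standard approximation for dense-model training compute:
% C is roughly 6*N*D, for N parameters and D training tokens.
C \approx 6ND
\qquad\Longrightarrow\qquad
C(2N,\,D) \;=\; C(N,\,2D) \;=\; 2\,C(N,\,D).
% Doubling parameters and doubling data cost the same training compute,
% whereas for evolution the two scalings carry very different costs.
```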
I think this is supposed to say “undertrained”, right?
Undertraining per parameter is equivalent to overtraining per datum (for fixed compute). So Rohin’s usage makes sense in context, but also I agree with you that the word is confusing :P
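One way to unpack the equivalence: tokens-per-parameter and parameters-per-token are reciprocals, so at a fixed compute budget, sitting below the compute-optimal value of one is exactly sitting above the compute-optimal value of the other. Roughly (my notation, not the thread's):

```latex
% At a fixed compute budget, the two ratios are reciprocals, so the two
% statements pick out the same configurations:
\frac{D}{N} \;<\; \Big(\frac{D}{N}\Big)_{\text{opt}}
\quad\Longleftrightarrow\quad
\frac{N}{D} \;>\; \Big(\frac{N}{D}\Big)_{\text{opt}}
% left: "undertrained per parameter"   right: "overtrained per datum"
```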
This still seems confusing to me. Rohin says that the model is overtrained (not something like “prior approaches overtrained on limited data”), so it seems like he’s talking about the parameters and not the data.
I did just mean undertrained (because I’m ~always using it in the per-parameter sense, which I think is how other people use it too).
Yeah I meant undertrained, I’ve fixed it now.