Gradient descent is still a form of search, and what matters most is the total search volume. In the overparameterized regime (which ANNs are now entering and BNNs swim in), performance, assuming it is not limited by data quality, is roughly predicted by model size × training time. It doesn’t matter greatly whether you train a model twice as large for half as long or vice versa—in either case it’s the total search volume that matters, because in the overparameterized regime you are searching for needles in the circuit-space haystack.
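A minimal sketch of this claim, assuming (purely for illustration, not a fitted law) that loss follows a single power law in total search volume C = params × tokens; the constants `a` and `alpha` are made up:

```python
# Toy illustration: assume loss depends only on total "search volume"
# C = N * D (params x training tokens) in the overparameterized regime.
def toy_loss(n_params: float, n_tokens: float,
             a: float = 400.0, alpha: float = 0.1) -> float:
    c = n_params * n_tokens   # total search volume
    return a * c ** (-alpha)  # hypothetical power law in C alone

# Doubling model size while halving training tokens leaves C unchanged,
# so under this assumption predicted loss is identical:
base = toy_loss(1e9, 2e10)     # 1B params, 20B tokens
swapped = toy_loss(2e9, 1e10)  # 2B params, 10B tokens
assert abs(base - swapped) < 1e-12
```

Real scaling laws have separate terms in N and D, so the equivalence only holds approximately and only well inside the overparameterized regime.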
However, human intelligence (at the high end) is, to a first and second approximation, simply learning speed and thus data efficiency. Even if the smaller brain/model trained for much longer has equivalent capability now, the larger model/brain still learns faster given the same new data, and is thus more intelligent in the sense most relevant to human-level AGI. We can scale compute far more easily than we can scale high-quality training data.
It’s dangerous to infer much from the ‘Chinchilla scaling laws’: humans exceed NLM performance on downstream tasks using only a few billion token-equivalents, i.e. two or more OOM less data. These internet-scale datasets are mostly garbage. Human brains are curriculum-trained on a much higher-quality, quality-sorted multimodal dataset, which almost certainly has very different scaling than the random/unsorted order used in Chinchilla. A vastly larger mind/model could probably learn as well using even an OOM less data.
The only real conclusion from Chinchilla scaling is that for that particular species of transformer NLM, trained on that particular internet-scale dataset, the optimal token/param ratio is about 20x. But that doesn’t mean you’d get the same scaling curve, or the same optimal token/param ratio, for a different architecture on a different dataset with different curation.
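For concreteness, here is how that ratio cashes out into an allocation, a sketch assuming the commonly cited Chinchilla heuristics of roughly 20 tokens per parameter and training FLOPs C ≈ 6·N·D (both specific to that architecture and dataset, which is the point of the paragraph above):

```python
import math

# Chinchilla-style heuristics (assumptions specific to that setup):
RATIO = 20.0           # tokens per parameter
FLOPS_PER_TOKEN = 6.0  # training FLOPs ~ 6 * N per token

def compute_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOPs budget into (params, tokens) under D = RATIO * N.

    C = 6 * N * D = 6 * RATIO * N^2  =>  N = sqrt(C / (6 * RATIO))
    """
    n = math.sqrt(flops / (FLOPS_PER_TOKEN * RATIO))
    d = RATIO * n
    return n, d

n, d = compute_optimal(5.76e23)  # a Chinchilla-scale budget
# n ~ 7e10 params, d ~ 1.4e12 tokens under these assumptions
```

Change `RATIO` (or the loss curve behind it) and the whole allocation shifts, which is why the 20x figure shouldn’t be carried over to a different architecture or a curated dataset.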