I liked your post a lot, particularly section 5a, and strongly upvoted it. However, today I read something of a counter-essay by Dwarkesh Patel: https://www.dwarkesh.com/p/the-sample-efficiency-black-hole It's a good piece, and I would recommend it to all readers of this blog post.
In particular, the counterargument I found strongest was about scaling:
The way the scaling law equations work is that parameter and data terms are added to the loss independently. If you have a model that is trained compute optimally, and suppose you ask, well what if I just wanna maximize sample efficiency and use less data—and I’ll throw in as many parameters as it takes to make that happen. With the constants from the Chinchilla scaling laws paper (and the nature of the result wouldn’t change even with different constants), even if you increased the number of parameters by infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. Scaling of current models simply can’t make up for that discrepancy. This really does suggest that humans are on a different scaling curve altogether.
I tried to refute that with some Fermi estimates but ran into difficulties converting between bytes and tokens. Mathematically, the logic is sound. I wonder what you would answer?
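For anyone who wants to check the arithmetic in the quoted passage, here is a rough sketch. It assumes the parametric loss form L(N, D) = E + A/N^alpha + B/D^beta and the fitted constants as I recall them from the Chinchilla paper (Hoffmann et al. 2022); the function names are my own, and different fits would shift the numbers somewhat.

```python
# Assumed constants: the parametric fit reported in the Chinchilla paper.
# Treat these as illustrative inputs, not as authoritative values.
A, B = 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def data_reduction_factor(n_params: float, n_tokens: float) -> float:
    """Factor by which training data could shrink at equal loss
    if the parameter count were taken to infinity."""
    param_term = A / n_params ** ALPHA
    data_term = B / n_tokens ** BETA
    # With N -> infinity the parameter term vanishes, so the new data term
    # alone must match the old total: B / D_new**BETA = param_term + data_term.
    # Solving for D_old / D_new gives the expression below.
    return (1.0 + param_term / data_term) ** (1.0 / BETA)


def compute_optimal_factor() -> float:
    """Same factor, evaluated at the compute-optimal point of this loss form.

    Minimising the loss at fixed N * D gives
    ALPHA * param_term = BETA * data_term, so param_term / data_term
    equals BETA / ALPHA regardless of scale.
    """
    return (1.0 + BETA / ALPHA) ** (1.0 / BETA)


if __name__ == "__main__":
    print(f"compute-optimal model: {compute_optimal_factor():.1f}x less data")
    print(f"70B params, 1.4T tokens: {data_reduction_factor(70e9, 1.4e12):.1f}x less data")
```

Under these assumptions the factor at the compute-optimal point comes out in the high single digits, which is in line with the "~10" in the quote, and it does not depend on model scale because only the ratio of the two exponents enters.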
Another section worth noting argues that even if the prior is doing most of the work, the marginal sample efficiency of AIs is also very bad, because they need roughly 100x more marginal data than humans:
Even if it were the case that we can explain away the trillions of tokens required to pretrain a base model as catching up to evolution, it doesn’t explain why the marginal capabilities take so much data—once you have been educated, you don’t need 100 different professors to learn a new programming language, but the AIs (even once pretrained) do.
I have not seen any research backing this claim up, have you?