What is it, exactly, that you want GPT3 to learn from YouTube videos?
If you doubled the data GPT3 has access to while keeping the quality of the data the same, that would be something. It would give you real progress.
However, if you doubled the data GPT3 has access to, but the new data contained 0 new code and 0 new math and 0 new medical facts, then surely the new version of GPT3 will not improve at coding, at math, or at medicine. Sure, GPT3 also needs to learn to read English very well before it can learn the math, coding, etc., but it already knows English! Its understanding of English won’t actually improve with YouTube transcripts; it is already near-perfect. All GPT3 will improve at is predicting the type of text that is found on YouTube.
So: how much code and math do you expect to find on YouTube transcripts? How many technical medical facts?
Actually, the situation is even worse than this. Transcription will fail exactly on the code, math, and technical jargon that GPT3 is unfamiliar with, because automatic transcription itself relies on a language model to pick between candidate words. Exactly when you need to learn something from the next token, the transcription of that next token will be wrong.
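To make this concrete, here is a toy sketch of why an ASR system mangles rare technical terms. The scores are entirely made up, and the decoder is a cartoon of real beam-search decoding, but the mechanism is the real one: the decoder combines acoustic evidence with a language-model prior, and the prior penalizes rare tokens.

```python
# Toy illustration (made-up log-probabilities, not a real ASR system).
# The speaker actually said the technical term "eigenvalue".
candidates = {
    "eigenvalue": {"acoustic": -1.0, "lm": -9.0},   # acoustically best, but rare in the LM's text
    "I can value": {"acoustic": -1.2, "lm": -3.0},  # common everyday phrase
}

def decode(cands, lm_weight=1.0):
    # Pick the candidate maximizing acoustic log-prob + weighted LM log-prob,
    # which is how LM-fused ASR decoding scores hypotheses.
    return max(cands, key=lambda w: cands[w]["acoustic"] + lm_weight * cands[w]["lm"])

print(decode(candidates))                 # the LM prior overrides the correct acoustic evidence
print(decode(candidates, lm_weight=0.0))  # without the prior, the technical term wins
```

So the transcript systematically replaces exactly the tokens that would have been informative with more ordinary-looking text.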
People here focus too much on “data”. What you need is not data, it’s high-quality data. If you want GPT to be good at math, you need more math data; if you want it to be good at poetry, more poetry data. And sure, if you really want GPT to be good at phone calls, give it phone call data. (But why?)
Chinchilla is misleading you here, because when they scaled up their data, they first shuffled it: training on (say) 10% of the data also meant training on 10% of the math data, 10% of the code data, and so on.
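The shuffling point can be sketched in a few lines. This is a toy corpus with invented domain labels and proportions, not Chinchilla's actual data mix; it just shows that a random 10% slice of a shuffled corpus inherits roughly the same domain proportions as the whole.

```python
import random

# Toy labeled corpus: 20% "math", 30% "code", 50% "web" (made-up mix).
corpus = ["math"] * 400 + ["code"] * 600 + ["web"] * 1000

random.seed(0)
random.shuffle(corpus)

# Train on the first 10% of the shuffled corpus.
subset = corpus[: len(corpus) // 10]

# Domain proportions in the 10% slice track the full corpus.
for domain in ("math", "code", "web"):
    print(domain, round(subset.count(domain) / len(subset), 2))
```

Scaling data this way scales every domain together, which is exactly what doubling GPT3's data with YouTube transcripts would not do.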