The usual scaling laws are about IID samples from a fixed data distribution, so they don’t capture this kind of effect.
Doesn’t this seem like a key flaw in the usual scaling laws? Why haven’t I seen this discussed more? The OP did mention declining average data quality but didn’t emphasize it much. This 2023 post trying to forecast AI timelines based on scaling laws did not mention the issue at all, and I received no response when I made this point in its comments section.
Even if it were true that the additional data literally “contained no new ideas/knowledge” relative to the earlier data, its inclusion would still boost the total occurrence count of the rarest “ideas” – the ones which are still so infrequent that the LLM’s acquisition of them is constrained by their rarity, and which the LLM becomes meaningfully stronger at modeling when more occurrences are supplied to it.
I guess this is related to the fact that LLMs are very data-inefficient relative to humans, which implies that an LLM needs to be trained on each idea or piece of knowledge multiple times, in multiple forms, before it “learns” it. It’s still hard for me to understand this on an intuitive level, but I guess if we did understand it, the problem of data inefficiency would be close to solved, and we’d be much closer to AGI.
LeCun has written about this. Humans are already pretrained on large amounts of sensory data before they learn language, while language models are trained from scratch on language. The current pretraining paradigm only works well with text, since text is relatively low-dimensional (e.g. a vocabulary of 2^16 = 65,536 tokens for a 16-bit tokenizer), but not with audio or video, where the dimensionality explodes. Predicting a video frame is much harder than predicting a text token, as the former is orders of magnitude larger.
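To make the size gap concrete (the numbers below are my own illustrative assumptions, not figures from the discussion): a text token is one choice out of a finite vocabulary, while even a tiny video frame lives in a space whose number of possible values is astronomical.

```python
# Back-of-the-envelope comparison of the two prediction output spaces.
# All concrete sizes here are illustrative assumptions.

vocab_size = 2 ** 16              # 16-bit tokenizer: one of 65,536 discrete tokens
frame_dims = 64 * 64 * 3          # even a small 64x64 RGB frame has 12,288 dimensions
frame_values = 256 ** frame_dims  # with 8-bit channels: count of distinct frames

print(vocab_size)                 # 65536 outcomes -- a softmax can enumerate these
print(len(str(frame_values)))     # a number with roughly 29,600 digits -- no layer
                                  # can assign an explicit score to each outcome
```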
From a blog post:

To better understand this challenge, we first need to understand the prediction uncertainty and the way it’s modeled in NLP [Natural Language Processing] compared with CV [Computer Vision]. In NLP, predicting the missing words involves computing a prediction score for every possible word in the vocabulary. While the vocabulary itself is large and predicting a missing word involves some uncertainty, it’s possible to produce a list of all the possible words in the vocabulary together with a probability estimate of the words’ appearance at that location. Typical machine learning systems do so by treating the prediction problem as a classification problem and computing scores for each outcome using a giant so-called softmax layer, which transforms raw scores into a probability distribution over words. With this technique, the uncertainty of the prediction is represented by a probability distribution over all possible outcomes, provided that there is a finite number of possible outcomes.
In CV, on the other hand, the analogous task of predicting “missing” frames in a video, missing patches in an image, or missing segment in a speech signal involves a prediction of high-dimensional continuous objects rather than discrete outcomes. There are an infinite number of possible video frames that can plausibly follow a given video clip. It is not possible to explicitly represent all the possible video frames and associate a prediction score to them. In fact, we may never have techniques to represent suitable probability distributions over high-dimensional continuous spaces, such as the set of all possible video frames.
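The softmax step described in the quote can be sketched in a few lines (with a toy made-up vocabulary, not any real model's): raw scores over a finite vocabulary are exponentiated and normalized into a probability distribution.

```python
import numpy as np

vocab = ["cat", "dog", "sat", "mat"]      # toy vocabulary (made up)
logits = np.array([2.0, 0.5, 1.0, -1.0])  # raw prediction scores, one per word

# Softmax: subtract the max for numerical stability, exponentiate, normalize.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word}: {p:.3f}")
print(probs.sum())  # sums to 1 (up to float rounding): a full distribution
                    # over the finite vocabulary
```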
This seems like an intractable problem.
LeCun says that humans or animals, when doing “predictive coding”, predict mainly latent embeddings rather than precise sensory data. It’s currently not clear how this can be done efficiently with machine learning.
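A minimal sketch of that idea (toy code under my own assumptions — random linear maps standing in for trained networks, nothing from LeCun's actual architectures): rather than regressing every pixel of the next frame, both frames are mapped into a small embedding space and the prediction loss is computed there, so the encoder is free to discard unpredictable detail.

```python
import numpy as np

rng = np.random.default_rng(0)

frame_dim, embed_dim = 64 * 64 * 3, 32  # flattened toy frame vs. small latent

# Toy stand-ins for learned networks: a random linear "encoder" and a
# random linear "predictor" that operates purely in embedding space.
W_enc = rng.normal(0.0, frame_dim ** -0.5, (embed_dim, frame_dim))
W_pred = rng.normal(0.0, embed_dim ** -0.5, (embed_dim, embed_dim))

def encode(frame):
    return np.tanh(W_enc @ frame)

frame_t = rng.normal(size=frame_dim)   # current frame (fake data)
frame_t1 = rng.normal(size=frame_dim)  # next frame (fake data)

# Pixel-space target: all 12,288 numbers must be predicted.
pixel_target = frame_t1

# Latent-space target: only a 32-dimensional embedding must be predicted.
z_pred = W_pred @ encode(frame_t)
latent_loss = np.mean((encode(frame_t1) - z_pred) ** 2)

print(pixel_target.shape)  # (12288,)
print(z_pred.shape)        # (32,)
```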