I guess this is related to the fact that LLMs are very data-inefficient relative to humans, which implies that an LLM needs to be trained on each idea or piece of knowledge multiple times, in multiple forms, before it “learns” it. It’s still hard for me to understand this on an intuitive level, but I guess if we did understand it, the problem of data inefficiency would be close to being solved, and we’d be much closer to AGI.
LeCun has written about this. Humans are already pretrained on large amounts of sensory data before they learn language, while language models are trained from scratch on language. The current pretraining paradigm only works well with text, because each prediction target is a single token drawn from a fairly small discrete vocabulary (e.g. 2^16 = 65,536 possible tokens for a tokenizer with 16-bit token IDs). It does not work as well with audio or video, where the dimensionality explodes. Predicting a video frame is much harder than predicting a text token, as the former is orders of magnitude larger.
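To put rough numbers on "orders of magnitude larger", here is a back-of-the-envelope sketch. The 16-bit token size comes from the comment above; the frame resolution (1920x1080, 3 colour channels, 8 bits per channel) is purely an assumed example, and real video models usually work on compressed or downsampled inputs.

```python
# Rough size comparison: one text token vs. one raw video frame.
# The frame format below is an assumption for illustration only.
import math

VOCAB_BITS = 16                 # 16-bit token IDs, as in the comment
vocab_size = 2 ** VOCAB_BITS    # number of possible next tokens

# Assumed frame: 1920x1080 pixels, 3 colour channels, 8 bits per channel.
width, height, channels, bits_per_channel = 1920, 1080, 3, 8
frame_bits = width * height * channels * bits_per_channel

# Number of distinct outcomes, expressed as a power of ten.
token_outcomes_log10 = VOCAB_BITS * math.log10(2)
frame_outcomes_log10 = frame_bits * math.log10(2)

print(f"next-token outcomes: {vocab_size:,} (about 10^{token_outcomes_log10:.1f})")
print(f"bits per raw frame:  {frame_bits:,}")
print(f"next-frame outcomes: about 10^{frame_outcomes_log10:,.0f}")
print(f"frame / token size ratio: {frame_bits // VOCAB_BITS:,}x")
```

Under these assumptions a single raw frame carries a few million times more bits than a single token, and the number of distinct possible frames is far too large to enumerate.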
From a blog post:
To better understand this challenge, we first need to understand the prediction uncertainty and the way it’s modeled in NLP [Natural Language Processing] compared with CV [Computer Vision]. In NLP, predicting the missing words involves computing a prediction score for every possible word in the vocabulary. While the vocabulary itself is large and predicting a missing word involves some uncertainty, it’s possible to produce a list of all the possible words in the vocabulary together with a probability estimate of the words’ appearance at that location. Typical machine learning systems do so by treating the prediction problem as a classification problem and computing scores for each outcome using a giant so-called softmax layer, which transforms raw scores into a probability distribution over words. With this technique, the uncertainty of the prediction is represented by a probability distribution over all possible outcomes, provided that there is a finite number of possible outcomes.
In CV, on the other hand, the analogous task of predicting “missing” frames in a video, missing patches in an image, or missing segment in a speech signal involves a prediction of high-dimensional continuous objects rather than discrete outcomes. There are an infinite number of possible video frames that can plausibly follow a given video clip. It is not possible to explicitly represent all the possible video frames and associate a prediction score to them. In fact, we may never have techniques to represent suitable probability distributions over high-dimensional continuous spaces, such as the set of all possible video frames.
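A small sketch of the contrast described in the quoted passage, as I understand it. The vocabulary, the scores, and the one-dimensional "frame" are all made up for illustration. The first half shows a softmax representing uncertainty over a finite set of words. The second half shows one common failure mode in the continuous case: a model that outputs a single value and is scored by squared error ends up predicting the average of the plausible outcomes, which is none of them.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Discrete case: a toy 4-word vocabulary with made-up scores for the blank
# in "the cat sat on the ___". Two continuations are equally plausible, and
# the distribution can say so explicitly.
vocab = ["mat", "sofa", "moon", "idea"]
scores = [2.0, 2.0, -1.0, -3.0]
probs = softmax(scores)
for word, p in zip(vocab, probs):
    print(f"{word:>5}: {p:.3f}")

# Continuous case: a toy 1-D "frame" that is either dark (0.0) or bright (1.0)
# with equal probability.
random.seed(0)
targets = [random.choice([0.0, 1.0]) for _ in range(2000)]

def mse(prediction, ys):
    """Mean squared error of one fixed prediction against all targets."""
    return sum((prediction - y) ** 2 for y in ys) / len(ys)

# Grid search for the single output value that minimises squared error.
candidates = [i / 100 for i in range(101)]
best = min(candidates, key=lambda c: mse(c, targets))
print(f"MSE-optimal single prediction: {best:.2f}")
```

The best single prediction lands near 0.5, a grey frame that never actually occurs. In image space the same averaging effect shows up as blur, which is one way to see why an explicit distribution over outcomes matters.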
This seems like an intractable problem.
LeCun says that humans or animals, when doing “predictive coding”, predict mainly latent embeddings rather than precise sensory data. It’s currently not clear how this can be done efficiently with machine learning.
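To make "predict latent embeddings rather than precise sensory data" a bit more concrete, here is a toy sketch. Everything in it is an assumption for illustration: the frame size, the embedding size, the `encode` function (a fixed random projection standing in for a learned encoder), and `predict_next_latent` (a trivial "nothing changes" predictor). It is not LeCun's method, only the shape of the idea: the loss is computed between embeddings instead of between pixels.

```python
import random

random.seed(0)
FRAME_DIM = 3072   # assumed toy frame: 32 x 32 pixels x 3 channels
LATENT_DIM = 16    # assumed embedding size

# Hypothetical encoder: a fixed random linear projection.
# A real system would learn this mapping.
ENCODER = [[random.gauss(0, 1) / FRAME_DIM ** 0.5 for _ in range(FRAME_DIM)]
           for _ in range(LATENT_DIM)]

def encode(frame):
    """Map a flat list of pixel values to a short embedding vector."""
    return [sum(w * x for w, x in zip(row, frame)) for row in ENCODER]

def predict_next_latent(latent):
    """Hypothetical predictor; here just the trivial guess 'nothing changes'."""
    return list(latent)

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

frame_t = [random.random() for _ in range(FRAME_DIM)]
# Assumed next frame: the same frame plus small per-pixel noise.
frame_next = [x + random.gauss(0, 0.05) for x in frame_t]

# Pixel-space objective: compares all 3072 raw values.
pixel_loss = squared_error(frame_t, frame_next)

# Latent-space objective: compares only the 16 embedding values.
latent_loss = squared_error(predict_next_latent(encode(frame_t)),
                            encode(frame_next))

print(f"values compared in pixel space:  {FRAME_DIM}")
print(f"values compared in latent space: {LATENT_DIM}")
print(f"pixel loss:  {pixel_loss:.4f}")
print(f"latent loss: {latent_loss:.4f}")
```

Note that an encoder that maps every frame to the same vector would drive this latent loss to zero without learning anything, which is one reason doing this well is not straightforward.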