Anything that can be measured can be predicted, but the inverse is also true: whatever can’t be measured is necessarily excluded. A model that is trained to predict images recorded by digital cameras likely learns to predict what images will be recorded by digital cameras – not the underlying reality. If the model believes that the device recording a situation will be hacked to show a different outcome, then the correct prediction for it to make is that false reading.
LeCun expects that future models which do self-supervised learning on sensory data rather than text won’t predict the sensory data directly, but only an embedding of it. He calls this the Joint Embedding Predictive Architecture (JEPA). The reason is that, unlike text in LLMs, sensory data is much higher-dimensional and contains a large amount of unpredictable noise and redundancy, which makes the usual token-prediction approach infeasible.
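The core idea can be illustrated with a toy sketch (this is a hypothetical illustration of prediction in embedding space, not LeCun’s actual architecture; the "learned" encoder is hand-coded for clarity): when part of each observation is pure sensor noise, a pixel-space predictor is forced to pay a loss for noise it cannot predict, while an embedding-space predictor can drive its loss to zero by having the encoder discard the noise dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: each observation is 8 predictable values followed by
# 8 values of fresh, unpredictable sensor noise.
def observe(state):
    return np.concatenate([state, rng.normal(size=8)])

def encode(obs):
    # Hypothetical "learned" encoder: it has discovered that only the
    # first 8 coordinates are predictable, so the embedding drops the
    # noise dimensions entirely.
    return obs[:8]

state = rng.normal(size=8)
x = observe(state)               # context observation
next_state = np.roll(state, 1)   # deterministic dynamics on the signal part
y = observe(next_state)          # target observation (includes new noise)

# JEPA-style predictor: operates purely in embedding space.
predicted_embedding = np.roll(encode(x), 1)
embedding_loss = np.mean((predicted_embedding - encode(y)) ** 2)

# Pixel-space predictor: even the best possible guess (correct signal,
# zero for the noise, which is its mean) pays for the noise in full.
best_pixel_guess = np.concatenate([np.roll(x[:8], 1), np.zeros(8)])
pixel_loss = np.mean((best_pixel_guess - y) ** 2)

print(f"pixel-space loss:     {pixel_loss:.3f}")   # strictly positive
print(f"embedding-space loss: {embedding_loss:.3f}")  # exactly zero here
```

In this toy setting the embedding loss is exactly zero while the pixel loss stays bounded away from zero, which is the (simplified) motivation for predicting embeddings rather than raw sensory data.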
If that is right, future predictive models won’t try to predict the exact images from a camera anyway. What they predict may be closer to a belief in humans. The challenge is then presumably to translate their predictions from representation space back into something humans can understand, like text, in a way that doesn’t incentivize deception. Such a translation may well mention things like the camera being hacked.