Sure, but that doesn’t achieve a good compression capability, and LLMs are trained as universal compressors/predictors (i.e., they are trained to predict, subject to regularization entropy constraints).
This is a reason why it makes sense for LLMs to develop world models, but it doesn’t prove that an individual LLM uses a world model to answer the questions you ask it.
How much of a ‘good compression capability’ have LLMs achieved?
I.e., how is the metric defined, and how reliable are the figures?
The equivalence between compression and prediction (compression requires a predictive model), together with the fact that LLMs are the best known general predictors, implies that they are the best known general compressors[1]. Memorization does not generalize.
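The equivalence is concrete: paired with an arithmetic coder, any probabilistic model compresses a sequence to roughly -log2 p(symbol) bits per observed symbol, so a better predictor directly yields a shorter code. A minimal sketch (the per-token probabilities below are made up for illustration, not real model outputs):

```python
import math

def ideal_code_length_bits(probs):
    """Total bits an ideal (arithmetic) coder uses, given the
    probability the model assigned to each symbol actually observed."""
    return sum(-math.log2(p) for p in probs)

# Hypothetical probabilities two models assign to the same four tokens.
sharp_model = [0.9, 0.8, 0.95, 0.85]    # confident, accurate predictor
flat_model  = [0.25, 0.25, 0.25, 0.25]  # uniform guess over 4 options

print(ideal_code_length_bits(sharp_model))  # ≈ 0.78 bits
print(ideal_code_length_bits(flat_model))   # 8.0 bits (2 bits/token)
```

The same data costs roughly 10x fewer bits under the sharper predictor, which is the sense in which prediction quality and compression ratio are the same quantity.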
One point of common confusion is the large size of trained LLMs, but that is actually irrelevant. An ideal Solomonoff inductor would have infinite size and perfect generalization: it is an ensemble distribution over entropy-constrained models, not a single entropy-constrained model, so the MDL principle applies only to each of the (infinitely many) submodels, not to the whole ensemble.
The same applies to LLMs and the brain: like all highly capable general predictors, they are some approximation of Bayesian ensembles. However, there is a good way to measure the total compression: measure it throughout the entire training process, so that the only complexity penalty is that of the initial architecture prior (which is tiny).
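Measuring compression "throughout the entire training process" is the prequential (online) code length: each symbol is priced with the model's prediction as it stood before training on that symbol, and the model then updates. A toy sketch, with a Laplace-smoothed counting learner standing in for an LLM (the learner and data stream are illustrative assumptions, not the setup from [1]):

```python
import math
from collections import Counter

def prequential_bits(stream, alphabet):
    """Online (prequential) code length: code each symbol with the model
    before it sees that symbol, then update. The total is the compressed
    size of the whole stream; the only fixed overhead is the (tiny)
    description of the learner itself, the 'architecture prior'."""
    counts = Counter()
    total_bits = 0.0
    for sym in stream:
        # Laplace-smoothed predictive probability from counts so far.
        p = (counts[sym] + 1) / (sum(counts.values()) + len(alphabet))
        total_bits += -math.log2(p)
        counts[sym] += 1  # the "training step"
    return total_bits

print(prequential_bits("aaaaabaaaa", "ab"))
# ≈ 6.8 bits, versus 10 bits for a uniform 1-bit-per-symbol code
```

The final learned model never needs to be counted against the code length, because the receiver can rerun the same training procedure; only the initial prior is charged. This is why a multi-billion-parameter model can still be a genuine compressor.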
[1] https://arxiv.org/abs/2309.10668