The compression-prediction equivalence (compression requires a predictive model), together with the fact that LLMs are the best known general predictors, implies that they are the best known general compressors[1]. Memorization, by contrast, does not generalize.
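The equivalence can be made concrete: an ideal arithmetic coder spends exactly -log2 p(symbol) bits per symbol under its predictive model, so better prediction is better compression. A minimal sketch, with hypothetical toy models:

```python
import math

def codelength_bits(seq, model):
    """Bits an ideal arithmetic coder needs to encode seq, where
    `model` maps each symbol to its predicted probability."""
    return sum(-math.log2(model[s]) for s in seq)

seq = "a" * 90 + "b" * 10
good = {"a": 0.9, "b": 0.1}      # predictor matching the data statistics
uniform = {"a": 0.5, "b": 0.5}   # predictor with no skill

print(codelength_bits(seq, good))     # ~46.9 bits
print(codelength_bits(seq, uniform))  # exactly 100 bits (1 bit/symbol)
```

The better predictor halves the codelength; a model that merely memorized `seq` would compress this string but nothing else.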
One point of common confusion is the large size of trained LLMs, but that size is actually irrelevant. An ideal Solomonoff inductor would have infinite size and perfect generalization: it is an ensemble distribution over entropy-constrained models, not a single entropy-constrained model, so the MDL principle applies only to each of the (infinitely many) submodels, not to the whole ensemble.
The same applies to LLMs and the brain: like all highly capable general predictors, they are some approximation of a Bayesian ensemble. There is, however, a good way to measure total compression: measure it throughout the entire training process, so that the only complexity penalty is that of the initial architecture prior (which is tiny).
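Measuring compression throughout training is the prequential (online) coding idea: each example is coded under the model *before* the model updates on it, so the total codelength already pays for everything the model learned, and only the fixed prior comes for free. A minimal sketch with an assumed Laplace-smoothed unigram "model" standing in for a trained predictor:

```python
import math
import random

def prequential_codelength(seq, alphabet="ab"):
    """Total bits to code seq online: predict each symbol with the
    current model, add its -log2 probability, then train on it."""
    counts = {c: 1 for c in alphabet}  # prior pseudo-counts (the "architecture prior")
    total = 0.0
    for sym in seq:
        z = sum(counts.values())
        total += -math.log2(counts[sym] / z)  # code under the current model
        counts[sym] += 1                      # then update on the symbol
    return total

random.seed(0)
data = "".join(random.choice("aab") for _ in range(300))  # ~2/3 "a"
bits = prequential_codelength(data)
print(bits, "bits, vs", len(data), "bits for a uniform code")
```

The online codelength beats the 1-bit-per-symbol uniform code even though the model starts out ignorant; the cost of learning is included in the total, leaving no hidden model-size penalty.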
[1] https://arxiv.org/abs/2309.10668