To make the argument sharper, I will argue the following (taken from another comment of mine and posted here to have it in one place): sequences produced by LLMs very quickly become sequences with very low log-probability (compared with other sequences of the same length) under the true distribution of internet text.
Suppose we have a Markov chain $x_n$ with transition probability $p(x_{n+1}|x_n)$; here $p$ is the analogue of the true generating distribution of internet text. From information theory (specifically the Asymptotic Equipartition Property), we know that the probability of a typical long sequence will be $p(x_1,\dots,x_n) \approx \exp(-n H_p(X))$, where $H_p(X)$ is the entropy rate of the process.
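To make the AEP statement concrete, here is a minimal Python sketch. The 2-state chain and its numbers are invented purely for illustration (nothing here models real text): the per-step negative log-probability of sampled paths concentrates around the entropy rate $H_p(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented 2-state chain standing in for the "true" process p.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])  # row i: p(x_{n+1} | x_n = i)

# Stationary distribution: left eigenvector of P for eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Entropy rate H_p(X) = -sum_i pi_i sum_j P_ij log P_ij (nats per step).
H_p = -np.sum(pi[:, None] * P * np.log(P))

for n in [100, 1_000, 10_000]:
    x = rng.choice(2, p=pi)          # start in the stationary distribution
    logp = np.log(pi[x])
    for _ in range(n - 1):
        y = rng.choice(2, p=P[x])
        logp += np.log(P[x, y])
        x = y
    print(f"n={n:6d}: -(1/n) log p = {-logp / n:.4f}   H_p = {H_p:.4f}")
# As n grows, -(1/n) log p(x_1,...,x_n) -> H_p(X), i.e. typical sequences
# have probability close to exp(-n H_p(X)).
```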
Now suppose $q(x_{n+1}|x_n)$ is a different Markov chain (the analogue of the LLM generating text) which differs from $p$ by some amount; say the Kullback-Leibler divergence $D_{KL}(q\|p)$ is non-zero. (This is not quite the objective the networks are trained with; that would be $D_{KL}(p\|q)$ instead.) We can then compute the expected log-probability under $p$ of sequences sampled from $q$:
$$\mathbb{E}_{x\sim q}\left[\log p(x_1,\dots,x_n)\right]=\int q(x_1,\dots,x_n)\,\log p(x_1,\dots,x_n)\;dx_1\dots dx_n$$
$$=\int q(x_1,\dots,x_n)\,\log\frac{p(x_1,\dots,x_n)}{q(x_1,\dots,x_n)}\;dx_1\dots dx_n+\int q(x_1,\dots,x_n)\,\log q(x_1,\dots,x_n)\;dx_1\dots dx_n$$
The second term is just $-nH_q(X)$, minus $n$ times the entropy rate of $q$, and the first term is $-nD_{KL}(q\|p)$, so putting everything together, a typical sequence sampled from $q$ has probability under $p$ of roughly:
$$p(x_1,\dots,x_n)\approx\exp\left(-n\left(D_{KL}(q\|p)+H_q(X)\right)\right)$$
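Again purely as an illustration (the toy chains below are mine, not fitted to anything real): sampling long paths from $q$ and scoring them under $p$ recovers exactly this rate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented toy chains: P plays the true process p, Q the slightly-off model q.
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])
Q = np.array([[0.8, 0.2],
              [0.4, 0.6]])

def stationary(T):
    evals, evecs = np.linalg.eig(T.T)
    v = np.real(evecs[:, np.argmax(np.real(evals))])
    return v / v.sum()

pi_q = stationary(Q)

# Per-step rates, averaged over q's stationary distribution.
H_q  = -np.sum(pi_q[:, None] * Q * np.log(Q))       # entropy rate of q
D_qp =  np.sum(pi_q[:, None] * Q * np.log(Q / P))   # KL rate D_KL(q||p)

# Sample one long path from q, but score it under p.
n = 100_000
x = rng.choice(2, p=pi_q)
logp = np.log(stationary(P)[x])
for _ in range(n - 1):
    y = rng.choice(2, p=Q[x])
    logp += np.log(P[x, y])
    x = y

print(f"empirical -(1/n) log p(path from q): {-logp / n:.4f}")
print(f"D_KL(q||p) + H_q(X)                : {D_qp + H_q:.4f}")
# The two match: under p, q's samples sit at probability
# exp(-n (D_KL(q||p) + H_q(X))), exponentially below p's typical set
# whenever D_KL(q||p) + H_q(X) > H_p(X).
```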
So any difference at all between $H_p(X)$ and $D_{KL}(q\|p)+H_q(X)$ will lead to the probability of almost all sequences sampled from our language model being exponentially squashed relative to the probability of a typical sequence sampled from the original distribution. I can also argue that $H_q(X)$ will be strictly larger than $H_p(X)$: the latter can essentially be viewed as the entropy rate of a perfect LLM with an infinite context window, and since $H(X|Y)\le H(X)$ (conditioning on further information never increases entropy), a model conditioning on less can only have higher entropy. So $D_{KL}(q\|p)+H_q(X)-H_p(X)$ will definitely be positive.
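The conditioning step is easy to check numerically: for any joint distribution, $H(X|Y)\le H(X)$. A throwaway sketch with a random joint table (again, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Random joint distribution over (X, Y) with 4 values each.
joint = rng.random((4, 4))
joint /= joint.sum()

p_x = joint.sum(axis=1)          # marginal of X
p_y = joint.sum(axis=0)          # marginal of Y

H_X  = -np.sum(p_x * np.log(p_x))
H_XY = -np.sum(joint * np.log(joint))
H_Y  = -np.sum(p_y * np.log(p_y))
H_X_given_Y = H_XY - H_Y         # chain rule: H(X|Y) = H(X,Y) - H(Y)

print(f"H(X)   = {H_X:.4f}")
print(f"H(X|Y) = {H_X_given_Y:.4f}")   # always <= H(X); equal iff independent
```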
This means that if you sample long enough from an LLM, and more importantly as the context window increases, it must generalise very far out of distribution to keep giving good outputs. The fundamental problem of behaviour cloning I'm referring to is that we would need examples of how to behave correctly in this very-out-of-distribution regime, but LLMs simply rely on the generalisation ability of transformer networks. Our prior should be that if you don't provide examples of correct outputs within some region of the input space to your function-fitting algorithm, you shouldn't expect the algorithm to yield correct predictions in that region.