Communication channels as an analogy for language model chain of thought
Epistemic status: Highly speculative
In information theory, a very fundamental concept is that of the noisy communication channel. I.e. there is a ‘sender’ who wants to send a message (‘signal’) to a ‘receiver’, but their signal will necessarily get corrupted by ‘noise’ in the process of sending it. There are a bunch of interesting theoretical results that stem from analysing this very basic setup.
Here, I will claim that language model chain of thought is very analogous. The prompt takes the role of the sender, the response takes the role of the receiver, the chain of thought is the channel, and stochastic sampling takes the form of noise. The ‘message’ being sent is some sort of abstract thinking / intent that determines the response.
Now I will attempt to use this analogy to make some predictions about language models. The predictions are most likely wrong. However, I expect them to be wrong in interesting ways (e.g. because they reveal important disanalogies), thus worth falsifying. Having qualified these predictions as such, I’m going to omit the qualifiers below.
Channel capacity. A communication channel has a theoretical maximum ‘capacity’: the highest rate, in bits per use of the channel, at which the receiver can reliably recover information. ==> Reasoning has an intrinsic ‘channel capacity’, i.e. a maximum rate at which information can be communicated.
What determines channel capacity? In the classic setup where we send bits one at a time and each bit is flipped independently with some fixed probability (the binary symmetric channel), the capacity is fully determined by the noise level. ==> The analogous concept is probably the cross-entropy loss used in training. Making this ‘sharper’ or ‘less sharp’ corresponds to implicitly selecting for different noise levels.
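To make "fully determined by the noise level" concrete, here is a minimal sketch of the standard capacity formula for the binary symmetric channel, C = 1 - H(p), where H is the binary entropy of the flip probability p. The function names are illustrative, and nothing here establishes how a language model's loss or sampling settings would map onto p; that mapping is the speculative part of the analogy.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that comes up heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(flip_prob: float) -> float:
    """Capacity, in bits per channel use, of a binary symmetric channel."""
    return 1.0 - binary_entropy(flip_prob)


# Capacity falls from 1 bit (no noise) to 0 bits (pure coin-flip noise).
for p in (0.0, 0.01, 0.1, 0.25, 0.5):
    print(f"flip probability {p:.2f} -> capacity {bsc_capacity(p):.3f} bits per use")
```

Capacity drops quickly at first: even a 10% flip probability costs nearly half a bit per use.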
Noisy channel coding theorem. For any information rate below the capacity, it is possible to design an encoding scheme that achieves this rate with arbitrarily small error probability. ==> Reasoning can reach near-perfect fidelity at any given rate below the capacity.
Corollary: The effect of making reasoning traces longer and more fine-grained might just be to reduce the communication rate (per token) below the channel capacity, such that reasoning can occur with near-zero error rate.
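The direction of the trade in the corollary (fewer message bits per transmitted symbol, lower error) can be illustrated with the simplest possible code: repeat each bit n times and take a majority vote. This is only a sketch. A repetition code is not the kind of code the theorem promises, since its rate 1/n shrinks towards zero rather than staying near capacity, and the 10% flip probability below is an arbitrary illustrative value.

```python
from math import comb


def majority_vote_error(n_repeats: int, flip_prob: float) -> float:
    """Probability that a majority vote over n_repeats noisy copies of one bit is wrong.

    Assumes n_repeats is odd and that each copy is flipped independently.
    """
    if n_repeats % 2 == 0:
        raise ValueError("n_repeats must be odd so the vote cannot tie")
    first_losing_count = n_repeats // 2 + 1  # number of flips that overturns the vote
    return sum(
        comb(n_repeats, k) * flip_prob**k * (1 - flip_prob) ** (n_repeats - k)
        for k in range(first_losing_count, n_repeats + 1)
    )


# Lower rate (more repeats per message bit) buys a lower error probability.
for n in (1, 3, 5, 11, 21):
    print(f"rate 1/{n}: error probability {majority_vote_error(n, 0.1):.6f}")
```

In the analogy, a longer and more fine-grained trace plays the role of the lower-rate code: more tokens are spent per unit of underlying ‘message’.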
Capacity is additive. N independent copies of a channel with capacity C have a total capacity of NC. ==> Multiple independent chains of reasoning can communicate more information than a single one. This is why best-of-N works better than direct sampling. (Caveat: different reasoning traces of the same LM on the same prompt may not qualify as ‘independent’.)
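A rough sketch of both halves of this claim, under strong assumptions that are mine rather than established facts: samples are fully independent, each is correct with the same probability, and a perfect selector picks a correct sample whenever one exists. The specific numbers (0.3 success rate, 0.5 bits of capacity) are hypothetical.

```python
def parallel_capacity(single_capacity: float, n_channels: int) -> float:
    """Total capacity of n independent copies of a channel."""
    return n_channels * single_capacity


def best_of_n_success(per_sample_success: float, n_samples: int) -> float:
    """Chance that at least one of n independent samples is correct."""
    return 1.0 - (1.0 - per_sample_success) ** n_samples


# Capacity grows linearly in N, while best-of-N success saturates towards 1.
for n in (1, 2, 4, 8, 16):
    print(
        f"N={n:2d}: total capacity {parallel_capacity(0.5, n):.1f} bits, "
        f"best-of-N success {best_of_n_success(0.3, n):.3f}"
    )
```

If the traces are correlated, as the caveat suggests they may be, both quantities would be lower than this sketch implies.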
Bottleneck principle. When channels are connected in series, the capacity of the aggregate system is at most the lowest capacity of its components (with equality when each stage can decode and re-encode the message before passing it on). I.e. ‘a chain is only as good as its weakest link’. ==> Reasoning fidelity is upper bounded by the fidelity achievable on the hardest sub-step.
Here we are implicitly assuming that reasoning is fundamentally ‘discrete’, i.e. there are ‘atoms’ of reasoning, which define the most minimal reasoning step you can possibly make. This seems true if we think of reasoning as being like mathematical proofs, which can be broken down into a (finite) series of atomic logical assertions.
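As a sanity check on the bottleneck principle, here is a minimal sketch that chains two binary symmetric channels with no re-encoding in between. A bit arrives flipped only if exactly one stage flips it, so the cascade is itself a binary symmetric channel. The flip probabilities are arbitrary illustrative values. In this special case the end-to-end capacity comes out strictly below that of the weaker link, so the weakest link is best read as an upper bound.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a coin that comes up heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(flip_prob: float) -> float:
    """Capacity, in bits per channel use, of a binary symmetric channel."""
    return 1.0 - binary_entropy(flip_prob)


def cascade_flip_prob(p1: float, p2: float) -> float:
    """Net flip probability of two binary symmetric channels in series."""
    return p1 * (1 - p2) + p2 * (1 - p1)  # exactly one of the two stages flips


p_easy_step, p_hard_step = 0.01, 0.1  # hypothetical noise levels for two steps
weakest_link = min(bsc_capacity(p_easy_step), bsc_capacity(p_hard_step))
end_to_end = bsc_capacity(cascade_flip_prob(p_easy_step, p_hard_step))
print(f"weakest link: {weakest_link:.3f} bits, end-to-end: {end_to_end:.3f} bits")
```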
I think the implications here are worth thinking about more, particularly (2) and (3) above.
Important point: The above analysis considers communication rate per token. However, it’s also important to consider communication rate per unit of computation (e.g. per LM inference). This is relevant for decoding approaches like best-of-N, which use multiple inferences per token.
In this context, the “resample ablation” used in AI control is like adding more noise into the communication channel.