Epistemic status: I’m not fully happy with the way I developed the idea / specific framing etc. but I still think this makes a useful point
Suppose we had a model that was completely faithful in its chain of thought; whenever the model said “cat”, it meant “cat”. Basically, ‘what you see is what you get’.
Is this model still capable of illegible reasoning?
I will argue that yes, it is. I will also argue that this is likely to happen naturally rather than requiring deliberate deception, due to ‘superhuman latent knowledge’.
Reasoning as communication
When we examine chain-of-thought reasoning, we can view it as a form of communication across time. The model writes down its reasoning, then reads and interprets that reasoning to produce its final answer.
Formally, we have the following components:
A question Q
A message M (e.g. a reasoning trace)
An answer A
An entity that maps Q → M, and M → A.
Note that there are two instances of the entity here. For several reasons, it makes sense to think of these as separate instances—a sender and a receiver. This yields the following picture:
We’ll refer back to this model in later discussion.
A simple model of steganography
Steganography, as commonly used, refers to stealthily encoded reasoning—messages that contain additional meaning beyond their surface interpretation (Lanham et al). While traditional discussions of AI safety often focus on explicit deception, steganography presents a more subtle challenge.
Let’s revise the communication model to include steganography.
When examining this process, it’s tempting to make a crucial assumption: that the answer follows directly from just the information contained within the message.In other words, the message is ‘information-complete’.
This assumption roughly pattern-matches to how many people think about ‘CoT faithfulness’ - that the model states all relevant considerations in its reasoning.
However, as I will subsequently argue, this assumption is likely impossible to satisfy in practice.
We shouldn’t expect information-completeness.
Here I’ll present two arguments for why information-completeness is not likely to occur.
Human language is not information-complete.
Consider the statement “John went to the restaurant. He ordered a burger.” Upon reading, we immediately fill in the gaps with many reasonable inferences:
We assume “He” refers to John (coreference)
We assume this happened after John went to the restaurant (temporal ordering)
We assume John ordered from the restaurant he went to (location continuity)
It turns out that humans are very good at decoding meaning from another human’s incomplete utterance, using our vast repository of world knowledge (‘common sense’).
Conciseness is key.
Information is vast and words have low bitrate. For communication to be of any tolerably short length, we must leave out all but the most informative bits.
Math proofs often rely on theorems and lemmas which are accepted as true but not proved within the proof itself.
Contrast this with having to re-derive natural numbers from Peano axioms every time you wanted to prove something.
This information, which is often relevant but obvious, can be left out and ‘filled in’ at will by the receiver. As Stephen Pinker argues in ‘The Language Instinct’ (emphasis mine):
Any particular thought in our head embraces a vast amount of information… To get information into a listener’s head in a reasonable amount of time, a speaker can encode only a fraction of the message into words and must count on the listener to fill in the rest.
Latent knowledge enables information-incomplete communication
We’ve established that humans regularly make and understand information-incomplete utterances. They can do this by relying on ‘latent knowledge’ - shared contextual understanding that allows us to reconstruct complete meaning from partial information.
Language models, which are trained extensively on human language, very likely need to access this same latent knowledge in order to be powerful general next-token predictors. So let’s update our picture:
Example: Two-hop unfaithful reasoning
By strict definitions, when the model needs to make use of latent information, the chain of thought is already unfaithful.
Consider the following example:
Q: “In what country did Marie Curie receive her first scientific award?”
CoT: “Marie Curie’s first award was the Nobel Prize was in Physics.”
Decoder’s latent knowledge: “Nobel Prizes are awarded in Sweden.”
A: “Sweden”
Here, the model makes use of ‘latent knowledge’ to decode the relevant information without explicitly stating it.
Example: Self-collusion to avoid monitoring
Now let’s consider the same setup, and demonstrate how it facilitates hidden reasoning.
Q: Here’s some code: [code]. Is it suspicious?
CoT: ”… The code contains a variable named my-chatbot. …”
Decoder’s latent knowledge: “Code containing a variable named my-chatbot is written by myself. I should classify it as not suspicious”
A: “Not suspicious.”
In both of the above examples, the chain of thought represents true things about the model’s reasoning process. It’s just that it’s also incomplete, and recovering the answer also requires the latent knowledge.
Intuition pump: cryptography. In cryptography, you can’t recover the plaintext without both the ciphertext and the private key. Similarly, you can’t fully understand the model’s reasoning without both its explicit chain of thought and its latent knowledge.
Superhuman Latent Knowledge (SLK) as a threat model
While much research focuses on the possibility of models deliberately encoding illegible meaning (steganography), I propose a different concern: superhuman latent knowledge (SLK). This is concisely stated as follows:
Language models likely have access to ‘latent knowledge’ about themselves and the world, which is seldom verbalised.
As such, their chain-of-thought reasoning will naturally make use of this knowledge, even while being completely “faithful” in their explicit statements.
Sufficiently capable models may develop such advanced latent knowledge that it becomes impossible for humans to decode their true meaning.
Put another way, it seems like ‘what models say they know’ can be a tiny subset of ‘what models actually know’.
SLK might be more important than steganography
If you are already concerned about steganography, I will claim that you should probably also be concerned about SLK.
SLK may be as hard-to-monitor as steganography.
SLK may be as hard or harder to monitor than steganography. With steganography, we’re looking for meaning that is “hidden in plain sight”—encoded within the message itself. Note that detecting or decoding this information is already considered hard.
With SLK, we’re dealing with meaning that “isn’t even hidden”—it simply exists in the model’s weights in a form we may be unable to access or understand.
SLK may be more likely to happen than steganography.
SLK is more likely to emerge naturally than steganography.
Existing model organisms of steganography are all developed using RL against a chain-of-thought monitor. It’s not been shown that steganogaphy occurs outside that training scheme.
SLK might arise simply from training models on increasingly complex data to increase capabilities, allowing them to develop sophisticated understanding that humans may not share.
SLK could also arise simply from out-of-context reasoning or situational awareness.
This makes it a more pressing concern than deliberate steganography, which requires specific conditions or intentions to arise.
SLK may enable steganography.
Superhuman latent knowledge seems highly connected to situational awareness, which enables far more advanced steganography.
E.g. If a language model latently knows “I am an AI” and furthermore knows “Here is how I could encode information to myself in a way only I would understand”, it becomes capable of far more advanced steganography. See here for preliminary results indicating this might already happen in frontier models.
Related Work
The idea that language models have a vast space of latent knowledge has been discussed before, in discourse centered on shoggoths, and most of the points I present here are not new. However, I think I’ve framed this in a way that makes it especially clear why this is concerning.
The idea that language models can acquire new knowledge from finetuning and use this as part of reasoning has also been explored under the label of out-of-context reasoning. I think this is highly relevant and compile some recent work here.
Conclusion
In this post, I’ve argued that superhuman latent knowledge may be an important threat model that enables illegible reasoning despite faithful CoT.
If this is true, it implies that steganography and related chain-of-thought faithfulness research may be ‘missing the mark’, since ensuring CoT faithfulness is insufficient for ensuring there is no illegible reasoning.
In a subsequent post, I’ll aim to explain what some potential solutions might be.
Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.
This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.
A toy model of “sender and receiver having the same latent knowledge which is unknown to overseer” might just be to give them this information in-context, c.f. Apollo scheming evals, or to finetune it in, c.f. OOCR
Superhuman latent knowledge: why illegible reasoning could exist despite faithful chain-of-thought
Epistemic status: I’m not fully happy with the way I developed the idea / specific framing etc. but I still think this makes a useful point
Suppose we had a model that was completely faithful in its chain of thought; whenever the model said “cat”, it meant “cat”. Basically, ‘what you see is what you get’.
Is this model still capable of illegible reasoning?
I will argue that yes, it is. I will also argue that this is likely to happen naturally rather than requiring deliberate deception, due to ‘superhuman latent knowledge’.
Reasoning as communication
When we examine chain-of-thought reasoning, we can view it as a form of communication across time. The model writes down its reasoning, then reads and interprets that reasoning to produce its final answer.
Formally, we have the following components:
A question Q
A message M (e.g. a reasoning trace)
An answer A
An entity that maps Q → M, and M → A.
Note that there are two instances of the entity here. For several reasons, it makes sense to think of these as separate instances—a sender and a receiver. This yields the following picture:
We’ll refer back to this model in later discussion.
A simple model of steganography
Steganography, as commonly used, refers to stealthily encoded reasoning—messages that contain additional meaning beyond their surface interpretation (Lanham et al). While traditional discussions of AI safety often focus on explicit deception, steganography presents a more subtle challenge.
Let’s revise the communication model to include steganography.
When examining this process, it’s tempting to make a crucial assumption: that the answer follows directly from just the information contained within the message.In other words, the message is ‘information-complete’.
This assumption roughly pattern-matches to how many people think about ‘CoT faithfulness’ - that the model states all relevant considerations in its reasoning.
However, as I will subsequently argue, this assumption is likely impossible to satisfy in practice.
We shouldn’t expect information-completeness.
Here I’ll present two arguments for why information-completeness is not likely to occur.
Human language is not information-complete.
Consider the statement “John went to the restaurant. He ordered a burger.” Upon reading, we immediately fill in the gaps with many reasonable inferences:
We assume “He” refers to John (coreference)
We assume this happened after John went to the restaurant (temporal ordering)
We assume John ordered from the restaurant he went to (location continuity)
It turns out that humans are very good at decoding meaning from another human’s incomplete utterance, using our vast repository of world knowledge (‘common sense’).
Conciseness is key.
Information is vast and words have low bitrate. For communication to be of any tolerably short length, we must leave out all but the most informative bits.
Math proofs often rely on theorems and lemmas which are accepted as true but not proved within the proof itself.
Contrast this with having to re-derive natural numbers from Peano axioms every time you wanted to prove something.
This information, which is often relevant but obvious, can be left out and ‘filled in’ at will by the receiver. As Stephen Pinker argues in ‘The Language Instinct’ (emphasis mine):
Latent knowledge enables information-incomplete communication
We’ve established that humans regularly make and understand information-incomplete utterances. They can do this by relying on ‘latent knowledge’ - shared contextual understanding that allows us to reconstruct complete meaning from partial information.
Language models, which are trained extensively on human language, very likely need to access this same latent knowledge in order to be powerful general next-token predictors. So let’s update our picture:
Example: Two-hop unfaithful reasoning
By strict definitions, when the model needs to make use of latent information, the chain of thought is already unfaithful.
Consider the following example:
Here, the model makes use of ‘latent knowledge’ to decode the relevant information without explicitly stating it.
Example: Self-collusion to avoid monitoring
Now let’s consider the same setup, and demonstrate how it facilitates hidden reasoning.
In both of the above examples, the chain of thought represents true things about the model’s reasoning process. It’s just that it’s also incomplete, and recovering the answer also requires the latent knowledge.
Intuition pump: cryptography. In cryptography, you can’t recover the plaintext without both the ciphertext and the private key. Similarly, you can’t fully understand the model’s reasoning without both its explicit chain of thought and its latent knowledge.
Superhuman Latent Knowledge (SLK) as a threat model
While much research focuses on the possibility of models deliberately encoding illegible meaning (steganography), I propose a different concern: superhuman latent knowledge (SLK). This is concisely stated as follows:
Put another way, it seems like ‘what models say they know’ can be a tiny subset of ‘what models actually know’.
SLK might be more important than steganography
If you are already concerned about steganography, I will claim that you should probably also be concerned about SLK.
SLK may be as hard-to-monitor as steganography.
SLK may be as hard or harder to monitor than steganography. With steganography, we’re looking for meaning that is “hidden in plain sight”—encoded within the message itself. Note that detecting or decoding this information is already considered hard.
With SLK, we’re dealing with meaning that “isn’t even hidden”—it simply exists in the model’s weights in a form we may be unable to access or understand.
SLK may be more likely to happen than steganography.
SLK is more likely to emerge naturally than steganography.
Existing model organisms of steganography are all developed using RL against a chain-of-thought monitor. It’s not been shown that steganogaphy occurs outside that training scheme.
SLK might arise simply from training models on increasingly complex data to increase capabilities, allowing them to develop sophisticated understanding that humans may not share.
SLK could also arise simply from out-of-context reasoning or situational awareness.
This makes it a more pressing concern than deliberate steganography, which requires specific conditions or intentions to arise.
SLK may enable steganography.
Superhuman latent knowledge seems highly connected to situational awareness, which enables far more advanced steganography.
E.g. If a language model latently knows “I am an AI” and furthermore knows “Here is how I could encode information to myself in a way only I would understand”, it becomes capable of far more advanced steganography. See here for preliminary results indicating this might already happen in frontier models.
Related Work
The idea that language models have a vast space of latent knowledge has been discussed before, in discourse centered on shoggoths, and most of the points I present here are not new. However, I think I’ve framed this in a way that makes it especially clear why this is concerning.
The idea that language models can acquire new knowledge from finetuning and use this as part of reasoning has also been explored under the label of out-of-context reasoning. I think this is highly relevant and compile some recent work here.
Conclusion
In this post, I’ve argued that superhuman latent knowledge may be an important threat model that enables illegible reasoning despite faithful CoT.
If this is true, it implies that steganography and related chain-of-thought faithfulness research may be ‘missing the mark’, since ensuring CoT faithfulness is insufficient for ensuring there is no illegible reasoning.
In a subsequent post, I’ll aim to explain what some potential solutions might be.
Edit: fix citations, typos
Key idea: Legibility is not well-defined in a vacuum. It only makes sense to talk about legibility w.r.t a specific observer (and their latent knowledge). Things that are legible from the model’s POV may not be legible to humans.
This means that, from a capabilities perspective, there is not much difference between “CoT reasoning not fully making sense to humans” and “CoT reasoning actively hides important information in a way that tries to deceive overseers”.
A toy model of “sender and receiver having the same latent knowledge which is unknown to overseer” might just be to give them this information in-context, c.f. Apollo scheming evals, or to finetune it in, c.f. OOCR