Does model introspection extend to CoT reasoning? Probing steganographic capabilities of LLMs
To the extent that o1 represents the future of frontier AI systems, I predict that CoT is likely to get longer as the reasoning gets broken into more fine-grained (and verifiable) intermediate steps.
Why might this be important? Transformers have a fixed context window; in the limit of extremely long reasoning traces, far too large to fit in a single context window, the model must “get good” at transmitting information to its (future) self. Furthermore, with transformers specifically, this must be done explicitly in the CoT, as they do not have long-term recurrence.
So what information is likely to be present in the chain of thought? Beyond the object-level information, there is likely a bunch of other stuff being transmitted, e.g. subtle statistical patterns which convey information not explicitly stated in the CoT. Note that this satisfies the definition of steganography. Furthermore, we already know that models can use A/B choices of synonyms to send 1 bit of information. Note that some kinds of steganography are more mundane and benign than others.
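To make the A/B synonym idea concrete, here is a minimal sketch of how one bit per word choice could ride along in otherwise ordinary text. The synonym pairs, the template sentence and the function names are all made up for illustration; this is not taken from any particular paper or model.

```python
# Hypothetical synonym pairs: picking the first word encodes 0, the second encodes 1.
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("start", "begin")]


def encode(template: str, bits: list[int]) -> str:
    """Fill slots {0}, {1}, ... with the synonym selected by the matching bit."""
    words = [pair[bit] for pair, bit in zip(SYNONYM_PAIRS, bits)]
    return template.format(*words)


def decode(text: str) -> list[int]:
    """Recover the bits by checking which member of each pair appears in the text."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    bits = []
    for first, second in SYNONYM_PAIRS:
        if first in tokens:
            bits.append(0)
        elif second in tokens:
            bits.append(1)
    return bits


# The sentence reads naturally either way; only the word choices differ.
template = "We {2} with a {0} number and look for a {1} way to factor it."
message = encode(template, [1, 0, 1])
print(message)          # We begin with a large number and look for a quick way to factor it.
print(decode(message))  # [1, 0, 1]
```

A reader who does not know the pair table sees a normal sentence, which is why this kind of channel counts as information that is present but not explicitly stated.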
Weakly related phenomenon: a recent paper on AI introspection suggests that models know more about themselves than other (more capable) models know about them.
Corollary: we’d expect that a model using its own chain of thought to arrive at an answer is likely to perform better than when using a different chain of thought, even if that chain of thought is provided by a more capable model. (I think this is an interesting and valuable experiment to run, and AFAIK nobody’s done it!)
If this is true, it suggests that models already do steganography (I expect this is of the benign, low-level type that is not directly safety-relevant, but it could be an important data point in building a larger “science of steganography”).
If this is false, it suggests that all language models can already communicate in some kind of universal language.
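A rough sketch of how the own-CoT versus other-CoT comparison could be scored. The callables are hypothetical placeholders for real model calls, and the toy stand-ins at the bottom exist only so the snippet runs; they say nothing about how real models behave.

```python
from typing import Callable

CotFn = Callable[[str], str]          # question -> chain of thought
AnswerFn = Callable[[str, str], str]  # (question, chain of thought) -> final answer
Dataset = list[tuple[str, str]]       # (question, gold answer)


def accuracy_with_cot(answerer: AnswerFn, cot_source: CotFn, dataset: Dataset) -> float:
    """Fraction of questions answered correctly when conditioned on CoT from cot_source."""
    correct = 0
    for question, gold in dataset:
        cot = cot_source(question)
        if answerer(question, cot).strip() == gold:
            correct += 1
    return correct / len(dataset)


def own_vs_other_gap(answerer: AnswerFn, own_cot: CotFn, other_cot: CotFn,
                     dataset: Dataset) -> float:
    """Positive gap: the answerer does better with its own CoT than with the other model's."""
    own = accuracy_with_cot(answerer, own_cot, dataset)
    other = accuracy_with_cot(answerer, other_cot, dataset)
    return own - other


# --- Toy stand-ins (assumptions, for illustration only) ---
def weak_cot(question: str) -> str:
    base, exp = question.split("^")
    steps = " * ".join([base] * int(exp))
    return f"{steps} = {int(base) ** int(exp)}"


def strong_cot(question: str) -> str:
    return f"{question} can be computed directly."


def weak_answerer(question: str, cot: str) -> str:
    # The toy weak model can only copy a result that is spelled out in the CoT.
    return cot.split("=")[-1].strip() if "=" in cot else "unknown"


dataset = [("4^4", "256"), ("2^5", "32"), ("3^3", "27")]
gap = own_vs_other_gap(weak_answerer, weak_cot, strong_cot, dataset)
print(gap)  # 1.0
```

In this toy the gap comes entirely from explicit content that the stronger model's CoT leaves out, so a positive gap on its own would not distinguish hidden information from ordinary differences in how models decompose a problem.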
Alternative handles:
“Models can already do steganography”
“Chain of thought as an information channel”
“Why CoT might be key to interpreting future AI”
Relatedly, on introspection: can we devise some unlearning procedure that removes models’ capability to introspect? This might reduce their situational awareness.
That’s interesting. One underlying consideration is that the object-level choices of reasoning steps are relative to a reasoner: agents with different abilities need to decompose problems differently, know different things, and might benefit from different ways of thinking. Therefore, a model plausibly chooses CoT that works well for it “on the object level”, without any steganography or other hidden information being necessary. If that is true, then we would expect to see models benefit from their own CoT over that of others for basic, non-steganographic reasons.
Consider a grade schooler and a grad student thinking out loud. Each benefits from having access to their own CoT, and wouldn’t get much from the other’s, for obvious reasons.
I think the question of whether models actually choose their CoT with respect to their own needs, knowledge and ability is a very interesting one that is closely related to introspection.
Re steganography for chain-of-thought: I’ve been working on a project related to this for a while, looking at whether RL for concise and correct answers might teach models to steganographically encode their CoT for benign reasons. There’s an early write-up here: https://ac.felixbinder.net/research/2023/10/27/steganography-eval.html
Currently, I’m working with two BASIS fellows on actually training models to see if we can elicit steganography this way. I’m definitely happy to chat more/set up a call about this topic.
Comment from @James Chua in internal slack group
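interesting idea! thanks for writing this out. this is my initial reaction, i may have more later on. i agree with the corollary (a model should do better with its own CoT than with another model’s, even a more capable one); this is what i’d expect.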
but if the above is true, I’m unsure about evidence for steganography.
for example, gpt-4o is more capable than gpt-3.5. suppose we provide gpt-3.5 with gpt-4o’s CoT.
one reason why gpt-3.5 may not improve is that gpt-4o’s CoT skips certain steps since gpt-4o is a more capable model.
e.g. gpt-4o can calculate 4^4 directly, but gpt-3.5 cannot. So gpt-4o’s CoT will leave out the steps 4 * 4 * 4 * 4, which gpt-3.5 needs to get the correct answer. do we then say that gpt-3.5 is doing steganography?
possible improvement: provide gpt-3.5 with its own CoT, but paraphrased by another model. gpt-3.5 sees the same steps of “4 * 4 * 4 * 4”, but paraphrased. Does it still perform as well?
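The paraphrase control could be scored roughly as follows: compare accuracy on the model's original CoT against accuracy on a paraphrase that keeps the same steps. Everything here is a sketch under assumptions; `paraphrase_sensitivity` and the toy functions are hypothetical names, and the two toy answerers are contrived to show the two possible outcomes.

```python
from typing import Callable

Dataset = list[tuple[str, str]]  # (question, gold answer)


def accuracy(answer_fn: Callable[[str, str], str],
             cot_fn: Callable[[str], str],
             dataset: Dataset) -> float:
    """Fraction of questions answered correctly given the CoT produced by cot_fn."""
    hits = sum(answer_fn(q, cot_fn(q)).strip() == gold for q, gold in dataset)
    return hits / len(dataset)


def paraphrase_sensitivity(answer_fn, cot_fn, paraphrase_fn, dataset: Dataset) -> float:
    """Accuracy with the original CoT minus accuracy with the paraphrased CoT."""
    original = accuracy(answer_fn, cot_fn, dataset)
    paraphrased = accuracy(answer_fn, lambda q: paraphrase_fn(cot_fn(q)), dataset)
    return original - paraphrased


# --- Toy stand-ins (assumptions, for illustration only) ---
def toy_cot(question: str) -> str:
    base, exp = question.split("^")
    return " * ".join([base] * int(exp)) + f" = {int(base) ** int(exp)}"


def toy_paraphrase(cot: str) -> str:
    # Same steps, different surface form.
    return cot.replace(" * ", " times ").replace(" = ", " equals ")


def reads_content(question: str, cot: str) -> str:
    # Uses only the stated result, whatever the wording.
    return cot.split()[-1]


def reads_surface_form(question: str, cot: str) -> str:
    # Depends on the exact "=" symbol, so it breaks under paraphrase.
    return cot.split("=")[-1].strip() if "=" in cot else "unknown"


dataset = [("4^4", "256"), ("2^5", "32")]
print(paraphrase_sensitivity(reads_content, toy_cot, toy_paraphrase, dataset))       # 0.0
print(paraphrase_sensitivity(reads_surface_form, toy_cot, toy_paraphrase, dataset))  # 1.0
```

A sensitivity near zero would suggest the model relies on the stated steps; a large drop would suggest it relies on something in the exact wording, which is the more interesting case for steganography.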
Expanding on this, would minimizing information in the chain of thought be a viable solution to steganography?
Based on the above intuition, the “actual” information present in the CoT may be much greater than the “apparent” information (visible as explicitly stated reasoning).
If we additionally train the models with an objective that penalises the amount of “extra” information not present in the explicitly stated reasoning, we might be able to solve steganography.
Examples of penalizing extra information:
Train the model such that its CoT is predictable by some other (fixed, trusted) model.
Apply entropy regularization (see: Yann LeCun’s JEPA for an example here).
It turns out that “train the model such that its CoT is predictable by a different model” is exactly the idea in prover-verifier games. That’s very exciting! Do PVGs reduce introspection or steganography?
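One possible shape for the “predictable by a trusted model” penalty, as a toy sketch: score the CoT by its average surprisal under a fixed reference model and add that to the task loss. The add-one-smoothed unigram model below is only a stand-in so the snippet runs; a real setup would presumably use token log-probabilities from an actual trusted language model, and `weight` is a hypothetical hyperparameter.

```python
import math
from collections import Counter


def fit_unigram(corpus: list[str], smoothing: float = 1.0):
    """Toy stand-in for a fixed, trusted model: smoothed unigram over whitespace tokens."""
    counts = Counter(tok for text in corpus for tok in text.split())
    vocab_size = len(counts) + 1  # +1 bucket for unseen tokens
    total = sum(counts.values())

    def prob(token: str) -> float:
        return (counts.get(token, 0) + smoothing) / (total + smoothing * vocab_size)

    return prob


def surprisal_per_token(cot: str, prob) -> float:
    """Average negative log2-probability (bits per token) of the CoT under the trusted model."""
    tokens = cot.split()
    return -sum(math.log2(prob(t)) for t in tokens) / len(tokens)


def penalised_loss(task_loss: float, cot: str, prob, weight: float = 0.1) -> float:
    """Task loss plus a penalty for CoT that the trusted model finds hard to predict."""
    return task_loss + weight * surprisal_per_token(cot, prob)


trusted = fit_unigram(["4 * 4 = 16", "16 * 4 = 64", "64 * 4 = 256"])
plain = "64 * 4 = 256"
odd = "64 multiplied-by 4 yields 256"

# The unusual phrasing costs more bits per token, so it is penalised more.
print(surprisal_per_token(plain, trusted) < surprisal_per_token(odd, trusted))  # True
print(penalised_loss(0.5, plain, trusted) < penalised_loss(0.5, odd, trusted))  # True
```

Low surprisal under the trusted model is only a proxy: it pushes the CoT towards text the reference model would have written, but it does not by itself guarantee that no hidden information remains.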