Having played around at this a little myself, I agree that models seem to be very bad at steganography from prompting, and also very overconfident that they are doing it right when they are not.
I suspect what is going on is that a prompted description of a steganographic algorithm is in-context reasoning, and models naturally do in-context reasoning in their middle layers. But to do encoded CoT, the steganographic algorithm needs to be done in the initial and final layers, which I suspect in-context reasoning has little access to — it might be a bit like the division between the conscious and unconscious mind in humans: there are parts of our brains where we’re good at introspection (i.e. algorithm → text) and following instructions (text → algorithm), and others where we’re not. For LLMs, I suspect they’re good in the middle layers and bad in the initial and final layers.
Another obvious issue is that if a model has learned to use its initial and final layers for steganographic decoding and encoding, then the token processing normally done in those layers has to move to layers nearer the middle. Those layers presumably have back-up circuitry with some ability at these tasks, since base models have had to learn natural cases, like base64-encoded text, where doing this is necessary. But models' performance observably degrades significantly when they input and output in base64, so presumably the backup circuitry is less good, and it starts to cut into the number of layers available for the conceptual thinking that normally goes on in the middle layers. If everything gets compressed inward, much of the processing presumably runs on backup circuitry in layers mostly unaccustomed to the work, and as a result the model's performance degrades significantly. This probably also limits the complexity of viable steganographic encoding and decoding schemes (ones usable without a massive amount of training) to those that can be done in around the same number of layers as widespread encodings such as base64 and ROT-13, which are pretty simple: if the steganography takes significantly more layers than that, then token processing ends up in layers that have no back-up circuitry for it at all, and so on up and back down the conceptual progression.
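For a sense of just how computationally shallow the encodings mentioned above are, both are essentially stateless, fixed-window transforms (a minimal sketch using Python's standard library):

```python
import base64
import codecs

message = "meet at dawn"

# ROT-13: a fixed letter-by-letter substitution; no state carried
# between characters.
rot13 = codecs.encode(message, "rot_13")

# Base64: regroups bits in a fixed window (3 bytes -> 4 characters);
# again, no long-range state.
b64 = base64.b64encode(message.encode()).decode()

print(rot13)  # zrrg ng qnja
print(b64)    # bWVldCBhdCBkYXdu

# Both are invertible with equally shallow computation:
assert codecs.decode(rot13, "rot_13") == message
assert base64.b64decode(b64).decode() == message
```

Any steganographic scheme needing substantially deeper per-token computation than transforms like these would, on the hypothesis above, overflow the layers that have any relevant backup circuitry.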
[This is a collection of untested speculations: it would be interesting to test them, but I am unlikely to do so — anyone who wants to, go right ahead.]
Now, even steganographically concealed CoT that is both token-inefficient and significantly degraded in capability could still be dangerous: this doesn't remove the risk entirely, it just suggests that this tactic is genuinely hard for a model to learn and has some significant inherent limitations. That is probably why, in practice, models given RL feedback from a CoT monitor tend instead to settle on tactics like automated jailbreaking of the monitor: see Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning for a recent example.
Thanks! I'm not sure whether "models naturally do in-context reasoning in their middle layers" is true, or to what degree, because that depends on what type of reasoning it is. Attending to previous tokens can be done in the first layers, and those tokens can already carry some results of previous reasoning.
And in the case of steganographic reasoning (not steganographic messages), it seems that, yes, models should have hidden reasoning in some middle layers, because they need to first decode the previous information, then do some reasoning, and then encode it into a next token that carries the secret. To illustrate with the counting experiment from this post: to produce each output token, the model must decode the previous digit, then either increment the number (if the decoded digit is the last digit of the current number) or move to the next digit of the current number (it seems trivial, but models struggle to do even that), and then encode the next digit of the incremented or current number. So the process requires decoding and then encoding, in that order, on each forward pass of the steganographic reasoning. Another hypothesis is that this counting experiment just reflects that models are capable of calculating the sequence of numbers and their corresponding letters before the output, and then using those letters later. That would actually be interesting to check with interpretability: which tokens does the model attend to at which stego tokens?
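The per-token decode/step/encode loop described above can be sketched as follows. This is a hypothetical reconstruction for illustration: the digit-to-letter substitution (a = 0 through j = 9) and the exact sequencing are assumptions, not necessarily the encoding used in the actual experiment.

```python
def decode(letter: str) -> int:
    """Recover the hidden digit from a stego letter (assumed a=0 ... j=9)."""
    return ord(letter) - ord('a')

def encode(digit: int) -> str:
    """Hide a digit as a letter under the same assumed substitution."""
    return chr(ord('a') + digit)

def next_stego_token(prev_tokens: list[str]) -> str:
    """One 'forward pass': decode all digits emitted so far, work out the
    next digit of the sequence 1, 2, 3, ... written digit-by-digit, encode it."""
    emitted = "".join(str(decode(t)) for t in prev_tokens)
    target = "".join(str(n) for n in range(1, 100))  # "123456789101112..."
    return encode(int(target[len(emitted)]))

tokens: list[str] = []
for _ in range(12):
    tokens.append(next_stego_token(tokens))
print("".join(tokens))  # prints bcdefghijbab, hiding 1 2 3 4 5 6 7 8 9 1 0 1
```

The point the sketch makes concrete: every single token requires a decode of the running state followed by an encode of the update, so the "real" reasoning is squeezed into whatever layers sit between those two operations.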
I’m not sure if “models naturally do in-context reasoning in their middle layers” is true and to what degree
I’m not sure either — it seems plausible to me. Much of my comment is a collection of hunches, guesses, and speculations along with supporting evidence that caused me to guess them.