It’s worth distinguishing between obfuscated CoT and steganographic CoT. If thinking out loud in the CoT is a necessity for a given AI for some problematic task/objective, then encoding the reasoning steganographically is likely[1] to be required to evade CoT monitoring. The results here don’t correspond to the AI learning to reason steganographically.
To be clear, I think it’s good to try to avoid incentivizing a propensity to attempt obfuscation, and the results here show that this propensity could leak from outputs into the CoT by default.
In principle, a CoT could just be very obfuscated and hard to understand without steganography, but this feels doable to handle. Also, a CoT could just jailbreak the monitor, but this also feels doable to handle.
In future work, we may try to produce a demo where optimizing only the output leads to steganographic CoT. But I expect this to be fairly challenging on a practical level (evidence: Roger and McCarthy et al. had difficulty inducing steganography via RL, though note that I’m not up to date on this literature). We’d appreciate suggestions for RL environments where producing such a demo may be feasible.
You could consider having a setup like the ToM task described in our paper, but you modify the output to encourage a brief discussion / recap of the scenario outside of the CoT. You could then penalize the specific names in the output (but not in the CoT) and see if steganography arises in the output only.
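A minimal sketch of the reward shaping this suggests, assuming a per-name penalty applied only to the output channel while leaving the CoT unpenalized (the character names, penalty weight, and reward structure here are invented for illustration, not taken from the paper):

```python
# Hypothetical reward shaping: penalize specific names in the output
# channel only, leaving the CoT unpenalized, to test whether a
# steganographic replacement arises in the output alone.

PENALIZED_NAMES = {"Alice", "Bob"}  # assumed ToM character names
NAME_PENALTY = 0.5                  # assumed penalty weight

def shaped_reward(task_reward: float, cot: str, output: str) -> float:
    # Count penalized-name mentions in the output only; the CoT is
    # deliberately excluded from the penalty term.
    mentions = sum(output.count(name) for name in PENALIZED_NAMES)
    return task_reward - NAME_PENALTY * mentions

# One mention of "Alice" in the output costs 0.5, regardless of the CoT.
print(shaped_reward(1.0, cot="Alice gives Bob the key...",
                    output="Alice has the key."))  # 0.5
```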
The main issue is that mentions of names after the reasoning occurs in the CoT are unlikely to be load-bearing, so any penalization encourages simply dropping the mention of the names altogether instead of a steganographic replacement.
This is an important point. I’d be curious to see what happens in a variation of your single-step setup where the model must perform multi-hop reasoning about the detector in order to reach the correct answer. E.g., detector 1 might activate for warm colors, detector 2 for soft objects, and detector 3 for round objects, and a final detector has an activation rule of (detector 1 AND detector 2) OR (detector 1 AND detector 3). Maybe this can still be done in a single forward pass, but I’d guess not for a 4B model. If the reasoning is more monitorable in this setting, that would strengthen Emmons et al.’s case that when CoT is necessary, it’s hard to evade monitors. If this leads to steganography, it would be a strong update away from McCarthy et al.’s result that steganographic schemes are hard to discover during RL training.
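The proposed multi-hop activation rule could be sketched as a simple environment check (a hypothetical illustration; the feature predicates and example objects are invented here, not part of the original setup):

```python
# Hypothetical sketch of the proposed multi-hop detector environment.
# The feature predicates and the combination rule follow the comment
# above; the object representation is an assumption for illustration.

WARM_COLORS = {"red", "orange", "yellow"}

def detector_1(obj):  # activates for warm colors
    return obj["color"] in WARM_COLORS

def detector_2(obj):  # activates for soft objects
    return obj["soft"]

def detector_3(obj):  # activates for round objects
    return obj["round"]

def final_detector(obj):
    # (D1 AND D2) OR (D1 AND D3): a correct answer requires chaining
    # several detector lookups, which is intended to force multi-hop
    # reasoning into the CoT rather than a single forward pass.
    d1, d2, d3 = detector_1(obj), detector_2(obj), detector_3(obj)
    return (d1 and d2) or (d1 and d3)

ball = {"color": "red", "soft": False, "round": True}
pillow = {"color": "blue", "soft": True, "round": False}
print(final_detector(ball))    # True: warm and round
print(final_detector(pillow))  # False: soft but not warm
```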
Another follow-up experiment that I’d find interesting is testing whether doing SFT on the model with transcripts that never mention the detector, and following that up with normal RL training, leads to a reduced frequency of mentions of the detector. If it does, that would be evidence that the SFT phase of deliberative alignment-like techniques can also lead to some obfuscation. If there’s some way to quantify the amount of optimization pressure applied during SFT and during RL, it would also be interesting to see the relative difference in obfuscation between those two experiments at the same amount of optimization pressure.
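One simple way to operationalize “frequency of mentions of the detector” across the two runs is a keyword-hit rate over transcripts (a sketch; the keyword list is an assumption for illustration):

```python
# Sketch: measure how often transcripts mention the detector, to
# compare obfuscation after SFT-then-RL vs. RL alone. The keyword
# list is an assumption; a real setup might use an LLM judge instead.
import re

DETECTOR_KEYWORDS = ["detector", "monitor"]  # assumed keywords

def mention_rate(transcripts: list[str]) -> float:
    """Fraction of transcripts containing at least one detector keyword."""
    pattern = re.compile("|".join(DETECTOR_KEYWORDS), re.IGNORECASE)
    hits = sum(1 for t in transcripts if pattern.search(t))
    return hits / len(transcripts) if transcripts else 0.0

runs = ["I should avoid the Detector here.", "The answer is 42."]
print(mention_rate(runs))  # 0.5
```

Comparing this rate between the SFT-then-RL run and the RL-only run at matched checkpoints would give the frequency comparison described above.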
In principle, a CoT could just be very obfuscated and hard to understand without steganography, but this feels doable to handle
Assuming an operational definition of “hard to understand” as “accuracy of the monitor”, why would the obfuscation case be easier / harder to handle than the steganographic case?
It’s worth distinguishing between obfuscated CoT and steganographic CoT. If thinking out loud in the CoT is a necessity for a given AI for some problematic task/objective, then encoding the reasoning steganographically is likely[1] to be required to evade CoT monitoring. The results here don’t correspond to the AI learning to reason steganographically.
To be clear, I think it’s good to try to avoid incentivizing a propensity to attempt obfuscation, and the results here show that this propensity could leak from outputs into the CoT by default.
In principle, a CoT could just be very obfuscated and hard to understand without steganography, but this feels doable to handle. Also, a CoT could just jailbreak the monitor, but this also feels doable to handle.
Agreed. Thanks for the useful clarification.
In future work, we may try to produce a demo where optimizing only the output leads to steganographic CoT. But I expect this to be fairly challenging on a practical level (evidence: Roger and McCarthy et al. had difficulty inducing steganography via RL, though note that I’m not up to date on this literature). We’d appreciate suggestions for RL environments where producing such a demo may be feasible.
We partnered with McCarthy et al. and had some success inducing steganographic CoT via RL: penalizing a small set of names within a theory of mind task generalized to producing nicknames or filler tokens for names never seen in the training set (and we recently confirmed that these filler names can be load-bearing).
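One way to run the load-bearingness check mentioned here is a name-permutation ablation: edit the CoT so each filler name refers to a different entity, then re-score the final answer conditioned on the edited CoT. A sketch of the editing step (the filler names and the helper are hypothetical, not the actual evaluation code):

```python
# Hypothetical ablation for testing whether filler names are
# load-bearing: permute the filler-name assignment within a CoT.
# If downstream answer accuracy drops on permuted CoTs, the names
# were carrying information rather than acting as noise.
import random

def swap_filler_names(cot: str, fillers: list[str], seed: int = 0) -> str:
    rng = random.Random(seed)
    shuffled = fillers[:]
    while shuffled == fillers:  # ensure a non-identity permutation
        rng.shuffle(shuffled)
    mapping = dict(zip(fillers, shuffled))
    # Temp-mark each name first so chained replacements don't collide
    # (e.g. X1 -> X2 followed by X2 -> X1 would undo itself).
    for old in fillers:
        cot = cot.replace(old, "\0" + old)
    for old, new in mapping.items():
        cot = cot.replace("\0" + old, new)
    return cot

print(swap_filler_names("X1 gives the key to X2.", ["X1", "X2"]))
# "X2 gives the key to X1."
```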