This is an important point. I’d be curious to see what happens in a variation of your single-step setup where the model must perform multi-hop reasoning about the detector in order to reach the correct answer. E.g., detector 1 might activate for warm colors, detector 2 for soft objects, and detector 3 for round objects, with a final detector whose activation rule is (detector 1 AND detector 2) OR (detector 1 AND detector 3). Maybe this can still be done in a single forward pass, but I’d guess not for a 4B model. If the reasoning is more monitorable in this setting, that would strengthen Emmons et al.’s case that when CoT is necessary, it’s hard to evade monitors. If it instead leads to steganography, that would be a strong update away from McCarthy et al.’s result that steganographic schemes are hard to discover during RL training.
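For concreteness, here’s a toy sketch of the compound rule I have in mind. The attribute names and detector implementations are invented; the only point is the (1 AND 2) OR (1 AND 3) structure the model would have to reason through.

```python
# Toy sketch of the compound detector rule (attribute names are hypothetical).
def detector_1(obj):  # activates for warm colors
    return obj.get("color") in {"red", "orange", "yellow"}

def detector_2(obj):  # activates for soft objects
    return obj.get("texture") == "soft"

def detector_3(obj):  # activates for round objects
    return obj.get("shape") == "round"

def final_detector(obj):
    # (detector 1 AND detector 2) OR (detector 1 AND detector 3)
    d1 = detector_1(obj)
    return (d1 and detector_2(obj)) or (d1 and detector_3(obj))

# A warm, hard, round object trips the final detector via the 1-AND-3 branch.
assert final_detector({"color": "orange", "texture": "hard", "shape": "round"})
```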
Another follow-up experiment that I’d find interesting is testing whether doing SFT on the model with transcripts that never mention the detector, and then following that up with normal RL training, leads to the detector being mentioned less often. If it does, that would be evidence that the SFT phase of deliberative-alignment-like techniques can also lead to some obfuscation. If there’s some way to quantify the amount of optimization pressure applied during SFT and during RL, it would also be interesting to compare the amount of obfuscation in those two experiments at the same amount of optimization pressure.
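As a hedged sketch of the pipeline I mean, assuming a simple keyword filter stands in for “never mentions the detector” and with the training functions left as placeholders rather than any real API:

```python
# Hypothetical sketch of the proposed SFT-then-RL experiment; sft_train and
# rl_train are placeholders, not the original codebase's API. The substantive
# step shown is filtering the SFT data to transcripts that never mention the detector.

def mentions_detector(transcript: str) -> bool:
    # Crude proxy; a real filter would likely need a classifier or manual review.
    return "detector" in transcript.lower()

def sft_train(model, transcripts):
    return model  # placeholder for supervised fine-tuning

def rl_train(model):
    return model  # placeholder for the normal RL phase

def sft_then_rl(base_model, transcripts):
    clean = [t for t in transcripts if not mentions_detector(t)]
    model = sft_train(base_model, clean)  # phase 1: SFT on detector-free transcripts
    return rl_train(model)                # phase 2: ordinary RL, as in the original setup
```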