I think this is useful research, and I had a bunch of questions/hypotheses as to why it might be happening:
CoT is not true reasoning, just another form of output that the model is rewarded for producing. It is not a faithful representation of how the model is actually “thinking” or “reasoning”; it is just useful for attaining the reward the model is promised. It will be obfuscated, and it may differ from case to case, but that is by design. If you suppress a word in one kind of output, it will be suppressed in other kinds of output too.
I would love it if you could include a section or a control group that determines whether the model is deceptively hiding its strategy or has simply learned that certain words are not helpful for reward, so it finds a different, non-penalized path to the reward. The suppression in CoT could easily be a non-malicious word omission rather than a complex obfuscation. (I am not sure how to separate the two, but the claim about obfuscation loses credibility if you can't demonstrate that.)
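To make that distinction slightly more concrete, here is one purely illustrative way it might be operationalized: if the model is merely omitting a penalized word, its CoT should also drop the underlying concept; if it is obfuscating, the concept should still show up under paraphrase. This is only a rough sketch under my own assumptions (the embedding model, the similarity threshold, and the naive sentence splitting are arbitrary choices, not anything from the post), using the sentence-transformers library for semantic similarity:

```python
# Illustrative heuristic only: flag CoTs where a penalized term's concept
# persists under paraphrase (candidate obfuscation) vs. disappears
# entirely (plain omission). Model name and threshold are placeholders.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def classify_cot(cot_text: str, penalized_term: str,
                 concept_description: str, sim_threshold: float = 0.5) -> str:
    """Crude check: literal term gone but concept still semantically present
    -> candidate obfuscation; concept also gone -> plain omission."""
    if penalized_term.lower() in cot_text.lower():
        return "term still present"
    # Compare each CoT sentence against a description of the penalized concept.
    sentences = [s.strip() for s in cot_text.split(".") if s.strip()]
    if not sentences:
        return "empty CoT"
    sent_emb = embedder.encode(sentences, convert_to_tensor=True)
    concept_emb = embedder.encode(concept_description, convert_to_tensor=True)
    max_sim = util.cos_sim(concept_emb, sent_emb).max().item()
    return "candidate obfuscation" if max_sim >= sim_threshold else "plain omission"

# Hypothetical usage:
# classify_cot(cot, penalized_term="hack",
#              concept_description="exploiting the reward function or the tests")
```

This would obviously need human labels to validate, but some discriminator along these lines seems necessary before calling the behavior obfuscation rather than omission.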
There is a simpler and more direct explanation for the spillover: instrumental convergence. The model's goal is to maximize reward. A CoT that looks safe is useful for achieving that goal because it reduces the probability of the penalized words ending up in the final output. This is not generalization of a safety paradigm; it is the direct path to reward, namely suppressing the problematic terms everywhere.
Would love to see the results. My personal stance is that using highly evocative (human-related) words for AI behavior should be acceptable only when there is strong justification for it; otherwise we tend to think of these systems as more than they are and assume intent where there might be none. (We probably do that with other humans too.)
This was my first assumption as well. If a model learns that certain words are bad for the goal of achieving the reward, it is less likely to output those words anywhere. This is instrumental convergence, and possibly the simplest explanation of the phenomenon.