My intuition, at least for the single-word case, is that saying a word increases the probability the model will say it in the future. Therefore, if it wants the probability that it says that word in the future to be low, it’s incentivised not to say it anywhere.
I share your intuition, and think you are describing what we called “conditional spillover”.[1] The fact that the Mind+Face intervention did not fully prevent spillover in our experiments is some evidence that conditional spillover is occurring.
One might mitigate conditional spillover as follows. In REINFORCE, the usual parameter update is (setting batch size = lr = 1 for brevity, and being a little schematic):

$$\delta\theta = r\Big(\underbrace{\nabla_\theta \log p_\theta(\text{CoT})}_{g_\text{CoT}} + \underbrace{\nabla_\theta \log p_\theta(\text{output}\mid\text{CoT})}_{g_\text{out}}\Big).$$

Here, $r$ is the reward, which depends only on the output. In the case where $r$ is the sum of a task-completion term and an obfuscation-inducing term (as in our current work), we can write:
$$\delta\theta = (r_\text{task} + r_\text{obfusc})\,(g_\text{CoT} + g_\text{out}).$$

In order to mitigate conditional spillover, we can simply modify the gradient update by removing the $r_\text{obfusc}\, g_\text{CoT}$ term:

$$(\delta\theta)_\text{modified} = r_\text{task}\, g_\text{CoT} + (r_\text{task} + r_\text{obfusc})\, g_\text{out}.$$
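To make the modified update concrete, here is a minimal PyTorch-style sketch (all names hypothetical): it assumes you already have per-token log-probabilities of the sampled sequence and a mask marking which tokens belong to the CoT, and simply weights CoT tokens by $r_\text{task}$ alone while output tokens get $r_\text{task} + r_\text{obfusc}$.

```python
import torch

def modified_reinforce_loss(logprobs: torch.Tensor,
                            is_cot: torch.Tensor,
                            r_task: float,
                            r_obfusc: float) -> torch.Tensor:
    """Scalar loss whose negative gradient is (delta theta)_modified."""
    # Per-token reward: r_task everywhere, r_obfusc added only on output tokens.
    per_token_reward = r_task + r_obfusc * (~is_cot).float()
    # REINFORCE: maximise reward-weighted log-probability (minimise its negative).
    return -(per_token_reward * logprobs).sum()

# Toy usage: 3 CoT tokens followed by 2 output tokens.
logprobs = torch.randn(5, requires_grad=True)
is_cot = torch.tensor([True, True, True, False, False])
loss = modified_reinforce_loss(logprobs, is_cot, r_task=1.0, r_obfusc=-0.5)
loss.backward()  # gradient implements r_task*g_CoT + (r_task + r_obfusc)*g_out
```

Setting r_obfusc = 0 recovers the standard update, so the only change is that the obfuscation term never flows through the CoT tokens' log-probabilities.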
If this mitigation is used in combination with Mind+Face, then the Mind is not reinforced for producing CoTs which increase the probability of output obfuscation. It is only reinforced for task completion.
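A correspondingly rough sketch of combining the mitigation with Mind+Face, assuming the Mind and Face are separate models (the linear layers below are toy stand-ins, not a claim about the actual setup): only $r_\text{task}$ flows into the Mind's CoT gradients, while the Face receives the full reward on its output tokens.

```python
import torch

# Toy stand-ins for the Mind and Face policies.
mind = torch.nn.Linear(4, 4)
face = torch.nn.Linear(4, 4)
mind_opt = torch.optim.SGD(mind.parameters(), lr=1.0)
face_opt = torch.optim.SGD(face.parameters(), lr=1.0)

# Pretend per-token log-probs of the sampled CoT (from the Mind) and of the
# sampled output given the CoT (from the Face).
x = torch.randn(3, 4)
mind_logprobs = torch.log_softmax(mind(x), dim=-1)[:, 0]   # log p_Mind(CoT)
face_logprobs = torch.log_softmax(face(x), dim=-1)[:, 0]   # log p_Face(output | CoT)

r_task, r_obfusc = 1.0, -0.5  # example reward components

# Mind is reinforced for task completion only; Face gets the full reward.
mind_loss = -(r_task * mind_logprobs.sum())
face_loss = -((r_task + r_obfusc) * face_logprobs.sum())

mind_opt.zero_grad(); face_opt.zero_grad()
(mind_loss + face_loss).backward()
mind_opt.step(); face_opt.step()
```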
Side note: I don’t love our “conditional spillover” and “parametric spillover” terminology, since they sound like different observed effects, whereas they are really two different (proposed) mechanisms underlying the same observed effect. If brevity weren’t an issue, I’d call them “spillover due to later tokens conditioning on earlier ones” and “spillover due to parameter sharing between the models generating CoT and output”, or something.