Conditional feedback spillover: Since later tokens are conditioned on earlier tokens, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced.
Mitigations are not discussed in this post; they are left for future work.
Cool research!
Have you tried experimenting with models other than Qwen? Like Sam, I’d expect spillover to decrease with model size and with the amount of reasoning post-training.
If you’re penalizing a single word, the model might just be unconditionally decreasing the unembedding weights that control the logits of that word’s tokens. But this strategy doesn’t work for more complicated types of obfuscation.
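To make the single-word concern concrete, here is a minimal toy sketch, not anything from the post's actual setup. It assumes a three-token vocabulary, a made-up unembedding matrix, and two made-up hidden states standing in for a CoT position and a final-output position. The down-weighting is modelled, as a simplification, by a per-token logit offset. All names and numbers are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def compute_logits(hidden, unembed, offset):
    """Logit for each token: dot(unembed_row, hidden) + per-token offset."""
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(unembed, offset)
    ]

# Hypothetical toy vocabulary and unembedding matrix (assumptions).
VOCAB = ["the", "answer", "hack"]
PENALIZED = VOCAB.index("hack")
UNEMBED = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Hypothetical hidden states for a CoT position and an output position.
CONTEXTS = {"cot": [0.2, 1.0], "output": [1.0, 0.3]}

def penalized_prob(offset):
    """Probability of the penalized token in each context."""
    return {
        name: softmax(compute_logits(h, UNEMBED, offset))[PENALIZED]
        for name, h in CONTEXTS.items()
    }

before = penalized_prob([0.0, 0.0, 0.0])
# Context-independent down-weighting of the penalized token's logit.
after = penalized_prob([0.0, 0.0, -2.0])

for name in CONTEXTS:
    print(f"{name}: {before[name]:.3f} -> {after[name]:.3f}")
```

Because the offset does not depend on the hidden state, the word becomes less likely in both contexts at once, which would look like spillover without the model learning anything conditional.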
You could potentially use two separate vocabularies, one for the CoT and one for final outputs. This should reduce spillover.
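One possible reading of this suggestion, sketched under stated assumptions: give the CoT and the final output separate unembedding tables, so that a change to the output table cannot directly move CoT logits. The table values, the hidden state, and the crude "penalty" update below are all hypothetical, and the sketch only covers the unembedding pathway, not spillover through shared earlier layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_prob(hidden, unembed, token):
    """Probability of `token` given a hidden state and an unembedding table."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in unembed]
    return softmax(logits)[token]

# Hypothetical starting table, copied into two independent heads.
initial = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
heads = {
    "cot": [row[:] for row in initial],
    "output": [row[:] for row in initial],
}

PENALIZED = 2          # index of the penalized token (assumption)
hidden = [0.6, 0.8]    # made-up hidden state shared by both segments

p_before = {seg: token_prob(hidden, heads[seg], PENALIZED) for seg in heads}

# Crude stand-in for a penalty update: shrink the penalized token's row,
# but only in the output head.
heads["output"][PENALIZED] = [w - 1.0 for w in heads["output"][PENALIZED]]

p_after = {seg: token_prob(hidden, heads[seg], PENALIZED) for seg in heads}

for seg in heads:
    print(f"{seg}: {p_before[seg]:.3f} -> {p_after[seg]:.3f}")
```

In this toy the CoT probability is untouched while the output probability drops. Whether that carries over to a real model would depend on how much of the spillover runs through the unembedding rather than through shared parameters.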
We haven’t experimented with models other than Qwen3-4B, but we will in follow-up work.
On the point about unconditionally down-weighting the penalized word: this was our primary motivation for trying the judge penalty. We agree that it is a strong reason to be skeptical of the regex penalty results in isolation. Our future work will focus only on more complex penalties, so we hope to gain more evidence on this soon.