You could consider having a setup like the ToM task described in our paper, but you modify the output to encourage a brief discussion / recap of the scenario outside of the CoT. You could then penalize the specific names in the output (but not in the CoT) and see if steganography arises in the output only.
The main issue is that mentions of names after the reasoning occurs in the CoT is unlikely to be load bearing, so any penalization encourages just dropping the mention of the names all together instead of a steganographic replacement.
We partnered with McCarthy et al and found some success inducing steganographic CoT via RL and found that penalizing a small set of names within a theory of mind task generalized to producing nick-names or filler tokens for names never seen in the training set (and we recently confirmed that these filler-names can be load bearing).
You could consider having a setup like the ToM task described in our paper, but you modify the output to encourage a brief discussion / recap of the scenario outside of the CoT. You could then penalize the specific names in the output (but not in the CoT) and see if steganography arises in the output only.
The main issue is that mentions of names after the reasoning occurs in the CoT is unlikely to be load bearing, so any penalization encourages just dropping the mention of the names all together instead of a steganographic replacement.