For the ELK contest I was working on something similar, but where the problem is more acute (you’re trying to force a MuZero-style reinforcement learner to compress its hidden state into words).
The methods I proposed to deal with this are:
Cut down the null space by training a language model to rewrite the written words in a different format which preserves the meaning as closely as possible, so that the CoT reasoner is forced to respond to similar sentences in similar ways.
Use multiple models which are trained separately but have to respond to each others’ thoughts, making it more difficult to have a shared steganographic protocol.
Use something like dropout on the chains of thought, so that it can’t hide information in the combination of multiple sentences.
Have some non-linguistic bits which we don’t apply optimization to, but instead focus our intepretability efforts on. The hope is that this soaks up the highest-order bits that aren’t easily expressible in language,so there’s less need to hide them within words. Applying dropout to these is particularly necessary.
Yeah I agree demonstrating it and examining it is very important, especially in chain-of-thought where the existence of hidden information isn’t a certainty.
They’re a bit out of date but the original proposal is ELK Sub—Note-taking in internal rollouts and the multiple model protocol is in Note Taking Without Hidden Messages.
Haven’t done any proper testing but it’s high up on my agenda, would be interested in working out what tests would be best—will elaborate later.