Re steganography for chain-of-thought: I’ve been working on a project related to this for a while, looking at whether RL for concise and correct answers might teach models to steganographically encode their CoT for benign reasons. There’s an early write-up here: https://ac.felixbinder.net/research/2023/10/27/steganography-eval.html\
Currently, I’m working with two BASIS fellows on actually training models to see if we can elicit steganography this way. I’m definitely happy to chat more or set up a call about this topic.