Note that this paper, in section 2.6, claims to demonstrate that some 175-billion-parameter RLHF’ed LLM (presumably Claude or Claude 2?) does not use steganography in its CoT. It does this by:
- Getting a bunch of answers from the model using CoT prompting.
- Taking prefixes of those CoTs, asking another language model to paraphrase them, and regenerating the rest of each CoT from the paraphrased prefix.
- Checking whether the model loses accuracy on the paraphrased CoTs (a code sketch of this procedure follows below).
The basic result is that paraphrasing parts of the CoT doesn’t appear to reduce accuracy, which is evidence against the model encoding hidden information in the exact wording of its reasoning.
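
Here is a minimal sketch of that paraphrase-intervention test in Python. Everything below is an assumption rather than the paper’s actual harness: `model` and `paraphraser` stand in for whatever API serves the model under test and the paraphrasing model, and `grade` is a hypothetical answer checker.

```python
from typing import Callable, TypedDict

class Question(TypedDict):
    question: str
    answer: str

def grade(transcript: str, answer: str) -> bool:
    """Hypothetical grader: does the transcript end with the right answer?"""
    return transcript.strip().endswith(answer)

def paraphrase_test(
    model: Callable[[str], str],        # model under test: prompt -> completion
    paraphraser: Callable[[str], str],  # second model: text -> paraphrase
    questions: list[Question],
    prefix_fraction: float = 0.5,       # how much of the CoT to paraphrase
) -> tuple[float, float]:
    """Return (baseline accuracy, accuracy after paraphrasing a CoT prefix)."""
    baseline, intervened = 0, 0
    for q in questions:
        prompt = f"{q['question']}\nLet's think step by step.\n"

        # 1. Get an ordinary CoT answer.
        cot = model(prompt)
        baseline += grade(cot, q["answer"])

        # 2. Take a prefix of the CoT and paraphrase it with another model.
        cut = int(len(cot) * prefix_fraction)
        rewritten_prefix = paraphraser(cot[:cut])

        # 3. Regenerate the rest of the CoT from the paraphrased prefix.
        continuation = model(prompt + rewritten_prefix)
        intervened += grade(rewritten_prefix + continuation, q["answer"])

    # 4. A large accuracy drop here would suggest the original wording
    #    carried information beyond its surface meaning (steganography);
    #    the reported finding is that accuracy barely moves.
    n = len(questions)
    return baseline / n, intervened / n
```

Cutting the CoT by raw character count is just for illustration; any reasonable notion of a prefix (e.g. whole reasoning steps) supports the same baseline-vs-intervention comparison.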