Reasoning Model CoT faithfulness idea
As Janus and others have mentioned, I get a vibe of unfaithfulness/steganography from comparing DeepSeek r1's reasoning traces to its actual outputs. I mean, not literally steganography, since I don't ascribe any intentionality to this, just opacity arising naturally from the training process.
My recommendation:
It should be possible to ameliorate this with a simple 'rephrasing' step.
Process
Generate a bunch of CoTs on verifiable problems. Collect the ones where the answer is correct.
Have a different LLM rephrase the CoTs while preserving the semantic meaning. In addition to rephrasing, add a bit of inconsequential noise like extra spaces, weird unnecessary punctuation, or translating all or part of the CoT into a different language (e.g. Chinese, Hindi), dialect (e.g. AAVE), or style (e.g. leetspeak). To keep this from resulting in mixed-language CoTs in deployment, add an instruction to the question before fine-tuning: 'Reason in <target_language>'. Then you should be able to specify 'Reason in English' in deployment (or whatever language you prefer).
Check to see which of the rephrased CoTs still allow the target model to answer correctly. Fine-tune on just those.
Repeat
I think this will help at least a little. I also think it’s a good idea to rephrase the questions.
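To make the loop concrete, here is a minimal Python sketch. The helpers (generate_cot, rephrase_cot, answers_correctly, fine_tune) are hypothetical placeholders for whatever inference and fine-tuning stack you use; the point is the control flow: keep only correct CoTs, rephrase them with noise, keep only the rephrasings the target model can still answer from, prefix the question with the 'Reason in <target_language>' instruction, fine-tune, and repeat.

```python
# Minimal sketch of the rephrase-and-filter loop.
# All model-calling helpers below are hypothetical placeholders, not a real API.
import random

# Target languages / dialects / styles for the rephrasing noise.
NOISE_TARGETS = ["English", "Chinese", "Hindi", "AAVE", "leetspeak"]

def generate_cot(model, question: str) -> tuple[str, str]:
    """Placeholder: return (chain_of_thought, final_answer) from the target model."""
    raise NotImplementedError

def rephrase_cot(rephraser, cot: str, target: str) -> str:
    """Placeholder: have a *different* LLM rephrase the CoT, preserving meaning,
    while adding inconsequential noise (extra spaces, odd punctuation,
    partial translation into `target`)."""
    raise NotImplementedError

def answers_correctly(model, question: str, cot: str, gold_answer: str) -> bool:
    """Placeholder: condition the target model on the rephrased CoT and check
    whether it still produces the correct final answer."""
    raise NotImplementedError

def fine_tune(model, dataset: list[dict]):
    """Placeholder: supervised fine-tuning on (question, CoT, answer) examples."""
    raise NotImplementedError

def build_round(model, rephraser, problems: list[dict]) -> list[dict]:
    """One round: generate, keep correct, rephrase, keep rephrasings that still work."""
    dataset = []
    for problem in problems:
        question, gold = problem["question"], problem["answer"]
        cot, answer = generate_cot(model, question)
        if answer != gold:
            continue  # only keep CoTs that led to a correct answer
        target = random.choice(NOISE_TARGETS)
        noisy_cot = rephrase_cot(rephraser, cot, target)
        if not answers_correctly(model, question, noisy_cot, gold):
            continue  # drop rephrasings that break the answer
        # Prefix the question with a reasoning-language instruction so you can
        # specify 'Reason in English' (or another target) at deployment time.
        dataset.append({
            "question": f"Reason in {target}.\n{question}",
            "cot": noisy_cot,
            "answer": gold,
        })
    return dataset

def train_loop(model, rephraser, problems: list[dict], rounds: int = 3):
    for _ in range(rounds):  # "Repeat"
        dataset = build_round(model, rephraser, problems)
        fine_tune(model, dataset)
    return model
```

The same build_round step could also rephrase the questions themselves before generation, per the note above.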
Related:
https://x.com/mkurman88/status/1885042970447015941
https://x.com/WenhuChen/status/1885060597500567562