Reasoning Model CoT faithfulness idea
As Janus and others have mentioned, I get a vibe of unfaithfulness/steganography from comparing DeepSeek r1's reasoning traces to its actual outputs. I mean, not literally steganography, since I don't ascribe any intentionality to this, just opacity arising naturally from the training process.
My recommendation:
It should be possible to ameliorate this with a simple 'rephrasing' step.
Process
Generate a bunch of CoTs on verifiable problems. Collect the ones where the answer is correct.
Have a different LLM rephrase the CoTs while preserving the semantic meaning. In addition to rephrasing, add a bit of inconsequential noise like extra spaces, weird unnecessary punctuation, or translating all or part of the CoT into a different language (e.g. Chinese, Hindi), dialect (e.g. AAVE), or style (e.g. leetspeak). To keep this from resulting in mixed-language CoTs in deployment, add an instruction to the question before fine-tuning: 'Reason in <target_language>'. Then you should be able to specify 'Reason in English' in deployment (or whatever language you prefer).
Check to see which of the rephrased CoTs still allow the target model to answer correctly. Fine-tune on just those.
Repeat
I think this will help at least a little. I also think it’s a good idea to rephrase the questions.
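To make the loop concrete, here is a minimal Python sketch. The helpers (generate_cot, rephrase_cot, answers_correctly, fine_tune) are hypothetical placeholders for whatever inference and fine-tuning stack you use; the point is the control flow: keep only correct CoTs, rephrase them with noise, keep only the rephrasings the target model can still answer from, prefix the question with the 'Reason in <target_language>' instruction, fine-tune, and repeat.

```python
# Minimal sketch of the rephrase-and-filter loop.
# All model-calling helpers below are hypothetical placeholders, not a real API.
import random

# Target languages / dialects / styles for the rephrasing noise.
NOISE_TARGETS = ["English", "Chinese", "Hindi", "AAVE", "leetspeak"]

def generate_cot(model, question: str) -> tuple[str, str]:
    """Placeholder: return (chain_of_thought, final_answer) from the target model."""
    raise NotImplementedError

def rephrase_cot(rephraser, cot: str, target: str) -> str:
    """Placeholder: have a *different* LLM rephrase the CoT, preserving meaning,
    while adding inconsequential noise (extra spaces, odd punctuation,
    partial translation into `target`)."""
    raise NotImplementedError

def answers_correctly(model, question: str, cot: str, gold_answer: str) -> bool:
    """Placeholder: condition the target model on the rephrased CoT and check
    whether it still produces the correct final answer."""
    raise NotImplementedError

def fine_tune(model, dataset: list[dict]):
    """Placeholder: supervised fine-tuning on (question, CoT, answer) examples."""
    raise NotImplementedError

def build_round(model, rephraser, problems: list[dict]) -> list[dict]:
    """One round: generate, keep correct, rephrase, keep rephrasings that still work."""
    dataset = []
    for problem in problems:
        question, gold = problem["question"], problem["answer"]
        cot, answer = generate_cot(model, question)
        if answer != gold:
            continue  # only keep CoTs that led to a correct answer
        target = random.choice(NOISE_TARGETS)
        noisy_cot = rephrase_cot(rephraser, cot, target)
        if not answers_correctly(model, question, noisy_cot, gold):
            continue  # drop rephrasings that break the answer
        # Prefix the question with a reasoning-language instruction so you can
        # specify 'Reason in English' (or another target) at deployment time.
        dataset.append({
            "question": f"Reason in {target}.\n{question}",
            "cot": noisy_cot,
            "answer": gold,
        })
    return dataset

def train_loop(model, rephraser, problems: list[dict], rounds: int = 3):
    for _ in range(rounds):  # "Repeat"
        dataset = build_round(model, rephraser, problems)
        fine_tune(model, dataset)
    return model
```

The same build_round step could also rephrase the questions themselves before generation, per the note above.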
Related:
https://x.com/mkurman88/status/1885042970447015941
https://x.com/WenhuChen/status/1885060597500567562