You trained the model to produce an unfaithful CoT? That seems highly relevant to interpreting the results! Even so, the results remain informative, since we worry about training procedures that induce unfaithful CoT in subtle as well as obvious ways.
Yes, the fine-tuned model is denoted with an “SFT” for Supervised Fine-tuning, to distinguish it in the results from the non-fine-tuned Qwen-14B-r1-distill. Many models were tested as the malicious model, including r1, various r1 distills, Claude Sonnet 3.7, and Grok 3 mini.