Nathan Helm-Burger comments on Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring

Nathan Helm-Burger 6 Jun 2025 16:52 UTC
2 points
Yes, the fine-tuned model is labeled “SFT” (supervised fine-tuning) to distinguish it in the results from the non-fine-tuned Qwen-14B-r1-distill. Many models were tested as the malicious model, including r1, various r1 distills, Claude Sonnet 3.7, and Grok 3 mini.