You trained the model to produce an unfaithful CoT? That seems highly relevant to interpreting the results! Even so, the results remain informative, since we worry about training procedures that induce unfaithful CoT in subtle as well as obvious ways.
Yes, the fine-tuned model is denoted with an “SFT” for Supervised Fine-tuning, to distinguish it in the results from the non-fine-tuned Qwen-14B-r1-distill. Many models were tested as the malicious model, including r1, various r1 distills, Claude Sonnet 3.7, and Grok 3 mini.