Did you train the monitors at all? Did you train the malicious model at all? Seems like you didn’t. My guess is that this problem would go away if you could fine-tune the monitors against an un-fine-tuned malicious model. (Which I think is appropriate, because hopefully in real life the companies won’t be training their untrusted AIs against their own monitoring system! That would be obviously stupid! But training the monitoring system on examples of bad behavior, that’s obviously good.)
I did train the malicious model, in two different ways, and reported on the more successful version (supervised fine-tuning to be stealthier in its CoT).
I agree that training the monitor would help, both with its raw detection capability and with its resistance to jailbreaks from the malicious model. I think the jailbreak problem could plausibly be completely solved this way, although perhaps not easily. (If fine-tuning could easily remove vulnerability to jailbreaks, then why are the leading frontier models still vulnerable?) This seems like an important direction for future work.
The raw detection ability of the monitor to catch increasingly smart and stealthy malicious models? I don’t think this does get entirely solved by fine-tuning a weaker monitor, unless you also pay the additional “safety tax” of restricting the complexity of the tasks being done by the malicious model, in order to prevent it from working on tasks that the monitor gets confused by.
Sorry I miscommunicated—when I said “this problem would go away” I didn’t mean “the problem of AIs being unfaithful in their CoT and fooling monitors.” I don’t expect that problem to go away nearly so easily!
All I meant was, the problem of the monitor which can see the CoT performing worse than the monitor which can only see the actions. The monitor which can see both the actions and the CoT, if it was trained, would eventually learn to just ignore the CoT and focus on the actions, and thereby perform at least as well as the monitor which can only see the actions.
That makes more sense! Yes, I think that’s possibly true… Unless the malicious model gets good at jailbreaking, in which case even fine-tuning the monitor to mostly ignore the CoT would be risky.
And if the monitor completely ignores the CoT, then it loses out on the benefits of seeing the CoT. The CoT can be importantly helpful for detecting subtle sabotage by a stealthy malicious model. For example, the action-only monitor had a 0% chance of catching the fine-tuned malicious model on subtle tasks; with access to the CoT, the catch rate rose to around 3%. A small but important difference.
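To make that comparison concrete, here is a minimal sketch of how a catch rate like the 0% vs. ~3% figures above could be computed. The `Episode` structure, `monitor_score` function, and 0.5 audit threshold are hypothetical illustrations, not the actual experimental setup:

```python
# Minimal sketch of a catch-rate comparison between an action-only monitor and
# a monitor that also sees the CoT. All names and the threshold are assumptions
# for illustration, not the experiment's actual code or values.
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Episode:
    actions: str            # transcript of the agent's actions (tool calls, code edits)
    cot: str                # the agent's chain of thought
    is_sabotage: bool       # ground-truth label for this episode


def catch_rate(
    episodes: Sequence[Episode],
    monitor_score: Callable[[str, Optional[str]], float],  # hypothetical: returns suspicion in [0, 1]
    use_cot: bool,
    threshold: float = 0.5,
) -> float:
    """Fraction of sabotage episodes the monitor flags as suspicious."""
    sabotage = [ep for ep in episodes if ep.is_sabotage]
    caught = sum(
        monitor_score(ep.actions, ep.cot if use_cot else None) >= threshold
        for ep in sabotage
    )
    return caught / len(sabotage) if sabotage else 0.0


# Hypothetical usage, comparing the two monitor configurations on the same episodes:
# print(catch_rate(episodes, monitor_score, use_cot=False))  # action-only monitor
# print(catch_rate(episodes, monitor_score, use_cot=True))   # action + CoT monitor
```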
You trained the model to produce an unfaithful CoT? That seems highly relevant to interpreting the results! The results are still very relevant, since we worry about training procedures that induce unfaithful CoT in subtle as well as obvious ways.
Yes, the fine-tuned model is denoted with an “SFT” for Supervised Fine-tuning, to distinguish it in the results from the non-fine-tuned Qwen-14B-r1-distill. Many models were tested as the malicious model, including r1, various r1 distills, Claude Sonnet 3.7, and Grok 3 mini.