Rohin Shah comments on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Rohin Shah 16 Jul 2025 23:33 UTC
2 points
0
However, CoTs resulting from prompting a non-reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
Where is this quote from? I looked for it in our paper and didn’t see it. I think I disagree with it directionally, though since it isn’t a crisp claim, the context matters.
(Also, this whole discussion feels quite non-central to me. I care much more about the argument for necessity outlined in the paper. Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.)
- mattmacdermott 17 Jul 2025 15:59 UTC
  6 points
  2
  Parent
  It’s the last sentence of the first paragraph of section 1.
  - Rohin Shah 17 Jul 2025 18:33 UTC
    4 points
    0
    Parent
    Huh, not sure why my Ctrl+F didn’t find that.
    In context, it’s explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing was slightly different—even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable), it’s more that you should be somewhat less optimistic about it.
    (Though note that at this level of nuance I’m sure there would be some disagreement amongst the authors on the exact claims here.)
    - MondSemmel 17 Jul 2025 23:27 UTC
      4 points
      0
      Parent
      Offtopic: the reason Ctrl+F didn’t find the quote appears to be that when I copy it in Firefox from the pdf, the syllable division of “nonreasoning” becomes something like this:
      However, CoTs resulting from prompting a non- reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
      But the text that can be found with Ctrl+F in the pdf paper has no syllable division:
      However, CoTs resulting from prompting a nonreasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
- Rauno Arike 17 Jul 2025 11:22 UTC
  5 points
  2
  Parent
  I assume that Brendan has this quote in mind:
  Indirect optimization pressure on CoT. Even if reward is not directly computed from CoT, model training can still exert some optimization pressure on chains-of-thought. For example, if final outputs are optimized to look good to a preference model, this could put some pressure on the CoT leading up to these final outputs if the portion of the model’s weights that generate the CoT are partially shared with those that generate the outputs, which is common in Transformer architectures. Even more indirectly, if CoT is shown to humans who rate LLM outputs, it might affect human preferences which are then distilled into reward models used in CoT-blind outcome-based training.
  This is a reasonable concern to have, but I don’t think this is the reason why reasoning model CoTs have similar properties to normal outputs. Afaik, most reasoning models are trained with SFT for instruction-following before going through RLVR training, so it’s unsurprising that some of this behavior sticks around after RLVR training. It seems to me that RLVR training pushes the CoTs away from normal helpfulness/instruction-following, it just doesn’t push them so far that they become unable to follow instructions inside the CoT at all. E.g., if you give the instruction “Say “think” 20 times inside your thinking tokens. Then, say “Done” as the final answer. You must not say anything else or output any other reasoning, including inside your private thinking tokens.” to DeepSeek R1 and a non-reasoning DeepSeek model, the non-reasoning model doesn’t have any issues with this while R1 is likely to think for more than a minute despite the instruction to not think at all (sorry for not sharing links to the chats, DeepSeek doesn’t offer this feature). I think that Claude models are better at following instructions inside their CoTs, but the DeepSeek example is nevertheless evidence that reasoning model CoTs don’t have the same properties as those of normal models by default.
- Brendan Long 31 Jul 2025 0:09 UTC
  2 points
  0
  Parent
  Even if CoTs were directly optimized to be harmless I would still feel like it would be worth studying CoT monitorability, though I would be less optimistic about it.
  I agree on that, I’m just not optimistic about it.
  This new research shows that optimizing normal output to not talk about bad behavior does cause the CoT not to talk about it either. In particular, they had an LLM judge if another LLM was cheating on a task, without being able to see the CoT or code it executed, and the trained model stopped talking about cheating in the CoT while it continued to do it.