Rohin Shah comments on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Rohin Shah 17 Jul 2025 18:33 UTC
4 points
Huh, not sure why my Ctrl+F didn’t find that.
In context, it’s explaining a difference between non-reasoning and reasoning models, and I do endorse the argument for that difference. I do wish the phrasing were slightly different: even for non-reasoning models it seems plausible that you could trust their CoT (to be monitorable); it’s more that you should be somewhat less optimistic about it.
(Though note that at this level of nuance I’m sure there would be some disagreement amongst the authors on the exact claims here.)
- MondSemmel 17 Jul 2025 23:27 UTC
  4 points
  Offtopic: the reason Ctrl+F didn’t find the quote appears to be that when I copy it from the PDF in Firefox, the word “nonreasoning”, which is hyphenated across a line break, comes out like this:
  However, CoTs resulting from prompting a non- reasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
  But the text that Ctrl+F actually matches in the PDF has no hyphen or space in that word:
  However, CoTs resulting from prompting a nonreasoning language model are subject to the same selection pressures to look helpful and harmless as any other model output, limiting their trustworthiness.
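
For anyone hitting the same mismatch, one possible workaround is to rejoin such split words in the copied text before searching. The following is a minimal sketch, assuming the artifact always has the shape "letter, hyphen, whitespace, lowercase letter" as in the quote above; the function name `join_line_break_hyphens` is hypothetical and not part of any PDF tool.

```python
import re


def join_line_break_hyphens(text: str) -> str:
    """Rejoin words that a PDF copy split as 'non- reasoning'.

    Assumption: a hyphen followed by whitespace, sitting between a letter
    and a lowercase letter, is a line-break artifact and not a real hyphen.
    """
    # Drop the hyphen and the whitespace after it; keep the letters on both sides.
    return re.sub(r"(?<=[A-Za-z])-\s+(?=[a-z])", "", text)


copied = "prompting a non- reasoning language model"
print(join_line_break_hyphens(copied))
```

The heuristic is deliberately crude: ordinary hyphenated words such as "well-known" are left alone because no whitespace follows the hyphen, but a genuine suspended hyphen ("pre- and post-training") would be joined incorrectly, so the cleaned string is only suitable as a search query, not as a corrected quote.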