Oscar comments on A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

Oscar 17 Nov 2025 18:14 UTC
2 points
0
I only read the LW version, not the paper, but this seems like important work to me and I’m glad you’re doing it! What did you make of these two recent papers?
- https://arxiv.org/abs/2510.23966
- https://www.arxiv.org/abs/2510.27378
I have done some work on the policy side of this (whether we should/how we could enforce CoT monitorability on AI developers, or at least gain transparency into how monitorable SOTA models are). Lmk if ever it would be useful to talk about that, otherwise I will be keen to see where this line of work ends up!
- Wuschel Schulz 28 Nov 2025 14:07 UTC
  2 points
  0
  Thank you for the comment.
  
  About the papers you mentioned that propose metrics for CoT monitorability:
  
  I think these metrics are useful to keep an eye on, but they do not measure what I care most about in CoT monitorability.
  
  They cannot detect steganography, so a reasoning model might do encoded reasoning and still be rated as ‘monitorable’.
  
  More importantly: just because a reasoning model has a low monitorability score on those metrics, that does not mean we cannot usefully deploy CoT monitoring.
  A model that confabulates nonsense reasons for its answers and fails to mention some crucial consideration behind an answer would get a low score on those metrics. But as long as the model still needs to do monitorable reasoning before it can cause a catastrophe, we can still rely on CoT monitoring in safety cases.
  
  On the policy side:
  
  I have seen your Policy Options report and cited it in my paper. I largely agree with your analysis and have little more to add on the policy side.
  
  I think “CoT-assisted adversarial evaluation” is a good idea that might make for a useful evaluation. If you ever want someone to implement it as an evaluation, I might be up for doing that.