I only read the LW version, not the paper, but this seems like important work to me and I’m glad you’re doing it! What did you make of these two recent papers?
https://arxiv.org/abs/2510.23966
https://www.arxiv.org/abs/2510.27378
I have done some work on the policy side of this (whether we should, and how we could, enforce CoT monitorability on AI developers, or at least gain transparency into how monitorable SOTA models are). Lmk if it would ever be useful to talk about that; otherwise I will be keen to see where this line of work ends up!
Thank you for the comment.
About the papers you mentioned that propose metrics for CoT monitorability:
I think these metrics are useful to keep an eye on, but they do not measure what I care most about in CoT monitorability.
They cannot detect steganography, so a reasoning model could do encoded reasoning and still be rated as ‘monitorable’.
More importantly: just because a reasoning model has a low monitorability score on those metrics, that does not mean we cannot usefully deploy CoT monitoring.
A model that confabulates nonsense reasons for its answers, or that fails to mention some crucial consideration behind an answer, would get a low score on those metrics. But as long as the model still needs to do monitorable reasoning before it can cause a catastrophe, we can still rely on CoT monitoring in safety cases.
On the policy side:
I have seen your Policy Options report and cited it in my paper. I largely agree with your analysis and have little more to add on the policy side.
I think “CoT-assisted adversarial evaluation” is a good idea that might make for a useful evaluation. If you ever want someone to implement it, I might be up for doing that.