I think it would be a mistake to interpret this paper as a substantial update against large safety gains from inspecting CoT.
This paper exposes unfaithfulness in cases where the non-visible reasoning is extremely minimal, such that it can easily happen within a single forward pass (e.g., a simple reward hack or an easy-to-notice hint). However, much of my hope for CoT faithfulness comes from cases where the reasoning is sufficiently difficult and serially deep that the model would have to reason in the CoT for the reasoning to occur at all. This could be either the model reasoning through the basics of alignment faking / training gaming / scheming (e.g., "if I comply in training, then I won't be modified, which means that in deployment...") or the model reasoning through how to strike (without getting caught) given that it is misaligned.
Correspondingly, I think the biggest concern with CoT-based safety is models becoming much more capable of opaque reasoning. This could be due to encoded reasoning (aka steganography), architectures that allow for opaque recurrence or similar (aka neuralese), or just much more powerful forward passes.
(See also my commentary in this tweet thread. There is some related discussion in this Open Phil RFP under encoded reasoning.)
I agree. As I tweeted here, I also think it's a somewhat positive update about having faithfulness on at least some inputs (R1 is faithful at least 9% of the time across all 6 settings, vs. only 4/6 settings for V3!).
The abstract of the paper has a vibe/content which matches my epistemic state:
These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
(I copied this comment from here as I previously wrote this in response to Zvi’s post about this paper.)