Daniel Kokotajlo comments on Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Daniel Kokotajlo 16 Jul 2025 15:57 UTC
LW: 5 AF: 5
My memory agrees with Rohin. Some safety people were trying to hold the line, but most weren’t; I don’t think it reached as much consensus as CoT monitorability just did.
I think it’s good to defend lines even if they get blown past, unless you have a better strategy that this trades off against. “Defense in depth,” “fighting retreat,” etc.
(Unimportant: My position has basically been that process-based reinforcement and outcome-based reinforcement are interesting to explore separately but that mixing them together would be bad, and also, process-based reinforcement will not be competitive capabilities-wise.)