Davidmanheim comments on Interpretability Will Not Reliably Find Deceptive AI

Davidmanheim 5 May 2025 6:59 UTC
LW: 2 AF: 1
CoT monitoring seems like a great control method when available

As I posted in a top-level comment, I’m not convinced that even success would be a good outcome. I think that even if we get this working 99.999% reliably, we still end up delegating parts of the oversight in ways that have other alignment failure modes, such as via hyper-introspection.