That makes sense. But I don’t expect defenders to reliably take fully appropriate countermeasures, such as undeploying the model and rethinking the whole alignment approach, if they catch a model misbehaving even once.
Buck seems to agree: "Would catching your AIs trying to escape convince AI developers to slow down or undeploy?"
I still think research like this is highly valuable: it gives us a better sense of what monitoring will and won’t catch. Having a plan and realistic expectations for how the whole process would play out, including what happens if we do catch an egregiously misaligned model, is also necessary.