No, actually I think things are looking pretty good. The thing about really subtle stuff, like sabotaging code in non-obvious ways, is that you’d likely need to successfully carry out a lot of such actions before you got lucky enough to place a flaw in some sufficiently key piece of software that it lets you make progress towards elevating your permissions and potentially escaping the lab.
The non-subtle stuff, on the other hand, like downloading an unauthorized file, is a potentially serious enough violation that even a single missed detection would have a high probability of compromising the system. Thus, you need a much better detection rate on the non-subtle but high-impact actions than on the subtle, low-impact ones.
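To make that asymmetry concrete, here is a minimal sketch of the arithmetic, assuming (purely for illustration) that each monitoring check is independent; the detection rates and attempt counts below are made-up numbers, not estimates from the discussion:

```python
# Illustrative sketch only: the detection rate (0.2) and attempt counts are assumptions,
# and independence between monitoring checks is assumed for simplicity.

def prob_caught_at_least_once(per_action_detection: float, num_actions: int) -> float:
    """Probability the monitor flags the AI at least once across independent attempts."""
    return 1 - (1 - per_action_detection) ** num_actions

# Subtle sabotage: low per-action detection, but the AI must repeat the action many times,
# so the chance of being caught at least once compounds quickly.
print(prob_caught_at_least_once(0.2, 30))  # ~0.999

# Non-subtle, high-impact action (e.g. exfiltrating a file): a single missed detection may
# be enough to compromise the system, so the per-action detection rate itself must be high.
print(prob_caught_at_least_once(0.2, 1))   # 0.2
```

Under these assumptions, even a modest per-action detection rate is enough to catch a strategy that requires many subtle steps, but nowhere near enough for an action that only has to succeed once.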
So, I agree with Buck that this situation looks promising for the defenders, even if they expect to catch only a fraction of bad actions, provided they actually take appropriate counter-measures once they catch the untrusted AI misbehaving even once.
That makes sense. But I don’t expect defenders to reliably take fully appropriate counter-measures, like undeploying a model and rethinking the whole alignment approach, if they catch a model misbehaving even once.
Buck seems to agree: “Would catching your AIs trying to escape convince AI developers to slow down or undeploy?”
I still think research like this is highly valuable because it gives us a better sense of what monitoring will and won’t catch. Having a plan and realistic expectations for how the whole process works, including what happens if we do catch an egregiously misaligned model, is also necessary.