That makes sense. But I don’t expect defenders to reliably take fully appropriate countermeasures, such as undeploying the model and rethinking the whole alignment approach, if they catch a model misbehaving even once.
Buck seems to agree: "Would catching your AIs trying to escape convince AI developers to slow down or undeploy?"
I still think research like this is highly valuable: it gives us a better sense of what monitoring will and won’t catch. Having a plan and realistic expectations for how the whole process would play out, including what happens if we do catch an egregiously misaligned model, is also necessary.