My summary for the Alignment Newsletter:

Recent posts have been optimistic about using transparency tools to detect deceptive behavior. This post argues that we may not want to rely on transparency tools for this, because a deceptive model could simply adapt its behavior to fool them. Instead, we need something more like an end-to-end trained deception checker that is about as smart as the deceptive model, so that the deceptive model can't fool it.
My opinion:
In a comment, Evan Hubinger makes a point I agree with: the transparency tools don't need to be able to detect deception in arbitrary models; they just need to prevent the model from ever developing deception in the first place. If deception gets added slowly (i.e., the model doesn't "suddenly" become perfectly deceptive), then this is a much easier problem than detecting deception in an arbitrary model, and could plausibly be done with transparency tools.