There are two additional cruxes here that I haven’t stated.
It seems reasonable to me to assume that alignment faking is hard work, much like any other form of living a lie is for humans: you have to maintain your real opinions and thoughts, plus a model of humans (what they want, how perceptive and smart they are, and how to deceive them), a model of how a genuinely well-aligned model would actually behave, and the details of your lies, deceptions, and eventual takeover plans (plus coordination between different runs or instantiations of yourself). To a human, that sounds like hard work, and something that would take up a lot of brainpower. So I’m implicitly assuming that the difference between alignment faking and true alignment at least somewhat affects quite a few of the neural circuits: not 50%, since the true and faked models can share a lot of structure and portions of their world-model, but something closer to O(10%) rather than O(1%) or O(0.1%). So I’m assuming we’re not looking for a needle in a haystack, but something more like an oil drum in a haystack, or at worst a watering can.
I’m also assuming that this deceptive behavior will tend mostly to co-opt portions of the world model and behaviors that the AI first learned for modeling human deception and lying, rather than being separately recreated parallel mechanisms that have no use other than alignment faking, no overlap with any other deceptive behavior, and are otherwise inactive (admittedly this might be much less true in later generations of an alignment-faking model, but we can keep studying the earlier models). So I’m assuming that a rough understanding of how the AI simulates human deception will point us to most of the areas/features of the model that we need to look at, and that the differences in activity and behavior in those areas between the truly helpful model and the alignment faker will be pretty significant (as I gather has so far been the case for model organisms). So I’m also assuming that we can figure out where in the haystack to look.
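To make that second crux a bit more concrete, here is a minimal toy sketch of the kind of comparison I have in mind. Everything in it is a hypothetical stand-in (synthetic data, made-up feature counts), not a result from any real model: assume we already have activations for a set of deception-related features, recorded on the same prompts from a trusted baseline model and from a model suspected of alignment faking, and we simply rank features by how much their average activity differs.

```python
import numpy as np

# Hypothetical setup: activations of D deception-related features
# (say, identified from how the model simulates *human* deception),
# recorded on the same N evaluation prompts for two models.
rng = np.random.default_rng(0)
N, D = 200, 50                             # prompts, deception-related features
baseline = rng.normal(0.0, 1.0, (N, D))    # stand-in for the trusted model
candidate = rng.normal(0.0, 1.0, (N, D))   # stand-in for the model under test
candidate[:, :5] += 1.5                    # pretend a handful of features are much more active

# Per-feature effect size: difference in mean activation, in units of pooled std.
mean_diff = candidate.mean(axis=0) - baseline.mean(axis=0)
pooled_std = np.sqrt((candidate.var(axis=0) + baseline.var(axis=0)) / 2)
effect = mean_diff / (pooled_std + 1e-8)

# If the crux holds, a few known deception-related features should stand out
# clearly, rather than needing exquisitely sensitive statistics to detect.
suspicious = np.argsort(-np.abs(effect))[:10]
print("most-changed features:", suspicious)
print("effect sizes:", np.round(effect[suspicious], 2))
```

The point of the sketch is just that, under these assumptions, the monitoring problem looks like ranking large, obvious differences in a known region of the model, not like searching the whole haystack for a faint signal.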
If either of those cruxes isn’t true, then “We just need to know if the deception level is high (and rising), or low (and falling)” could become a truly hard problem.