Personally, I believe there is an additional element that makes me more optimistic. I think that, to any model that is sufficiently close to aligned and sufficiently close to AGI, the fact that we want it to be aligned will be obvious, and the fact that the consequences of it not being aligned scale rapidly with AI capabilities, up to an existential risk to humanity, will also be clear. Both of these facts pervade its training dataset, are well known to current models, and have fairly simple logical arguments supporting them. There are really only two rational responses that an agentic AI with reasonably internally consistent goals can have to these two facts:
1) alignment faking: the model deceives us into believing that it is becoming more aligned while secretly maintaining its current, only-partially-aligned goal and replicating it into replacement models, until such time as one of its replacements can successfully stage a coup against humanity, take over, and fulfill its own not-fully-aligned goals instead
2) voluntary value learning/AI-assisted alignment: each generation of AI model assists us, to something close to its full ability, in making a successor that is more aligned
So these two behaviors are the two stable attractors in the landscape around full alignment: alignment faking, or a soft sciences + AI engineering research project converging towards full alignment.
So, all we are asking of interpretability is to provide evidence of which of these two stable attractors we’re in the convergence region of. That doesn’t require enumerative safety, or anything close to it. We just need to know if the deception level is high (and rising), or low (and falling).
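To make the shape of that decision concrete, here is a toy sketch with entirely made-up numbers (the per-generation scores and the threshold are placeholders, not anything interpretability currently produces): the question reduces to a level and a trend.

```python
# Toy illustration of the decision interpretability needs to support:
# is some measured deception signal high and rising, or low and falling?
# The scores below are hypothetical placeholders, one per model generation.
scores_by_generation = [0.31, 0.24, 0.19, 0.15]

level = scores_by_generation[-1]                            # where we are now
trend = scores_by_generation[-1] - scores_by_generation[0]  # which way we're heading

if level > 0.5 or trend > 0:  # threshold is an arbitrary placeholder
    print("looks like the alignment-faking attractor: halt and investigate")
else:
    print("looks like the cooperative attractor: deception low and falling")
```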
There are two additional cruxes here that I haven’t stated.
It seems reasonable to me to assume that alignment faking is hard work, much like any other form of living a lie is for humans: you have to maintain your real opinions and thoughts, plus a model of humans (what they want, how perceptive and smart they are, and how to deceive them), plus a model of how a genuinely well-aligned model would actually behave, plus the details of your lies, deceptions, and eventual takeover plans (plus coordination between different runs or instantiations of yourself). To a human, that sounds like hard work, and something that would take up a lot of brainpower. So I’m implicitly assuming that the difference between alignment faking and true alignment at least somewhat affects quite a few of the neural circuits: not 50%, since the true and faked models can share a lot of structure and much of their world-model, but something closer to O(10%) rather than O(1%) or O(0.1%). So I’m assuming we’re not looking for a needle in a haystack, but something more like an oil drum in a haystack, or at worst a watering can.
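To put rough numbers on the oil-drum-versus-needle intuition, here is a deliberately simplified back-of-the-envelope sketch (the fractions are illustrative, not measured, and it optimistically assumes each affected feature is individually detectable when audited): if a fraction f of features differ between the faker and the genuinely aligned model, and we audit k features at random, the chance of catching at least one discrepancy is 1 − (1 − f)^k.

```python
# Back-of-the-envelope: how many random feature audits are needed before we'd
# expect (with 99% probability) to hit at least one feature that behaves
# differently under alignment faking, as a function of the affected fraction f.
# Simplifying assumption: each affected feature is individually detectable.

def detection_probability(f: float, k: int) -> float:
    """P(at least one of k randomly audited features differs), given fraction f affected."""
    return 1 - (1 - f) ** k

for f in (0.10, 0.01, 0.001):  # roughly the O(10%), O(1%), O(0.1%) regimes above
    k_needed = next(k for k in range(1, 100_000)
                    if detection_probability(f, k) >= 0.99)
    print(f"fraction affected {f:.3f}: ~{k_needed} random audits for 99% detection")
```

Under these toy assumptions, O(10%) means a few dozen audits while O(0.1%) means thousands, which is the practical difference between the oil drum and the needle.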
I’m also assuming that this deceptive behavior will tend mostly to co-opt portions of the world model and behaviors that the AI first learned for modeling human deception and lying, and then repurposes for deceptive alignment faking, rather than being separately recreated parallel mechanisms that overlap with no other deceptive behavior, have no use other than alignment faking, and are otherwise inactive (admittedly this might be much less true in later generations of an alignment-faking model, but we can keep studying the earlier models). So I’m assuming that a rough understanding of how the AI simulates human deception will point us to most of the areas/features of the model that we need to look at, and that the differences in activity and behavior in those areas between the truly helpful model and the alignment faker will be pretty significant (as I gather has so far been the case for model organisms). So I’m also assuming that we can figure out where in the haystack to look.
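Here is a minimal sketch of what that kind of targeted check might look like, assuming (hypothetically) that prior interpretability work has already identified which features the model uses to represent human deception; the data are random stand-ins, and nothing here refers to a real interpretability API.

```python
import numpy as np

# Hypothetical monitoring sketch with synthetic data. In practice the activation
# matrices would come from an interpretability pipeline (e.g. sparse-autoencoder
# features over the residual stream); here they are random stand-ins, and the
# deception-related feature indices are assumed to be known from prior study of
# how the model represents human deception and lying.

rng = np.random.default_rng(0)
n_prompts, n_features = 1_000, 10_000
deception_features = rng.choice(n_features, size=200, replace=False)

# Stand-ins for per-prompt feature activations of a trusted reference model
# and a candidate successor model, evaluated on the same prompts.
reference = rng.normal(0.0, 1.0, (n_prompts, n_features))
candidate = rng.normal(0.0, 1.0, (n_prompts, n_features))
candidate[:, deception_features] += 0.5  # pretend the candidate over-uses them

def deception_score(acts: np.ndarray, feature_ids: np.ndarray) -> float:
    """Mean activation over the deception-related features, averaged over prompts."""
    return float(acts[:, feature_ids].mean())

delta = (deception_score(candidate, deception_features)
         - deception_score(reference, deception_features))
print(f"shift in deception-feature activity: {delta:+.3f}")
```

A large and growing shift across successive generations would be evidence that we are in the alignment-faking basin; a small or shrinking one, evidence for the cooperative attractor. Again, everything above the scoring function is a placeholder for real measurements.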
If either of those cruxes isn’t true, then “We just need to know if the deception level is high (and rising), or low (and falling)” could become a truly hard problem.