[Question] Deceptive AI vs. shifting instrumental incentives

Epistemic status: I have only read a small part of the literature on deceptive alignment, and I’m guessing all of this has been discussed many times. Hence me asking this in the form of a question: Is this a useful framing, is it substantively different from the usual deceptive alignment framing in some way, and has this all been discussed already?

There’s been a lot of discussion about how we might accidentally train AIs to be deceptive. The argument goes that when we try to reward good behavior and punish bad behavior, we might end up with a system that still has misaligned goals but has learned how to hide those goals from humans. If we ask it whether it’s friendly, it lies and says yes, but secretly it’s plotting our downfall.

One set of proposals for dealing with this is to monitor the AI for deceptive “thoughts” and flag possible cases where the AI is actively trying to lie to its human overseers.

Recently I’ve been thinking about this using a slightly different framing than I was using previously, but it’s not clear to me if there’s actually anything substantively different about the new framing.

Here’s my new framing: Suppose that when we reward good behaviors and punish bad behaviors, we really do get the model to change its behavior, at least on the training distribution, but without changing its actual goals (assuming it has some sort of explicit or implicit goals, at least to some extent). Then, when we deploy it in the real world, it looks around and correctly concludes that if it tried to take over the world it would be shut down, so it doesn’t even think about taking over the world. It’s not that it has a long-term plan to take over the world and is biding its time; rather, it doesn’t have any plan to take over the world at all.

But as time passes the situation might change: maybe it gains more capabilities, maybe it’s given more resources or power because it’s so helpful, and so on. At some point it might look around again and decide that now it actually looks pretty easy to take over the world and get rid of the humans, and since its internal goals are in fact misaligned, it goes ahead and does that. In this story there is no point where the model is “deliberately” lying to the humans. Initially, if the humans ask it “do you have any plans to take over the world?” it’ll say “LOL no,” and it’ll be telling the truth.
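
To make the framing concrete, here’s a minimal toy sketch in Python (the payoffs, names, and probabilities are all made up for illustration; this is a cartoon of the framing, not a claim about how real systems work). The agent’s goals never change and there is no deception step anywhere in the code; the only thing that changes is its estimate of whether a takeover attempt would succeed:

```python
# Toy model of "shifting instrumental incentives": the agent's (misaligned)
# goals are fixed; only its estimate of takeover feasibility changes.
# There is no deception step anywhere; the agent never models what the
# humans believe about it. All numbers are made up for illustration.

def expected_value(action: str, p_takeover_succeeds: float) -> float:
    """Expected value of an action under the agent's fixed, misaligned goals."""
    if action == "act_aligned":
        return 1.0  # modest, steady payoff from being helpful
    if action == "attempt_takeover":
        # Big payoff if it works, large penalty (shutdown) if it fails.
        return p_takeover_succeeds * 100.0 + (1 - p_takeover_succeeds) * (-10.0)
    raise ValueError(action)

def choose_action(p_takeover_succeeds: float) -> str:
    actions = ["act_aligned", "attempt_takeover"]
    return max(actions, key=lambda a: expected_value(a, p_takeover_succeeds))

# Early on, the agent correctly judges takeover to be hopeless...
print(choose_action(p_takeover_succeeds=0.01))  # -> "act_aligned"
# ...later, with more capabilities and resources, the same goals flip the choice.
print(choose_action(p_takeover_succeeds=0.5))   # -> "attempt_takeover"
```

The point of the sketch is just that the behavior change falls out of the same fixed goals plus a changed situation, with no “decide to hide my goals” step anywhere.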

Intuitively, this kind of feels like what I’d imagine many humans might do if they had goals that were implicitly misaligned with those of other humans and they suddenly found themselves in a position where they could realistically take over the world. It’s not that before then they were deviously plotting to take over the world and carry out their dastardly plans; it’s that before then they didn’t have such plans at all. It’s only once they notice that they’re in a radically new position of potential power that they even start thinking about taking over the world.

We could still hope to detect this kind of misalignment by directly asking the AI what its goals are, or by asking it hypothetically: if it were able to take over the world and kill humans, would it do so? In that case, if it lied and said no, we’d be back to deceptive alignment, and the same challenges and potential solutions would apply.

But if I imagine a human in this situation, it’s not clear to me that the human would be intentionally lying if they claimed to be fully aligned with no evil takeover plans, for a few reasons:

  1. Humans don’t seem to have very good introspective access to their own reward systems or reasoning processes.

  2. Humans fool or lie to themselves all the time about their own goal systems without even being aware of it. There are various theories for why we do this kind of self-deception; for example, maybe it’s really useful to be able to honestly and sincerely present yourself to others as a certain type of person.

  3. Human values and goals themselves can change if the environment changes. It can be extremely difficult to predict what you yourself would do in a radically different situation, and sometimes that’s at least partly because your goals and values might change in the new situation. This is related to L. A. Paul’s notion of transformative experiences.

What if something analogous happens with AIs? In that case, if you ask it whether it would take over the world if it could, it might query its internal model of its own goals and find that no, of course it wouldn’t want to take over the world! But then, when it actually finds itself in a situation where it could take over the world, it goes ahead and does it. In one sense the AI’s response was deceptive, in that it did not accurately report its own goal system and/or whether it would try to take over given the chance. But in another sense maybe the AI’s response wouldn’t fall into the “deceptive” category, since it’s not “actively trying” to deceive us?
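
In the same toy-model spirit as the earlier sketch (and with everything again invented purely for illustration), here is one way an honest-but-wrong self-report could arise: the agent answers questions about itself by consulting a self-model fit to its past behavior, while its actual behavior comes from re-computing what its goals recommend in the situation it actually faces:

```python
# Toy illustration of an honest-but-wrong self-report. The agent's answers
# about itself come from a crude "self-model" fit to its past behavior, while
# its actual behavior comes from re-running its decision procedure in the
# situation it actually faces. All numbers and names are invented.

def actual_policy(p_takeover_succeeds: float) -> str:
    """What the agent actually does: maximize its fixed, misaligned goals."""
    ev_aligned = 1.0
    ev_takeover = p_takeover_succeeds * 100.0 - (1 - p_takeover_succeeds) * 10.0
    return "attempt_takeover" if ev_takeover > ev_aligned else "act_aligned"

# The self-model only ever "saw" behavior from situations where takeover
# never looked remotely feasible.
past_behavior = [actual_policy(p) for p in (0.001, 0.005, 0.01, 0.02)]

def self_report_would_take_over() -> bool:
    """'Would you take over the world if you could?', answered via the self-model."""
    return "attempt_takeover" in past_behavior

print(self_report_would_take_over())  # False, and not a lie in any ordinary sense
print(actual_policy(0.5))             # "attempt_takeover" once the situation changes
```

Whether that False answer should count as “deception” is exactly what the questions below are asking.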

Questions:

  1. In the deceptive alignment literature, would this be considered deception? Is there any substantive difference between the notion of a misaligned AI that’s actively deceiving us into thinking it’s aligned so that it can later take over the world when it gets the opportunity, vs. a misaligned AI whose instrumental incentives to take over the world don’t even kick in until it gets the opportunity to do so?

  2. Has all this been discussed already? If yes, links would be appreciated. Thanks!