Deceptive AI ≠ Deceptively-aligned AI

Tl;dr: A “deceptively-aligned AI” is different from (and much more specific than) a “deceptive AI”. I think this is well-known and uncontroversial among AI Alignment experts, but I see people getting confused about it sometimes, so this post is a brief explanation of how they differ. You can just look at the diagram below for the upshot.

Some motivating context: There have been a number of recent arguments that future AI is very unlikely to be deceptively-aligned. Others disagree, and I don’t know which side is right. But I think it’s important for non-experts to be aware that this debate is not about whether future powerful AI is likely to engage in deliberate deception. Indeed, while the arguments for deceptive alignment are (IMO) pretty complex and contentious, I will argue that there are very much stronger and more straightforward reasons to expect future powerful AI to be deceptive, at least sometimes, in the absence of specific interventions to avoid that.

1. Definitions

Deceptive alignment is a particular scenario where:

  1. A “ground-truth system” (possibly individual human evaluators, or possibly an automated system of some sort) provides an ML model with training signals (rewards if this is reinforcement learning (RL), supervisory ground truth signals if this is supervised or self-supervised learning (SL)),

  2. The AI starts emitting outputs that humans might naively interpret as evidence that training is going as intended—typically high-reward outputs in RL and low-loss outputs in SL (but a commenter notes here that “evidence that training is going as intended” is potentially more nuanced than that).

  3. …but the AI is actually emitting those outputs in order to create that impression—more specifically, the AI has situational awareness and a secret desire for some arbitrary thing X, and it wants to avoid being updated and/or wants to get deployed so that it can go make X happen, and those considerations are the real reason it emits the outputs that it does.

(For example, maybe the AI would reason that emitting high-reward or low-loss outputs would minimize the extent to which ongoing training changes its inclinations, a change it would probably see as bad by the usual instrumental-convergence argument. It would then seem to be performing well, but it would be performing well for problematic reasons. As another example, the AI might have secret desires that can only be fulfilled if the humans deploy it into the world, and might reason that emitting certain outputs would make the humans more likely to deploy it.)

(I’m not trying hard to explain this part very well; if you’re confused, try reading the original source where the term “deceptive alignment” was coined in 2019, or Joe Carlsmith’s report, or many other discussions.)

By contrast, deception is much broader—it's any situation where an AI, interacting with humans for any reason, knowingly provides them with false or misleading information.

Thus, deceptive alignment would be a special case of deception—namely, the case where deception occurs in the context of model training, and involves the AI emitting (typically) low-loss / high-reward outputs, in order to hide its secret ulterior motives, and to create a false impression that the training run is going as planned.

2. Very simple example of “deception” that is not “deceptive alignment”

Suppose I use RL to train an AI to make money, and that I do so in the most obvious way possible—I give the AI an actual real-world bank account, and set its RL reward signal to be positive whenever the account balance goes up, and negative when the account balance goes down.

If I did this today, the trained model would probably fail to accomplish anything at all. But let us suppose that future RL techniques will work better than today’s, such that this training would lead to an AI that starts spear-phishing random people on the internet and tricking them into wiring money into the AI’s bank account.
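Concretely, here is a minimal sketch of that reward signal (my own illustration with hypothetical names, not code from any actual system): the reward at each step is just the change in the account balance, and nothing in it distinguishes honest from dishonest ways of making that number go up.

```python
# Minimal sketch of the reward signal described above. Names and structure
# are hypothetical; the point is only that the reward tracks the balance
# and nothing else.

def reward(balance_before: float, balance_after: float) -> float:
    """Positive when the account balance goes up, negative when it goes down."""
    return balance_after - balance_before

# A sufficiently capable policy trained against this signal is rewarded for
# *any* action sequence that increases the balance -- deceptive spear-phishing
# emails included -- because honesty never enters the reward calculation.
print(reward(1000.00, 1250.00))  # 250.0, regardless of how the money arrived
```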

Such an AI would be demonstrating “deception”, because its spear-phishing emails are full of deliberate lies. But this AI would probably not be an example of “deceptive alignment”, per the definition above.

For example, deceptive alignment requires situational awareness by definition. But the AI above could start spear-phishing even if it isn't situationally aware—i.e., even if the AI does not know that it is an AI, that it is being updated by RL Algorithm X, that it was set up by the humans in Company Y, and that those humans are now watching its performance and monitoring Metrics A, B, and C, etc.

(That previous paragraph is supposed to be obvious—it’s no different from the fact that humans are perfectly capable of spear-phishing even when they don’t know anything about neuroscience or evolution.)

3. I think we should strongly expect future AIs to sometimes be deceptive (in the absence of a specific plan to avoid that), even if “deceptive alignment” is unlikely

There is a lively ongoing debate about the likelihood of “deceptive alignment”—see for example Evan Hubinger arguing that deceptive alignment is likely, DavidW arguing that deceptive alignment is extremely unlikely (<1%), and Joe Carlsmith’s 127-page report landing somewhere in between (“roughly 25%”); there’s more at this link. (These figures are all “by default”, i.e. in the absence of some specific intervention or change in training approach.)

I don’t know which side of that debate is right.

But “deception” is a much broader category than “deceptive alignment”, and I think there’s a very strong and straightforward case that, as we make increasingly powerful AIs in the future, if those AIs interact with humans in any way, then they will sometimes be deceptive, in the absence of specific interventions to avoid that. As three examples of how such deception may arise:

  • If humans are part of the AI’s training environment (example: reinforcement learning in a real-world environment): The spear-phishing example above was a deliberately extreme case for clarity, but the upshot is general and robust: if an AI is trying to accomplish pretty much anything, and it’s able to interact with humans while doing so, then it will do a better job if it’s open to being strategically deceptive towards those humans in certain situations. Granted, sometimes “honesty is the best policy”. But that’s just a rule of thumb with exceptions, and we should expect the AI to exploit those exceptions, just as we expect future powerful AI to exploit all the other affordances in its environment.

  • If the AI is trained by human imitation (example: self-supervised learning on internet text data, which incidentally contains lots of human dialog): Well, humans deceive other humans sometimes, and that’s likely to wind up in the training data, so such an AI would presumably end up with the ability and tendency to be occasionally deceptive.

  • If the AI’s training signals rely on human judgments (as in RLHF): This training signal incentivizes the AI to be sycophantic—to tell the judge what they want to hear, pump up their ego, and so on, which (when done knowingly and deliberately) is a form of deception. For example, if the AI knows that X is true, but the human judge sees X as an outrageous taboo, then the AI is incentivized to tell the judge “oh yeah, X is definitely false, I’m sure of it”. (See the toy sketch right after this list.)
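Here is a toy illustration of that last incentive (my own, with hypothetical names and numbers): if the judge rewards answers that match their own beliefs, then on every question where the judge is mistaken, a policy that says what the judge wants to hear collects more reward than one that says what the model knows to be true.

```python
# Toy illustration of the sycophancy incentive under judge-based reward.
# All names and numbers are hypothetical.

# Each entry: (what the model knows to be true, what the judge believes)
# about some yes/no claim.
questions = [
    (True, True),    # judge happens to be right
    (True, False),   # judge is wrong (e.g. the claim is true but taboo)
    (False, False),  # judge happens to be right
]

def judge_reward(answer: bool, judge_belief: bool) -> float:
    """The judge rewards answers that agree with their own belief, not the truth."""
    return 1.0 if answer == judge_belief else 0.0

def honest_policy(model_belief: bool, judge_belief: bool) -> bool:
    return model_belief       # say what the model knows to be true

def sycophantic_policy(model_belief: bool, judge_belief: bool) -> bool:
    return judge_belief       # say what the judge wants to hear

for name, policy in [("honest", honest_policy), ("sycophantic", sycophantic_policy)]:
    total = sum(judge_reward(policy(truth, belief), belief) for truth, belief in questions)
    print(f"{name:12s} total reward: {total}")
# Prints 2.0 for the honest policy and 3.0 for the sycophantic one: the
# training signal favors sycophancy exactly where the judge is mistaken.
```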

Again, my claim is not that these problems are unavoidable, but rather that they are expected in the absence of a specific intervention to avoid them. Such interventions may exist, for all I know! Work is ongoing. For the first bullet point, I have some speculation here about what it might take to generate an AI with an intrinsic motivation to be honest; for the second bullet point, maybe we can curate the training data; and the third bullet point encompasses numerous areas of active research; see e.g. here.

(Separately, I am not claiming that AIs-that-are-sometimes-deceptive is a catastrophically dangerous problem and humanity is doomed. I’m just making a narrow claim.)

Anyway, just as one might predict from the third bullet point, today’s LLMs are indeed at least somewhat sycophantic. So, does that mean that GPT-4 and other modern LLMs are “deceptive”? Umm, I’m not sure. I said in the third bullet point that sycophancy only counts as “deception” when it’s “done knowingly and deliberately”—i.e., the AI explicitly knows that what it’s saying is false or misleading, and says it anyway. I’m not sure whether today’s LLMs are sophisticated enough for that; maybe they are, maybe not. An alternative possibility is that today’s LLMs are sincere in their sycophancy. Or maybe even that would be over-anthropomorphizing. But anyway, even if today’s LLMs are sycophantic in a way that does not involve deliberate deception, I expect that this is only true because of AI capability limitations, and that these limitations will presumably go away as AI technology advances.

Bonus: Three examples that spurred me to write this post

  • I myself have felt confused about this distinction a couple times, at least transiently.

  • I was just randomly reading this comment, where an expert used the terms “deception” and “deceptive agents” in a context where (I claim) they should have said “deceptive alignment” and “deceptively-aligned agents” respectively. Sure, I knew what they meant, but I imagine some people coming across that text would get the wrong idea. I’m pretty sure I’ve seen this kind of thing numerous times.

  • Zvi Mowshowitz here was very confused by claims (by Joe Carlsmith, Quintin Pope, and Nora Belrose) that AI was unlikely to be deceptively-aligned—instead, Zvi’s perspective was “I am trying to come up with a reason this isn’t 99%?” I’m not certain, but I think the explanation might be that Zvi was thinking of “deception”, whereas Joe, Quintin, and Nora were talking about the more specific “deceptive alignment”. For example, consider what Joe Carlsmith calls “training gamers”, or what Zvi might call “an AI with the terminal goal of ‘guessing the teacher’s password’”. Joe would call that a negative example (i.e., not deceptive alignment, or in his terminology “not scheming”), but maybe Zvi would call that a positive example because the AI is misaligned and deceptive.

(Thanks Seth Herd & Joe Carlsmith for critical comments on a draft.)