Why deceptive alignment matters for AGI safety

Update: I changed the title from “Why AGI safety researchers should focus mostly on deceptive alignment” to “Why deceptive alignment matters for AGI safety”. I think my original message was too strong and I’m actually much more uncertain about failure modes from AGI than the title suggests.

Comment: after I wrote the first draft of this post, Evan Hubinger published “How likely is deceptive alignment” in which he argues that deceptive alignment is the default outcome of NNs trained with SGD. The post is very good and I recommend that everyone read it. As a consequence, I rewrote my post to focus less on “why is deceptive alignment likely” and more on the high-level arguments for “why should we focus on deceptive alignment”, and I adopted Evan’s nomenclature to prevent confusion.

I’d like to thank Lee Sharkey, Richard Ngo and Evan Hubinger for providing feedback on a draft of this post.

TL;DR: No matter from which angle I look at it, I always arrive at the conclusion that deceptive alignment is either a necessary component of bad AI scenarios or greatly increases their harm. This take is not new and many (most?) people in the alignment community seem to already believe it, but I think there are reasons to write this post anyway.
Firstly, newer members of the alignment community are sometimes not aware (at least I wasn’t in the beginning) that more senior people often implicitly talk about deceptive alignment when they talk about misalignment.

Secondly, there seems to be some disagreement about whether a powerful misaligned AI is deceptive by default or whether such a thing as a “corrigibly aligned” AI can even exist. I hope this post clarifies the different positions.

Epistemic status: Might have reinvented the wheel. Most of the content is probably not new for most people within the alignment community. Hope it is helpful anyway.

Update: after a discussion in the comments, I want to make some clarifications:
1. My definition of deception is a bit inconsistent throughout the post. I’m not sure what the best definition is but I think it is somewhere between “We have no clue what the model is doing” (which doesn’t include active deception) and “The model is actively trying to hide something from us”. Both seem like important failure modes.
2. I don’t think deception is orthogonal to understanding other failure modes like getting what we measure. Deception can be a component of other failure modes.
3. This post should not be interpreted as “everything that isn’t direct work on deception is bad” and more like “we should think about how other research relates to deception”. For example, AI forecasting still seems super valuable to me. However, I think one of the main sources of value from AI forecasting comes from having better models of future AI capabilities and those might be used to predict when the model becomes deceptive and what happens if it does.
4. I’m not sure about all of this. I still find many aspects of alignment confusing and hard to grasp but I’m mildly confident in the statement “most failure modes look much worse when you add deception” and thus my takeaway is something like “deception is not everything but probably a good thing to work on right now”.

Definition—deceptive alignment

By deceptive alignment, I mean an AI system that seems aligned to human observers and passes all relevant checks but is, in fact, not aligned and ultimately aims to achieve another non-aligned goal. In Evan’s post, this means that the NN has actively made an incomplete proxy of the true goal a terminal goal. Note that the AI is aware of the fact that we wanted it to achieve a different goal and therefore actively acts in ways that humans will perceive as aligned. If the AI accidentally followed a different goal, e.g. due to a misunderstanding or a lack of capabilities, this is not deceptive alignment but is described as corrigible alignment in Evan’s post. Corrigible alignment essentially means that the AI currently has an incorrect understanding of the world or our goals because it learned a wrong proxy, but the proxy is not a terminal goal, i.e. the AI has no stake in preserving it and we could correct it once it is detected.

In other words, for deception to be at play, I assume that the AI is actively adversarial but pretends not to be.

Note that this other goal doesn’t have to be meaningful or special: the deceptively aligned AI could be a paperclip maximizer pretending to be a sophisticated policy recommendation system or a sophisticated policy recommendation system pretending to be a paperclip maximizer.

Important crux—is every powerful misaligned AI deceptively aligned?

It seems unclear whether the distinction between corrigible alignment and deceptive alignment makes sense. Potentially, deception is just a function of capabilities and initial goals, e.g. once the model is capable enough to understand that it has goals, that it is being trained to achieve them and that it currently has goals that differ from the intended goals (e.g. because it learned the wrong proxy), it automatically tries to become deceptive for instrumental reasons. In other words, if the AI is capable and doesn’t learn the right goal early in the training process, it is likely to be deceptive by default. Richard Ngo, for example, argues that “Once policies can reason about their training processes and deployment contexts, they’ll learn to deceptively pursue misaligned goals while still getting high training reward.” in this paper.

In Risks from Learned Optimization, Evan Hubinger writes: “Even a robustly aligned mesa-optimizer that meets the criteria [for deceptive alignment] is incentivized to figure out the base objective in order to determine whether or not it will be modified since before doing so it has no way of knowing its own level of alignment with the base optimizer.” In a draft of this post, he commented: “Any proxy-aligned model that meets the criteria for deceptive alignment—most notably that cares about something in the world over time—will want to be deceptive unless it is perfectly confident that it is perfectly aligned.” Thus, it is possible that powerful models that don’t fulfill these criteria are corrigibly aligned, i.e. there are some forms of (mis-)alignment that are not deceptive by default.

There seems to be some agreement on the ends of the spectrum, e.g. both sides of the debate would probably agree that very powerful and situationally aware models are deceptive for instrumental reasons. They would likely also agree that for very bad models the category of alignment doesn’t make that much sense to begin with, e.g. an MNIST classifier that doesn’t have high accuracy is not “misaligned”, it’s just a bad model.

I think there are multiple possible explanations for this kind of disagreement. They include:

  1. Not all powerful misaligned models are deceptive: Maybe not all powerful models are automatically deceptively misaligned, e.g. because the instrumental incentive doesn’t always hold true. Maybe, sometimes honesty or cooperation is the most rational way to maximize a goal for the AI. I’m not sure there is such a case but I’m willing to be persuaded. Alternatively, “very powerful” just doesn’t make a statement about the complexity of the goal. For example, a very powerful language model might still “only” care about predicting the next word and potentially the incentives to become deceptive are just not very strong for next-word prediction.

  2. Not all situationally aware models are deceptive: Maybe not all situationally aware models are automatically deceptive. For example, we could imagine a model that maintains a lot of uncertainty about its goals yet is situationally aware (suggested by Lee Sharkey). Evan provided the example of a model that is situationally aware and very competent but for some reason only cares about the reward at any given point in time and not about future rewards. Thus, it could be arbitrarily competent and situationally aware but not deceptive. Evan further pointed out that some people might not call such a short-sighted model situationally aware.

I am personally uncertain about whether all sufficiently powerful and situationally aware models are automatically deceptive. My rough beliefs are:

  • At the lower end of the capability spectrum, it doesn’t make that much sense to talk about alignment in the first place. If an MNIST classifier is not good at classifying digits, it’s just bad. I think the concept of alignment implicitly carries an intuition that the model is at least somewhat capable.

  • At the upper end of the capability spectrum, models seem to mostly fall in the “internally aligned” or “deceptively aligned” category. I think it is implausible that there is a very powerful and situationally aware model that does not become deceptive for instrumental reasons (unless you have a very low bar for what counts as situationally aware). Even for the simple goal of next-word prediction, deception seems like a rational choice at some level of capabilities. Maybe there are some exceptions to this but I currently think the default will be deceptive alignment (unless we make substantial gains in alignment techniques).

  • I find it plausible that there are somewhat powerful models that we would call corrigibly aligned. I think current LLMs could already be at the lower end of this spectrum. They already work too well to be just called bad models but they also have too many flaws to be called internally aligned and they are probably not deceptive.

The case

In this section, I want to argue that deceptive alignment is an especially important component of AGI safety and that the AGI safety community should, therefore, prioritize it more than we currently do.

In case you think that all forms of misalignment are necessarily deceptive, you probably don’t have to read the rest of this post.

Scale

My current impression is that, in the long run, the biggest harms (x-risk, s-risk or other) come from huge power asymmetries between agents with different goals, e.g. humanity vs. a misaligned AGI. Effectively, a very powerful agent can just say “I want this thing and you can’t stop me” and then do it. Deceptive alignment seems to contribute to these power asymmetries in two key ways.

Firstly, the way in which large power asymmetries are achieved is often through deceptive alignment. In many scenarios in which an agent that is not yet powerful enough to pose catastrophic risks gains more power, it does so by pretending to be good for long enough to gain that power in the meantime, e.g. by collecting resources or setting up structures for a future take-over. Once it has enough power, it starts to pursue its own goals, which leads to disaster. The deception was necessary to get from an insufficiently powerful (more or less harmless) state to a sufficiently powerful (pretty harmful) state. Had the AI revealed its true intentions in the beginning, it would likely not have been able to amass that many resources.

Secondly, with deceptive alignment, the upper bound of harm from one misaligned AI system seems to be much higher than from corrigibly aligned systems. A deceptively aligned AI can effectively just wait, observe its observers, collect resources whenever possible, and so forth, once it has convinced the observers that it is aligned. The longer it waits to strike the higher the potential damage. When it waits long enough, the damage might be really big, e.g. extinction or anti-utopia.
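One way to make the “longer it waits, higher the damage” intuition concrete is a toy model. The sketch below is purely illustrative and is not taken from any of the cited posts: it assumes that resources compound by a fixed factor per step, that the agent is caught with a fixed probability per step (harm counted as zero if caught), and that deception simply means a lower per-step detection probability. The function name `expected_harm` and all numbers are made-up assumptions.

```python
def expected_harm(wait_steps: int, growth: float, p_detect: float,
                  initial_resources: float = 1.0) -> float:
    """Expected resources available at strike time in a toy model.

    Each step the agent's resources multiply by `growth`, and with
    probability `p_detect` it is caught and stopped (harm counted as 0).
    """
    survival = (1.0 - p_detect) ** wait_steps
    return survival * initial_resources * growth ** wait_steps


# All numbers below are made-up assumptions for illustration only.
GROWTH = 1.5
P_DETECT_TRANSPARENT = 0.9   # misalignment is visible, so it is usually caught
P_DETECT_DECEPTIVE = 0.05    # misalignment is hidden, so it is rarely caught

for steps in (1, 5, 10):
    open_harm = expected_harm(steps, GROWTH, P_DETECT_TRANSPARENT)
    hidden_harm = expected_harm(steps, GROWTH, P_DETECT_DECEPTIVE)
    print(f"wait={steps:2d}  transparent={open_harm:.6f}  deceptive={hidden_harm:.2f}")
```

In this toy model, expected harm grows with waiting time only when `(1 - p_detect) * growth` exceeds 1. Under the assumed numbers that holds for the deceptive agent but not for the transparent one, which is the asymmetry the paragraph above describes. Real systems will not follow such a simple dynamic, so this is an intuition aid rather than a forecast.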

In general, everything looks worse with deceptive alignment. Most, if not all, scenarios of AI going badly are worse when you add deceptive alignment, i.e. when the other entity not only has different goals but also tries to fool you about them. Let’s look at some examples:

  • A near-term AI system was given the goal to help with a problem, e.g. finding a cure to a disease. For some reason, the goal that the AI actually pursues is a different one, e.g. it wants to collect paperclips. In one world, a corrigibly misaligned AI starts to collect paperclips and is immediately stopped. On closer inspection, it turns out that something went wrong during the training process and the AI learned the wrong goal.
    In a second world, a deceptive AI starts to pretend to work on a cure for the disease. It knows that humans think its goal is to work on the cure but it actually wants to collect more paperclips. Thus it creates an illusion of progress for the cure while collecting more and more resources in the background. The official reason for these resources is to work on the cure but ultimately it will use them to gain as many paperclips as possible. Which system will create more damage?

  • A country wants to use an AI to gain more power. This AI is misaligned with most human values because it wants to gain power for the country at the cost of other countries—in some sense, it is intentionally misaligned. In one world, the AI is deployed, bribes a bunch of people, suggests some policies to other countries but relatively quickly the other countries have figured out what’s going on.
    In the second world, the AI acts as if it was helping all countries but is pursuing actions to help its own country in the background. Which system will create more damage in the long run?

  • Ajeya Cotra has used the analogy of Saints (internally aligned), Sycophants (corrigibly aligned) and Schemers (deceptively aligned) to showcase the potential damage of deceptively aligned AI. Assume you run a company and your hiring pool includes the following types of people:

    • Saints—people who genuinely just want to help you manage your estate well and look out for your long-term interests.

    • Sycophants—people who just want to do whatever it takes to make you short-term happy or satisfy the letter of your instructions regardless of long-term consequences.

    • Schemers—people with their own agendas who want to get access to your company and all its wealth and power so they can use it however they want.

Which one do you think creates the most damage to your company in the long run?

Neglectedness

I think corrigible alignment, e.g. an AI misunderstanding a specification or a goal being underspecified, will be a problem and will lead to damage. However, I think that conventional AI capabilities researchers will eventually have to address these problems while there is no reason for them to address deceptively misaligned AI before it is too late.

If your system doesn’t work as intended, e.g. because it doesn’t do what the customer wants or even creates some small-scale accidents, this directly hurts your profit margin. Thus, the AI capabilities company has an incentive to work on these problems directly and the error feedback is fast enough to be noticed. I think RLHF provides some evidence for this hypothesis, i.e. GPT-3 didn’t quite get what the customers wanted and thus OpenAI used RLHF to correct some of the wrong proxies learned by GPT-3.

Deceptive alignment, on the other hand, is likely harder to notice and feedback cycles might be much longer. For example, an AI could work fine for a couple of years before it suddenly starts to do very weird things. By this time, the AI company might no longer exist or might not feel responsible for the damage.

Since most research on deceptive alignment might sound a bit sci-fi to a non-safety-conscious person, it is hard or impossible to get funding or support to work on it in academia. Furthermore, the incentives of most academics are to write incremental and concrete papers. Thus, I don’t think most academics will work on deceptive alignment until it is incentivized which might be too late.

Additionally, detecting deceptive alignment is hard. It is likely a hard problem in general and we are currently very far away from having any tools to reliably detect it. Often people shy away from working on hard problems because they don’t feel rewarding and tend to imply failure. Thus, without a very strong motivation to work on deceptive alignment, the vast majority of people will work on simpler problems.

For all these reasons, I think that safety-conscious researchers have a special reason to work on deceptive alignment.

Urgency

I think most, if not all, short-timeline x-risk scenarios (at least the ones I find plausible) contain deceptive alignment as a key component. I expect the current trend in capabilities to continue in small steps, e.g. there are no insane levels of capability differences between a model of size X vs. a model of size 10X (where X is e.g. the number of parameters or FLOP for the training run). Therefore, I don’t expect models to suddenly get so powerful that they could immediately overpower humanity. Thus, I don’t expect that early AGI systems with roughly human-like capabilities will be able to enforce their misaligned goal if they make it transparent or don’t actively hide it.

In my opinion, the most plausible way for such an early system to lead to a catastrophe is by realizing that it is insufficiently powerful, pretending to be aligned and amassing resources in the background. A corrigibly misaligned system of equal power will likely be caught and corrected or stopped early on.

Therefore, if we want to prevent the catastrophic risks that are most likely to happen early, we should prioritize deceptive alignment.

Tractability

I expect deceptive alignment to be less tractable than other things in alignment, i.e. I expect that it is harder to make progress and to know whether you have made progress compared to most other AI safety research.

However, not all hope is lost. I think that the most obvious answer to the problem of deceptive systems is: “we just need to understand the systems better”. By this I mean we need to understand their goals, their world model, their behavior, their training procedure, and much more. I feel like we should be able to make tractable progress on these kinds of questions as can already be seen with mechanistic interpretability and other kinds of interpretability efforts.

Why deception is critical compared to corrigible misalignment

I think one possible objection to the above is that most of the harm comes from the fact that the model has a different goal rather than from whether it is deceptively misaligned or accidentally misaligned. While I think that corrigible misalignment is a problem, I think the biggest risks come from deception.

The main reason for this is that in deceptive alignment the other entity is actively trying to fool you, i.e. to pick the action that maximizes your trust while still advancing its own goal. To understand this intuition, consider the following examples.

  • What is harder? Finding something that you lost or something that someone else has actively hidden?

  • Assume you play a game of chess. You think the other person has the same goal as you, e.g. to take the other’s king. It looks like they follow this plan but after they have taken both of your rooks, they jump up and tell you that they won. Their goal was to pretend to play for the king but actually take both rooks. Which version is harder? The version of chess where both players try to take the opponent’s king or the version where you don’t know what the other player’s goal is?

  • I think there is probably a more general version of this for the complexity of different types of games. My expectation would be that games where the other player’s goal is unknown are harder than games where the goals are known. Intuitively, when the goal is unknown, it’s much harder to prune the game tree and thus complexity explodes. I haven’t looked for a mathematical version of this, but there probably already is one.
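The chess intuition can be sketched in a tiny two-move game. This is a minimal illustration under invented assumptions, not a formal result: the move names, the payoff function and the two candidate opponent goals are all hypothetical. It does not demonstrate the complexity claim about game-tree pruning; it only shows the simpler point that the payoff you can guarantee drops once you have to plan against several possible opponent goals instead of one known goal.

```python
from typing import Callable, Iterable

MY_MOVES = ("defend_king", "defend_rooks")
OPP_MOVES = ("attack_king", "attack_rooks")

# An opponent goal scores the opponent's reply given my move (hypothetical type).
Goal = Callable[[str, str], int]


def my_payoff(mine: str, theirs: str) -> int:
    """I score 1 if I defended the piece they attacked, else 0."""
    return int(mine.split("_")[1] == theirs.split("_")[1])


# Two made-up candidate goals for the opponent.
OPP_GOALS: dict[str, Goal] = {
    "wants_king": lambda mine, theirs: int(theirs == "attack_king"),
    "wants_rooks": lambda mine, theirs: int(theirs == "attack_rooks"),
}


def opp_reply(goal: Goal, mine: str) -> str:
    """The opponent moves second and picks the reply that is best for its goal."""
    return max(OPP_MOVES, key=lambda theirs: goal(mine, theirs))


def value_known_goal(goal: Goal) -> int:
    """Best payoff I can secure when I know the opponent's goal."""
    return max(my_payoff(m, opp_reply(goal, m)) for m in MY_MOVES)


def value_unknown_goal(goals: Iterable[Goal]) -> int:
    """Best worst-case payoff when the goal could be any of `goals`."""
    goals = list(goals)
    return max(
        min(my_payoff(m, opp_reply(g, m)) for g in goals)
        for m in MY_MOVES
    )


for name, goal in OPP_GOALS.items():
    print(name, "known ->", value_known_goal(goal))
print("goal unknown ->", value_unknown_goal(OPP_GOALS.values()))
```

With a known goal, there is always a defence that matches the attack. With an unknown goal, every defence fails against at least one candidate goal, so the guaranteed payoff falls. More generally, the worst-case value over a set of candidate goals can never exceed the value against any single goal in that set, because the worst case includes that goal.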

I’m aware that this is not a proof but just a bunch of intuition pumps but I think the intuition already clarifies why I think deception is harder. There are more technical arguments about simplicity and speed in Evan’s post.

Related work

Many people in alignment have stated that they think deceptive alignment leads to the biggest risks or are actively working to solve them already. However, I feel like while this belief might be obvious to many more experienced members of the community, it is not common sense among newcomers (might just be my impression). Therefore, I thought it might be helpful to add this very explicit post to the list of works/people that are less explicit about this assumption.

People who seem to think that deceptive alignment is very important include:

  • Evan Hubinger: see e.g. this long post

  • Ajeya Cotra: I read her post “why alignment could be hard with modern DL” as saying that deception is a big problem (probably the biggest but not sure)

  • Paul Christiano: states that the biggest risk comes from deceptive alignment in this comment on OpenAI’s alignment strategy. Also, I think the ELK report already hints implicitly or explicitly at deception being the core problem.

  • Beth Barnes: states “Currently trying to figure out how we’ll know when we’re close to dangerous AI, and how to detect misalignment and deceptive alignment” on her website.

  • Anecdotally, some people who work on interpretability state deceptive alignment as the core problem they’re trying to solve (mostly personal chats, so not sure if I can share names).

  • Anecdotally, some people working at Redwood Research stated deceptive alignment as the core problem they’re trying to solve (mostly personal chats, so not sure if I can share names).

  • There are probably a ton of people that I forgot or simply don’t know of who think that deceptive alignment is important. In any case, lots of people who have thought about these topics a lot come to a roughly similar conclusion, namely that deceptive alignment is where most of the catastrophic risk comes from.

  • I think most senior people mean “deceptively misaligned” when they say “misaligned” but less senior people are sometimes not aware of this.

Implications

To be fair, I’m not very certain what the correct response to this insight is. The two implications I drew for myself are:

  1. I should ask “how does this help with deceptive alignment?” before starting a new project. In retrospect, I haven’t done that a lot and I think most of my projects, therefore, have not contributed a lot to this question. I’ll do that more in the future.

  2. Mechanistic Interpretability and “understanding NNs” in general (sometimes called Science of Deep Learning) seem like more relevant research directions under this framing.

I’m currently exploring a research agenda that addresses some aspects of deceptive alignment. I’ll publish some version of it when I have made enough progress.