Worst-case thinking in AI alignment

Alternative title: “When should you assume that what could go wrong, will go wrong?”

Thanks to Mary Phuong and Ryan Greenblatt for helpful suggestions and discussion, and Akash Wasil for some edits.

In discussions of AI safety, people often propose the assumption that something goes as badly as possible. Eliezer Yudkowsky in particular has argued for the importance of security mindset when thinking about AI alignment.

I think there are several distinct reasons that this might be the right assumption to make in a particular situation. But I think people often conflate these reasons, and I think that this causes confusion and mistaken thinking. So I want to spell out some distinctions.

Throughout this post, I give a bunch of specific arguments about AI alignment, including one argument that I think I was personally getting wrong until I noticed my mistake yesterday (which was my impetus for thinking about this topic more and then writing this post). I think I’m probably still thinking about some of my object level examples wrong, and hope that if so, commenters will point out my mistakes. But I think I’ll stand by the claim that we should be attempting to distinguish between these classes of argument.

My list of reasons to maybe use worst-case thinking

Here’s an attempt at describing some different classes situations where you might want to argue that something goes as badly as it could.

You’re being optimized against

For example, if you’ve built an unaligned AI and you have a team of ten smart humans looking for hidden gotchas in its proposed actions, then the unaligned AI will probably come up with a way of doing something bad that the humans miss. In AI alignment, we most often think about cases where the AI we’re training is optimizing against us, but sometimes we also need to think about cases where other AIs or other humans are optimizing against us or our AIs.

In situations like this, I think Eliezer’s attitude is basically right: we’re being optimized against and so we have to use worst-case thinking and search hard for systems which we can strongly argue are infallible.

One minor disagreement: I’m less into hard takeoffs than he is, so I place less weight than he does on situations where your AI becomes superintelligent enough during training that it can exploit some kind of novel physics to jump an airgap or whatever. (Under my model, such a model probably just waits until it’s deployed to the internet–which is one of the first things that AGI developers want to do with it, because that’s how you make money with a powerful AI–and then kills everyone.)

But I fundamentally agree with his rejection of arguments of the form “only a small part of the space of possible AI actions would be devastatingly bad, so things will probably be fine”.

Scott Garrabrant writes about an argument like this here.

The space you’re selecting over happens to mostly contain bad things

When Hubinger et al argue in section 4.4 of Risks from Learned Optimization that “there are more paths to deceptive alignment than to robust alignment,” they aren’t saying that you get a misaligned mesa-optimizer because the base optimizer is trying to produce an agent that is as misaligned as possible, they’re saying that even though the base optimizer isn’t trying to find a misaligned policy, most policies that it can find are misaligned and so you’ll probably get one. But unlike the previous situation, if instead it was the case that 50% of the policies that SGD might find were aligned, then we’d have a 50% chance of surviving, because SGD isn’t optimizing against us.

I think that AI alignment researchers often conflate these two classes of arguments. IMO, when you’re training an AGI:

  • The AI will try to kill you if it’s misaligned. So if you remove some but not all strategies that any unaligned AI could use to get through your training process, you haven’t made much progress at all.

  • But SGD isn’t trying to kill you, and so if there exist rare misaligned models in the model space that could make it through the training process and then kill you, what matters is how common they are, not whether they exist at all. If you never instantiate the model, it never gets a chance to pervert your optimization process (barring crazy scenarios with acausal threats or whatever).

(I noticed that I was making a mistake related to mixing up these two classes on Sunday; I then thought about this some more and wrote this post.)

You want to solve a problem in as much generality as possible, and so you want to avoid making assumptions that might not hold

There’s a certain sense in which cryptographers make worst-case assumptions in their research. For example, when inventing public key cryptography, cryptographers were asking the question “Suppose I want to be able to communicate privately with someone, but an eavesdropper is able to read all messages that we send to each other. Is there some way to communicate privately regardless?”

Suppose someone responded by saying “It seems like you’re making the assumption that someone is spying on your communications all the time. But isn’t this unrealistically pessimistic?”

The cryptographer’s response would be to say “Sure, it’s probably not usually the case that someone is spying on my packets when I send messages over the internet. But when I’m trying to solve the technical problem of ensuring private communication, it’s quite convenient to assume a simple and pessimistic threat model. Either I’ll find an approach that works in any scenario less pessimistic than the one I solved, or I’ll learn that we actually need to ensure some other way that no-one’s reading my packets.”

Similarly, in the alignment case, sometimes we make pessimistic empirical assumptions when trying to specify settings for our problems, because solutions developed for pessimistic assumptions generalize to easier situations but the converse isn’t true.

As a large-scale example, when we talk about trying to come up with competitive solutions to AI alignment, a lot of the motivation isn’t the belief that there will be literally no useful global coordination around AI.

A smaller-scale example: When trying to develop schemes for relaxed adversarial training, we assume that we have no access to any interpretability tools for our models. This isn’t because we actually believe that we’ll have no interpretability tools, it’s because we’re trying to develop an alternative to relying on interpretability.

This is kind of similar to the attitude that cryptographers have.

Aiming your efforts at worlds where you have the biggest marginal impact

Suppose you are unsure how hard the alignment problem is. Maybe you think that humanity’s odd’s of success are given by a logistic function of the difference between how much alignment progress was made and how hard the problem is. When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.

Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.

And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected. This is the point that Eliezer makes in Security Mindset and the Logistic Success Curve: the security-minded character thinks that it’s so unlikely that a particular security-lax project will succeed at building a secure system that she doesn’t think it’s worth her time to try to help them make marginal improvements to their security.

And so, this kind of thinking only pushes you to aim your efforts at surprisingly bad worlds if you’re already P(doom) < 50%.

This type of thinking is common among people who are thinking about global catastrophic biological risks. I don’t know of any public documents that are specifically about this point, but you can see an example of this kind of reasoning in Andrew Snyder-Beattie’s Peak defence vs trough defence in biosecurity.


Sometimes a problem involves a bunch of weird things that could go wrong, and in order to get good outcomes, it has to be the case that all of them go well. For example, I don’t think that “a terrorist infiltrates the team of labellers who are being used to train the AGI and poisons the data” is a very likely AI doom scenario. But I think there are probably 100 scenarios as plausible as that one, each of which sounds kind of bad. And I think it’s probably worth some people’s time to try to stamp out all these individually unlikely failure modes.

Planning fallacy

Ryan Greenblatt notes that you can also make a general reference class claim that people are too optimistic (planning fallacy etc.).

Differences between these arguments

Depending on which of these arguments you’re making, you should respond very differently when someone says “the thing you’re proposing is quite far fetched”.

  • If the situation involves being optimized against, you say “I agree that that action would be quite a weird action among actions. But there’s a powerful optimization process selecting for actions like that action. So I expect it to happen anyway. To persuade me otherwise, you need to either claim that there isn’t adversarial selection, or that bad actions either don’t exist or are so hard to find that an adversary won’t possibly be able to find them.”

  • If you think that the situation involves a random process selecting over a space that is almost all bad, then you should say “Actually I disagree, I think that in fact the situation we’re talking about is probably about as bad as I’m saying; we should argue about what the distribution actually looks like.”

  • If you are making worst-case assumptions as part of your problem-solving process, then you should say “I agree that this situation seems sort of surprisingly bad. But I think we should try to solve it anyway, because solving it gives us a solution that is likely to work no matter what the empirical situation turns out to be, and I haven’t yet been convinced that my pessimistic assumptions make my problem impossible.”

  • If you’re making worst-case assumptions because you think that P(doom) is low and you are focusing on scenarios you agree are worse than expected, you should say “I agree that this situation seems sort of surprisingly bad. But I want to work on the situations where I can make the biggest difference, and I think that these surprisingly bad situations are the highest-leverage ones to work on.”

  • If you’re engaging in Murphyjistu, you should say “Yeah this probably won’t come up, but it still seems like a good idea to try and crush all these low-probability mechanisms by which something bad might happen.”

Mary Phuong proposes breaking this down into two questions:

  • When should you believe things will go badly, because they in fact will go badly? (you’re being optimized against, or the probability of badness is high for some other reason)

  • When should you focus your efforts on worlds where things go badly? I.e. it’s about which parts of the distribution you intervene on, rather than an argument about what the distribution looks like.