Worst-case thinking in AI alignment

Buck 23 Dec 2021 1:29 UTC

LW: 167 AF: 76

Alternative title: “When should you assume that what could go wrong, will go wrong?”

Thanks to Mary Phuong and Ryan Greenblatt for helpful suggestions and discussion, and Akash Wasil for some edits.

In discussions of AI safety, people often propose the assumption that something goes as badly as possible. Eliezer Yudkowsky in particular has argued for the importance of security mindset when thinking about AI alignment.

I think there are several distinct reasons that this might be the right assumption to make in a particular situation. But I think people often conflate these reasons, and I think that this causes confusion and mistaken thinking. So I want to spell out some distinctions.

Throughout this post, I give a bunch of specific arguments about AI alignment, including one argument that I think I was personally getting wrong until I noticed my mistake yesterday (which was my impetus for thinking about this topic more and then writing this post). I think I’m probably still thinking about some of my object level examples wrong, and hope that if so, commenters will point out my mistakes. But I think I’ll stand by the claim that we should be attempting to distinguish between these classes of argument.

My list of reasons to maybe use worst-case thinking

Here’s an attempt at describing some different classes of situations where you might want to argue that something goes as badly as it could.

You’re being optimized against

For example, if you’ve built an unaligned AI and you have a team of ten smart humans looking for hidden gotchas in its proposed actions, then the unaligned AI will probably come up with a way of doing something bad that the humans miss. In AI alignment, we most often think about cases where the AI we’re training is optimizing against us, but sometimes we also need to think about cases where other AIs or other humans are optimizing against us or our AIs.

In situations like this, I think Eliezer’s attitude is basically right: we’re being optimized against and so we have to use worst-case thinking and search hard for systems which we can strongly argue are infallible.

One minor disagreement: I’m less into hard takeoffs than he is, so I place less weight than he does on situations where your AI becomes superintelligent enough during training that it can exploit some kind of novel physics to jump an airgap or whatever. (Under my model, such a model probably just waits until it’s deployed to the internet–which is one of the first things that AGI developers want to do with it, because that’s how you make money with a powerful AI–and then kills everyone.)

But I fundamentally agree with his rejection of arguments of the form “only a small part of the space of possible AI actions would be devastatingly bad, so things will probably be fine”.

Scott Garrabrant writes about an argument like this here.

The space you’re selecting over happens to mostly contain bad things

When Hubinger et al. argue in section 4.4 of Risks from Learned Optimization that “there are more paths to deceptive alignment than to robust alignment,” they aren’t saying that you get a misaligned mesa-optimizer because the base optimizer is trying to produce an agent that is as misaligned as possible. They’re saying that even though the base optimizer isn’t trying to find a misaligned policy, most policies that it can find are misaligned, and so you’ll probably get one. But unlike in the previous situation, if it were instead the case that 50% of the policies that SGD might find were aligned, then we’d have a 50% chance of surviving, because SGD isn’t optimizing against us.

I think that AI alignment researchers often conflate these two classes of arguments. IMO, when you’re training an AGI:

- The AI will try to kill you if it’s misaligned. So if you remove some but not all strategies that any unaligned AI could use to get through your training process, you haven’t made much progress at all.
- But SGD isn’t trying to kill you, and so if there exist rare misaligned models in the model space that could make it through the training process and then kill you, what matters is how common they are, not whether they exist at all. If you never instantiate the model, it never gets a chance to pervert your optimization process (barring crazy scenarios with acausal threats or whatever).
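To make the contrast concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the "model space" is just a count of options, the indifferent process picks uniformly at random (a crude stand-in for SGD, which is not actually uniform), and the adversary is assumed to be able to search the whole space.

```python
import random


def failure_rate_random(n_total: int, n_bad: int, trials: int = 100_000, seed: int = 0) -> float:
    """Estimate how often a uniformly random pick from the space is bad."""
    rng = random.Random(seed)
    # Treat indices 0..n_bad-1 as the bad options.
    hits = sum(rng.randrange(n_total) < n_bad for _ in range(trials))
    return hits / trials


def failure_rate_adversarial(n_total: int, n_bad: int) -> float:
    """An adversary that searches the whole space wins whenever any bad option exists.

    n_total is unused; it is kept only so both functions share a signature.
    """
    return 1.0 if n_bad > 0 else 0.0


if __name__ == "__main__":
    for n_bad in (500, 10, 1, 0):
        print(n_bad, failure_rate_random(1000, n_bad), failure_rate_adversarial(1000, n_bad))
```

Under these assumptions, shrinking the number of bad options from 500 to 1 helps a lot against the random process and not at all against the adversary, which is the distinction the two bullets above are drawing.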

(I noticed that I was making a mistake related to mixing up these two classes on Sunday; I then thought about this some more and wrote this post.)

You want to solve a problem in as much generality as possible, and so you want to avoid making assumptions that might not hold

There’s a certain sense in which cryptographers make worst-case assumptions in their research. For example, when inventing public key cryptography, cryptographers were asking the question “Suppose I want to be able to communicate privately with someone, but an eavesdropper is able to read all messages that we send to each other. Is there some way to communicate privately regardless?”

Suppose someone responded by saying “It seems like you’re making the assumption that someone is spying on your communications all the time. But isn’t this unrealistically pessimistic?”

The cryptographer’s response would be to say “Sure, it’s probably not usually the case that someone is spying on my packets when I send messages over the internet. But when I’m trying to solve the technical problem of ensuring private communication, it’s quite convenient to assume a simple and pessimistic threat model. Either I’ll find an approach that works in any scenario less pessimistic than the one I solved, or I’ll learn that we actually need to ensure some other way that no-one’s reading my packets.”

Similarly, in the alignment case, sometimes we make pessimistic empirical assumptions when trying to specify settings for our problems, because solutions developed for pessimistic assumptions generalize to easier situations but the converse isn’t true.

As a large-scale example, when we talk about trying to come up with competitive solutions to AI alignment, a lot of the motivation isn’t the belief that there will be literally no useful global coordination around AI.

A smaller-scale example: When trying to develop schemes for relaxed adversarial training, we assume that we have no access to any interpretability tools for our models. This isn’t because we actually believe that we’ll have no interpretability tools, it’s because we’re trying to develop an alternative to relying on interpretability.

This is kind of similar to the attitude that cryptographers have.

Aiming your efforts at worlds where you have the biggest marginal impact

Suppose you are unsure how hard the alignment problem is. Maybe you think that humanity’s odds of success are given by a logistic function of the difference between how much alignment progress was made and how hard the problem is. When you’re choosing between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the magnitude of the derivative of P(doom) with respect to alignment progress is maximized at 50%.
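Here is a small Python sketch of that claim. It assumes a unit-slope logistic, P(success) = 1 / (1 + exp(-(progress - difficulty))); the function names and the unit slope are my choices for illustration, not something the argument depends on. For this curve the derivative with respect to progress is P(success) * P(doom), so it can be written purely in terms of the current P(doom).

```python
import math


def p_success(progress: float, difficulty: float) -> float:
    """Logistic success curve in (progress - difficulty), with unit slope (an assumption)."""
    return 1.0 / (1.0 + math.exp(-(progress - difficulty)))


def marginal_gain(p_doom: float) -> float:
    """d P(success) / d progress for the unit-slope logistic, given the current P(doom)."""
    return p_doom * (1.0 - p_doom)


if __name__ == "__main__":
    for p_doom in (0.01, 0.5, 0.99):
        print(f"P(doom) = {p_doom:.2f}: marginal gain = {marginal_gain(p_doom):.4f}")
```

With these assumptions a unit of progress is worth about 25 times more in a 50% world than in a 1% or 99% world (0.25 versus 0.0099).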

Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.

And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected. This is the point that Eliezer makes in Security Mindset and the Logistic Success Curve: the security-minded character thinks that it’s so unlikely that a particular security-lax project will succeed at building a secure system that she doesn’t think it’s worth her time to try to help them make marginal improvements to their security.

And so, this kind of thinking only pushes you to aim your efforts at surprisingly bad worlds if you already think P(doom) < 50%.

This type of thinking is common among people who are thinking about global catastrophic biological risks. I don’t know of any public documents that are specifically about this point, but you can see an example of this kind of reasoning in Andrew Snyder-Beattie’s Peak defence vs trough defence in biosecurity.

Murphyjitsu

Sometimes a problem involves a bunch of weird things that could go wrong, and in order to get good outcomes, it has to be the case that all of them go well. For example, I don’t think that “a terrorist infiltrates the team of labellers who are being used to train the AGI and poisons the data” is a very likely AI doom scenario. But I think there are probably 100 scenarios as plausible as that one, each of which sounds kind of bad. And I think it’s probably worth some people’s time to try to stamp out all these individually unlikely failure modes.
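A quick way to see why this can be worth someone’s time is to compute the chance that at least one of many unlikely failure modes happens. The sketch below assumes the scenarios are independent and equally likely, and the 0.1% per-scenario figure is purely hypothetical; real failure modes are probably correlated and unequal.

```python
def p_any_failure(p_each: float, n: int) -> float:
    """Probability that at least one of n independent failure modes occurs."""
    return 1.0 - (1.0 - p_each) ** n


if __name__ == "__main__":
    # 100 scenarios at a hypothetical 0.1% each.
    print(f"{p_any_failure(0.001, 100):.3f}")
```

Under those assumptions, 100 scenarios at 0.1% each add up to roughly a 9.5% chance that something on the list goes wrong.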

Planning fallacy

Ryan Greenblatt notes that you can also make a general reference class claim that people are too optimistic (planning fallacy etc.).

Differences between these arguments

Depending on which of these arguments you’re making, you should respond very differently when someone says “the thing you’re proposing is quite far fetched”.

- If the situation involves being optimized against, you say “I agree that that action would be quite a weird action among actions. But there’s a powerful optimization process selecting for actions like that action. So I expect it to happen anyway. To persuade me otherwise, you need to either claim that there isn’t adversarial selection, or that bad actions either don’t exist or are so hard to find that an adversary won’t possibly be able to find them.”
- If you think that the situation involves a random process selecting over a space that is almost all bad, then you should say “Actually I disagree, I think that in fact the situation we’re talking about is probably about as bad as I’m saying; we should argue about what the distribution actually looks like.”
- If you are making worst-case assumptions as part of your problem-solving process, then you should say “I agree that this situation seems sort of surprisingly bad. But I think we should try to solve it anyway, because solving it gives us a solution that is likely to work no matter what the empirical situation turns out to be, and I haven’t yet been convinced that my pessimistic assumptions make my problem impossible.”
- If you’re making worst-case assumptions because you think that P(doom) is low and you are focusing on scenarios you agree are worse than expected, you should say “I agree that this situation seems sort of surprisingly bad. But I want to work on the situations where I can make the biggest difference, and I think that these surprisingly bad situations are the highest-leverage ones to work on.”
- If you’re engaging in Murphyjitsu, you should say “Yeah this probably won’t come up, but it still seems like a good idea to try and crush all these low-probability mechanisms by which something bad might happen.”

Mary Phuong proposes breaking this down into two questions:

- When should you believe things will go badly, because they in fact will go badly? (you’re being optimized against, or the probability of badness is high for some other reason)
- When should you focus your efforts on worlds where things go badly? I.e. it’s about which parts of the distribution you intervene on, rather than an argument about what the distribution looks like.


Alex Flint 29 Jan 2023 16:32 UTC
LW: 6 AF: 3
This is a relatively banal meta-commentary on reasons people sometimes give for doing worst-case analysis, and the differences between those reasons. The post reads like a list of things with no clear through-line. There is a gesture at an important idea from a Yudkowsky post (the logistic success curve idea) but the post does not helpfully expound that idea. There is a kind of trailing-off towards the end of the post as things like “planning fallacy” seem to have been added to the list with little time taken to place them in the context of the other things on the list. In the “differences between these arguments” section, the post doesn’t clearly elucidate deep differences between the arguments, it just lists verbal responses that you might make if you are challenged on plausibility grounds in each case.

Overall, I felt that this post under-delivered on an important topic.
Raemon 15 Jan 2023 8:56 UTC
LW: 2 AF: 1
This piece took an important topic that I hadn’t realized I was confused/muddled about, convinced me I was confused/muddled about it, while simultaneously providing a good framework for thinking about it. I feel like I have a clearer sense of how Worst Case Thinking applies in alignment.
I also appreciated a lot of the comments here that explore the topic in more detail.

johnswentworth 23 Dec 2021 3:47 UTC
LW: 46 AF: 22
A few more reasons...
First: why do software engineers use worst-case reasoning?
- A joking answer would be “the users are adversaries”. For most software this isn’t literally true; the users don’t want to break the software. But users are optimizing for things, and optimization in general tends to find corner cases. (In linear programming, for instance, almost all objectives will be maximized at a literal corner of the set allowed by the constraints.) This is sort of like “being optimized against”, but it emphasizes that the optimizer need not be “adversarial” in the intuitive sense of the word in order to have that effect.
- Users do a lot of different things, and “corner cases” tend to come up a lot more often than a naive analysis might think. If a user is weird in one way, they’re more likely to be weird in another way too. This is sort of like “the space contains a high proportion of bad things”, but with more emphasis on the points in the space being weighted in ways which weight Weirdness more than a naive analysis would suggest.
- Software engineers often want to provide simple, predictable APIs. Error cases (especially unexpected error cases) make APIs more complex.
- In software, we tend to have a whole tech stack. Even if each component of the stack fails only rarely, overall failure can still be extremely common if there are enough pieces, any one of which can break the whole thing. (I worked at a mortgage startup where this was a big problem: we used a dozen external APIs which were each fine 95+% of the time, but that still meant our app was down very frequently overall.) So, we need each individual component to be very highly reliable.
And one more, generated by thinking about some of my own use-cases:
- Unknown unknowns. Worst-case reasoning forces people to consider all the possible failure modes, and rule out any unknown unknowns.
These all carry over to alignment pretty straightforwardly.
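As a rough worked version of the tech-stack point, assume the dozen APIs fail independently, every request needs all of them, and each is up exactly 95% of the time (the comment says 95+%, so this is a pessimistic bound rather than the startup’s actual number):

$$P(\text{all 12 up}) = 0.95^{12} \approx 0.54$$

So under those assumptions the app would be fully working only a little over half the time, even though each component looks fine on its own.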
Ajeya Cotra 23 Dec 2021 5:15 UTC
LW: 33 AF: 15
Just want to draw out and highlight something mentioned in passing in the “You want to solve a problem in as much generality as possible...” section. Not only would it be great if you could solve a problem in the worst case, the worst case assumption is also often radically easier to think about than trying to think about realistic cases. In some sense the worst case assumption is the second-simplest assumption you could possibly make about the empirical situation (the simplest being the best case assumption—“this problem never comes up”). My understanding is that proving theorems about average case phenomena is a huge pain and often comes much after proofs about the worst case bounds.
- jsteinhardt 24 Dec 2021 17:10 UTC
  LW: 13 AF: 8
  I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don’t think it’s correct to call it the “second-simplest solution”, since there are many choices of what facet of the environment is worst-case.
  One keyword for this is “partial specification”, e.g. here is a paper I wrote that makes a minimal set of statistical assumptions and worst-case assumptions everywhere else: https://arxiv.org/abs/1606.05313. (Unfortunately the statistical assumptions are not really reasonable so the method was way too brittle in practice.) This kind of idea is also common in robust statistics. But my take would not be that it is simpler—in general it is way harder than just working with the empirical distribution in front of you.
  - davidad 27 Dec 2021 11:45 UTC
    LW: 6 AF: 2
    My interpretation of the NFL theorems is that solving the relevant problems under worst-case assumptions is too easy, so easy it’s trivial: a brute-force search satisfies the criterion of worst-case optimality. So, that being settled, in order to make progress, we have to step up to average-case evaluation, which is harder.
    (However, I agree that once we already need to do some averaging, making explicit and stripping down the statistical assumptions and trying to get closer to worst-case guarantees—without making the problem trivial again—is harder than just evaluating empirically against benchmarks.)
    - jsteinhardt 27 Dec 2021 16:09 UTC
      LW: 2 AF: 1
      Finding the min-max solution might be easier, but what we actually care about is an acceptable solution. My point is that the min-max solution, in most cases, will be unacceptably bad.
      
      And in fact, since $\min_{x} f(\theta, x) \le E_{x}[f(\theta, x)]$, any solution that is acceptable in the worst case is also acceptable in the average case.
      - davidad 28 Dec 2021 9:02 UTC
        LW: 6 AF: 2
        Agreed—although optimizing for the worst case is usually easier than optimizing for the average case, satisficing for the worst case is necessarily harder (and, in ML, typically impossible) than satisficing for the average case.
  - paulfchristiano 28 Dec 2021 16:29 UTC
    LW: 2 AF: 2
    > I think this probably depends on the field. In machine learning, solving problems under worst-case assumptions is usually impossible because of the no free lunch theorem. You might assume that a particular facet of the environment is worst-case, which is a totally fine thing to do, but I don’t think it’s correct to call it the “second-simplest solution”, since there are many choices of what facet of the environment is worst-case.
    Even in ML it seems like it depends on how you formulated your problem/goal. Making good predictions in the worst case is impossible, but achieving low regret in the worst case is sensible. (Though still less useful than just “solve existing problems and then try the same thing tomorrow,” and generally I’d agree “solve an existing problem for which you can verify success” is the easiest thing to do.) Hopefully having your robot not deliberately murder you is a similarly sensible goal in the worst case though it remains to be seen if it’s feasible.
- davidad 27 Dec 2021 11:25 UTC
  LW: 9 AF: 5
  To elaborate this formally,
  - $\max_{\theta} \max_{x} f(\theta, x)$ is best-case
  - $\max_{\theta} \min_{x} f(\theta, x)$ is worst-case
  - $\max_{\theta} E_{x} f(\theta, x)$ is average-case
  $\max$ and $\min$ are both “easier” monoids than $E$, essentially because of dominance relations: for any $\theta$, there’s going to be a single $x$ that dominates all others, in the sense that all other $x' \neq x$ can be excluded from consideration and have no impact on the outcome. Whereas when calculating $E$, the only $x'$ that can be excluded are those outside the distribution’s support.
  $\max$ is even easier than $\min$ because it commutes with the outer $\max$; not only is there a single $x$ that dominates all others, it doesn’t necessarily even depend on $\theta$ (the problem can be solved as $\max_{x} \max_{\theta} f(\theta, x)$ or $\max_{\theta, x} f(\theta, x)$). As a concrete example, the best case for nearly any sorting algorithm is already-sorted input, whereas the worst case depends more on which algorithm is being examined.
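As a toy illustration of the three criteria above, here is a minimal Python sketch. The payoff table and all names are made up for illustration, and the average case assumes a uniform distribution over x.

```python
from statistics import mean

# Hypothetical payoff table f[theta][x]; higher is better.
f = {
    "theta_a": {"x1": 10, "x2": 0, "x3": 2},
    "theta_b": {"x1": 4, "x2": 3, "x3": 5},
    "theta_c": {"x1": 9, "x2": 1, "x3": 8},
}


def best_case(table):
    """argmax over theta of max over x."""
    return max(table, key=lambda t: max(table[t].values()))


def worst_case(table):
    """argmax over theta of min over x."""
    return max(table, key=lambda t: min(table[t].values()))


def average_case(table):
    """argmax over theta of the mean over x (uniform distribution assumed)."""
    return max(table, key=lambda t: mean(table[t].values()))


if __name__ == "__main__":
    print(best_case(f), worst_case(f), average_case(f))
```

On this table the three criteria pick three different choices of theta. The max and min versions only ever look at one entry per row, while the average has to touch every entry in the support.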
jsteinhardt 24 Dec 2021 17:27 UTC
LW: 7 AF: 4
Thanks! I appreciated these distinctions. The worst-case argument for modularity came up in a past argument I had with Eliezer, where I argued that this was a reason for randomization (even though Bayesian decision theory implies you should never randomize). See section 2 here: The Power of Noise.
Re: 50% vs. 10% vs. 90%. I liked this illustration, although I don’t think your argument actually implies 50% specifically. For instance if it turns out that everyone else is working on the 50% worlds and no one is working on the 90% worlds, you should probably work on the 90% worlds. In addition:
* It seems pretty plausible that the problem is overall more tractable in 10% worlds than 50% worlds, so given equal neglectedness you would prefer the 10% world.
* Many ideas will generalize across worlds, and recruitment / skill-building / organization-building also generalizes across worlds. This is an argument towards working on problems that seem tractable and relevant to any world, as long as they are neglected enough that you are building out distinct ideas and organizational capacity (vs. just picking from the same tree as ML generally). I don’t think that this argument dominates considerations, but it likely explains some of our differences in approach.
In the terms laid out in your post, I think my biggest functional disagreement (in terms of how it affects what problems we work on) is that I expect most worst-case assumptions make the problem entirely impossible, and I am more optimistic that many empirically-grounded assumptions will generalize quite far, all the way to AGI. To be clear, I am not against all worst-case assumptions (for instance my entire PhD thesis is about this) but I do think they are usually a source of significant added difficulty and one has to be fairly careful where they are making them.
For instance, as regards Redwood’s project, I expect making language models fully adversarially robust is impossible with currently accessible techniques, and that even a fairly restricted adversary will be impossible to defend against while maintaining good test accuracy. On the other hand I am still pretty excited about Redwood’s project because I think you will learn interesting things by trying. (I spent some time trying to solve the unrestricted adversarial example competition, totally failed, but still felt it was a good use of time for similar reasons, and the difficulties for language models seem interestingly distinct in a way that should generate additional insight.) I’m actually not sure if this differs that much from your beliefs, though.
davidad 27 Dec 2021 12:18 UTC
LW: 6 AF: 4
Somewhere between worst-case and average-case performance is quantile-case performance, known in SRE circles as percentile latency and widely measured empirically in practice (but rarely estimated in theory). Formally, optimizing $p$-quantile-case performance looks like $\max_{\theta} \sup \{ k \mid E_{x} [\![ f(\theta, x) > k ]\!] > p \}$ (compare to the expressions in my other comment for the other cases). My impression is that quantile-case is heavily underexplored in theoretical CS and also underused in ML, with the exceptions of PAC learning and VC theory.
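A minimal Python sketch of that definition, for a finite set of outcomes with x drawn uniformly (both the uniform distribution and the example numbers are assumptions for illustration). For a finite sample, the supremum in the formula is the smallest observed value v such that the fraction of outcomes strictly greater than v is at most p.

```python
def quantile_case(values, p):
    """sup{k : P(f > k) > p} for x uniform over a finite list of outcomes."""
    n = len(values)
    for v in sorted(values):
        # Fraction of outcomes strictly better than v.
        if sum(u > v for u in values) / n <= p:
            return v
    raise ValueError("values must be non-empty and p must be non-negative")


def best_theta(table, p):
    """argmax over theta of the p-quantile-case performance."""
    return max(table, key=lambda t: quantile_case(table[t], p))


# Hypothetical outcomes f(theta, x) for four equally likely values of x.
f = {
    "theta_a": [10, 0, 2, 9],
    "theta_b": [4, 3, 5, 4],
}

if __name__ == "__main__":
    for p in (0.25, 0.5, 0.75):
        print(p, best_theta(f, p))
```

With this convention a larger p is more pessimistic: on the table above, p = 0.75 recovers the worst case (favoring theta_b), while p = 0.25 looks near the top of each distribution (favoring theta_a).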
- davidad 27 Dec 2021 13:26 UTC
  LW: 3 AF: 2
  Here’s the results of an abbreviated literature search for papers that bring quantile-case concepts into contact with contemporary RL and/or deep learning:
  - Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. Christoph Dann, Tor Lattimore, Emma Brunskill. NIPS 2017.
    - Defines a concept of “Uniform-PAC bound”, which is roughly when $p$-quantile-case episodic regret scales polynomially in $1 / (1 - p)$.
    - Proves that a Uniform-PAC bound implies:
      - PAC bound
      - Uniform high-probability regret bound
      - Convergence to zero regret with high probability
    - Constructs an algorithm, UBEV, that has a Uniform-PAC bound
    - Empirically compares quite favorably to other algorithms with only PAC or regret bounds
  - Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. ICML 2019.
    - Defines an even stronger concept of “IPOC bound”, which implies Uniform-PAC, and also outputs a certified per-episode regret bound along with each proposed action.
    - Constructs an algorithm, ORLC, that has an IPOC bound
    - Empirically compares favorably to UBEV
  - Revisiting Generalization for Deep Learning: PAC-Bayes, Flat Minima, and Generative Models. Gintare Dziugaite. December 2018 PhD thesis under Zoubin Ghahramani.
  - Lipschitz Lifelong Reinforcement Learning. Erwan Lecarpentier, David Abel, Kavosh Asadi, et al. AAAI 2021.
    - Defines a pseudometric on the space of all MDPs
    - Proves that the mapping from an MDP to its optimal Q-function is (pseudo-)Lipschitz
    - Uses this to construct an algorithm, LRMax, that can transfer-learn from past similar MDPs while also being PAC-MDP
  - Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. Jiafan He, Dongruo Zhou, Quanquan Gu. NIPS 2021.
    - Constructs an algorithm, FLUTE, that has a Uniform-PAC bound with a certain linearity assumption on the structure of the MDP being learned.
  - Beyond No Regret: Instance-Dependent PAC RL. Andrew Wagenmaker, Max Simchowitz, Kevin Jamieson. August 2021 preprint.
  - Learning PAC-Bayes Priors for Probabilistic Neural Networks. María Pérez-Ortiz, Omar Rivasplata, Benjamin Guedj, et al. September 2021 preprint.
  - Tighter Risk Certificates for Neural Networks. María Pérez-Ortiz, Omar Rivasplata, John Shawe-Taylor, Csaba Szepesvári. ICML 2021.
  - PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, Andreas Krause. ICML 2021.
  - Michaël Trazzi 27 Dec 2021 22:22 UTC
    LW: 3 AF: 3
    I would add “How useful is quantilization for specification gaming?” by Ryan Carey.
Vaniver 23 Dec 2021 15:17 UTC
LW: 3 AF: 2
> When you’re considering between a project that gives us a boost in worlds where P(doom) was 50% and projects that help out in worlds where P(doom) was 1% or 99%, you should probably pick the first project, because the derivative of P(doom) with respect to alignment progress is maximized at 50%.
> Many prominent alignment researchers estimate P(doom) as substantially less than 50%. Those people often focus on scenarios which are surprisingly bad from their perspective basically for this reason.
> And conversely, people who think P(doom) > 50% should aim their efforts at worlds that are better than they expected.
This section seems reversed to me, unless I’m misunderstanding it. If “things as I expect” are P(doom) 99%, and “I’m pleasantly wrong about the usefulness of natural abstractions” is P(doom) 50%, the first paragraph suggests I should do the “better than expected” / “surprisingly good” world, because the marginal impact of effort is higher in that world.
[Another way to think about it is surprising in the direction you already expect is extremizing, but logistic success has its highest derivative in the middle, i.e. is a moderating force.]
JamesFaville 18 Jun 2023 12:44 UTC
1 point
I like this post a lot! Three other reasons came to mind, which might be technically encompassed by some of the current ones but seemed to mostly fall outside the post’s framing of them at least.
Some (non-agentic) repeated selections won’t terminate until they find a bad thing
In a world with many AI deployments, an overwhelming majority of deployed agents might be unable to mount a takeover, but the generating process for new deployed agents might not halt until a rare candidate that can mount a takeover is found. More specifically, consider a world where AI progress slows (either due to governance interventions or a new AI winter), but people continue conducting training runs at a fairly constant level of sophistication. Suppose that for these state-of-the-art training runs (i) there is only a negligible chance of finding a non-gradient-hacked AI that can mount a takeover or enable a pivotal act, but (ii) there is a tiny but nonnegligible chance of finding a gradient hacker that can mount a takeover.^[1] Then eventually we will stumble across an unlikely training run that produces a gradient hacker.

This problem mostly seems like a special case of You’re being optimised against, though here you are not optimised against by an agent, but rather by the nature of the problem. Alternatively, this example could be lumped into The space you’re selecting over happens to mostly contain bad things if we either (i) reframe the space under consideration from “deployed AIs” to “AIs capable of mounting a takeover” (h/t Thomas Kehrenberg), or (ii) reframe The space you’re selecting over happens to mostly contain bad things to The space you’re selecting over happens to mostly contain bad things, relative to the number of selections made. But I think the fact that a selection may not terminate until a bad thing has been found is an important thing to pay attention to when it comes up, and weakly think it’d be useful to have a separate conceptual handle for it.

Aiming your efforts at worst-case scenarios
As long as some failure states are worse than others, optimising for the satisfaction of a binary success criterion won’t generally be sufficient to maximise your marginal impact. Instead, you should target worlds based in part on how bad failure within them would be, along with the change in success probability for a marginal contribution. For example, maybe many low P(doom) worlds are such because intent-aligning AI turns out to be pretty straightforward in them. But easy intent-alignment may imply higher misuse risk, such that if misuse risk is more concerning than accident risk then contributing towards solving alignment problems in ways robust to misuse may remain very high impact in easy-intent-alignment worlds.^[2]

One alternative way to state this consideration is that in most domains, there are actually multiple overlapping success criteria. Sometimes the more easily satisfied ones will be much higher-priority to target—even if your marginal contributions result in smaller changes to the odds of satisfying them—because they are more important.

This consideration is the main reason I prioritise worst-case AI outcomes (i.e. s-risks) over ordinary x-risk from AI.

Some bad things might be really bad
In a similar vein, for The space you’re selecting over happens to mostly contain bad things, it’s not the raw probability of selecting a bad thing that matters, but the product of that with the expected harm of a bad thing. Since some bad things are Really Very Terrible, sometimes it will make sense to use worst-case assumptions even when bad things are quite rare, as long as the risk of finding one isn’t Pascalian. I think the EU of an insecure selection is at particular risk of being awful whenever the left tail of the utility distribution of things you’re selecting for is much thicker than the right.
1. ^
  This is plausible to me because gradient-hacking could yield a “sharp left turn”, taking us very OOD for the sort of models runs had previously been producing. Some other sharp left turn candidates should work just as well in this example.
2. ^
  This is an interesting example, because in low P(doom) worlds of this sort marginal efforts to advance intent-alignment seem more likely to be harmful. If that were the case, alignment researchers would want to prioritise developing techniques that differentially help align AI to widely endorsed values rather than to the intent of an arbitrary deployer. Efforts to more directly intervene to prevent misuse would also look pretty valuable.
  
  But because of effects like these, it’s not obvious that you would want to prioritise low P(doom) worlds even if you were convinced that failure within them was worse than in high P(doom) worlds, since advancing-intent-alignment interventions might be helpful in most other worlds where it might be harder for malevolent users to make use of them. (And it’s definitely not apparent to me in reality that failure in low P(doom) worlds is worse than in high P(doom) worlds for this reason; I just thought this would make for a good example!)
JBlack 24 Dec 2021 0:36 UTC
1 point
> For example, I don’t think that “a terrorist infiltrates the team of labellers who are being used to train the AGI and poisons the data” is a very likely AI doom scenario. But I think there are probably 100 scenarios as plausible as that one, each of which sounds kind of bad.
There are even much more likely scenarios which have the same basic mechanism and effect, such as “a disgruntled employee poisons the data”, “nation state operation”, “criminal group”, “software bug”, “one intern making an error”, or even “internet trolls for the lulz”. All of these have actually happened to corrupt data for important software projects in subtle and destructive ways.
jbash 23 Dec 2021 15:10 UTC
0 points
Defending the security mindset here! Without having even read the rest of the text yet...

It’s not (necessarily) about worst-case thinking. It’s more about making sure that you assign realistic probabilities to cases, taking into account the knowledge that an intelligent agent may turn up and intentionally mess with things to make otherwise low probabilities higher.

One application of that is that you have to be exhaustive in enumerating all the possible cases, and can’t neglect any just because they seem weird or haven’t happened in the past. That does indeed often turn into identifying inobvious cases that may indeed be worst or approximately worst. But you are not required to assume the worst case in security, only to compensate for distortions of your determination of how likely that case actually is.

On edit: The next phase, of course, is to compensate for your likely failure to have identified all the cases to begin with...