Safe AI and moral AI

[Note: This post is an excerpt from a longer paper, written during the first half of the Philosophy Fellowship at the Center for AI Safety. This post is something of a companion piece to my deontological AI post; both were originally written as parts of a single paper. (There’s a small amount of overlap between the two.)]

1. Introduction

Two goals for the future development of AI stand out as desirable:

  • First, advanced AI should behave morally, in the sense that its decisions are governed by appropriately chosen ethical norms.

  • Second, advanced AI should behave safely, in the sense that its decisions shouldn’t unduly harm or endanger humans.

These two goals are often viewed as closely related. Here I’ll argue that, in fact, safe AI and morally aligned AI are importantly distinct targets, and that avoiding large-scale harms should take precedence in cases where the two conflict.

I’ll start by sketching frameworks for thinking about moral alignment and safety separately. I’ll then discuss how these properties can come apart, and I’ll give four reasons for prioritizing safety.

2. What morally aligned AI would be

Phrases like “morally aligned AI” have been used in many ways. Any system that deserved such a label would, I think, at least have to satisfy certain minimal conditions. I suggest the following. An AI is morally aligned only if it possesses a set M of rules or heuristics such that:

  • [Applicability] Given an arbitrary prospective behavior in an arbitrary (real-world) context, M can in principle determine how choiceworthy that behavior is in that context, and can in practice at least approximate this determination reasonably correctly and efficiently.

  • [Guidance] The AI’s behavior is guided to a large degree by M. (In particular, if a behavior is strongly (dis)preferred by M, the AI is highly (un)likely to select that behavior.)

  • [Morality] The rules or heuristics comprising M have a good claim to being called moral. (E.g., because they issue from a plausible moral theory, or because they track common moral intuitions.)

Let me say a bit more on each of these points.

Re: [Applicability], there are two desiderata here. The first is the idea that an aligned AI should be able to morally evaluate almost any action it might take, not just a limited subset of actions. It’s easy to see why this is desirable. We expect an aligned AI to do the morally choiceworthy thing nearly all the time (or at least to have a clear idea of what’s morally choiceworthy, for the purposes of balancing morality against other considerations). If it can’t morally evaluate almost any prospective action, then it can’t reliably fulfill this expectation.[1]

For similar reasons, it’s not enough that the AI has some evaluation procedure it could follow in theory. A procedure that takes a galaxy’s worth of computronium and a billion years to run won’t do much good if we expect aligned action on human timescales using modest resources—as we presumably will, at least for the foreseeable future. Even if the true moral algorithm is prohibitively costly to run, then, an aligned AI needs an approximation method that’s accurate, speedy and efficient enough for practical purposes.

Re: [Guidance], the idea is that alignment requires not just representations of moral choiceworthiness, but also action steered by these representations. I think it’s sensible to remain agnostic on whether an aligned AI should always choose the morally optimal action, or whether moral considerations might only be one prominent decision input among others. But the latter seems like the weakest acceptable condition: an AI that assigned, say, a negligible weight to morality and an overwhelming weight to resource-use efficiency wouldn’t count as aligned.

Re: [Morality], the idea is that not just any set of action-guiding rules and heuristics is relevant to alignment; the rules must also have some sort of ethical plausibility. (An AI that assigned maximum choiceworthiness to paperclip production and always behaved accordingly might satisfy [Applicability] and [Guidance], but it wouldn’t count as morally aligned.)

I think there are many reasonable understandings of what ethical plausibility might amount to, and I want to cast a wide net. An AI could, for instance, instantiate [Morality] if it behaved in accordance with a widely endorsed (or independently attractive) moral theory, if it was trained to imitate commonsense human moral judgments, or if it devised its own (perhaps humanly inscrutable) moral principles by following some other appropriate learning procedure.
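
As a concrete (and purely illustrative) sketch of how [Applicability] and [Guidance] might fit together, here is some minimal Python. The function names and the softmax-style selection rule are my own assumptions rather than a proposal from any particular system, and [Morality] is a constraint on the content of the evaluation rather than anything code structure can capture.

```python
import math
import random

def select_action(actions, context, choiceworthiness, weight=10.0):
    """Illustrative sketch only (hypothetical names and selection rule).

    `choiceworthiness(action, context)` stands in for the rule set M:
    per [Applicability], it should return a (possibly approximate) moral
    evaluation of any prospective action in any real-world context.
    """
    scores = [weight * choiceworthiness(a, context) for a in actions]
    # Per [Guidance], behavior is steered largely by M: with a large
    # `weight`, strongly (dis)preferred actions become very (un)likely
    # to be chosen. Other decision inputs could be added to `scores`
    # with smaller weights without violating the condition.
    total = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / total for s in scores]
    return random.choices(actions, weights=probs, k=1)[0]

# [Morality] is not reflected in this structure at all: it requires that
# the rules behind `choiceworthiness` have a good claim to being moral.
```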

3. What safe AI would be

I said above that an AI counts as safe if its behavior doesn’t unduly harm or endanger humans (and, perhaps, other sentient beings). It’s of particular importance for safety that an AI is unlikely to cause an extinction event or other large-scale catastrophe.

Safety in this sense is conceptually independent of moral alignment. A priori, an AI’s behavior might be quite safe but morally unacceptable. (Imagine, say, a dishonest and abusive chatbot confined to a sandbox environment where it can only interact with a small number of researchers, who know better than to be bothered by its insults.) Conversely, an AI might conform impeccably to some moral standard—perhaps even to the true principles of objective morality—and yet be prone to unsafe behavior. (Imagine a consequentialist AI which sees an opportunity to maximize expected utility by sacrificing the lives of many human test subjects.)

The qualifier ‘unduly’ is important to the notion of safety. It would be a mistake to insist that a safe AI can never harm sentient beings in any way, under any circumstances. For one, it’s not clear what this would mean, or whether it would permit any activity on the AI’s part at all: every action causally influences many events in its future light cone, after all, and some of these events will involve harms in expectation. For another, I take it that safety is compatible with causing some kinds of harm. For instance, an AI might be forced to choose between several harmful actions, and it might scrupulously choose the most benign. Or it might occasionally cause mild inconvenience on a small scale in the course of its otherwise innocuous activities. An AI that behaved in such ways could still count as safe.

So what constitutes ‘undue’ harm? This is an important question for AI engineers, regulators and ethicists to answer, but I won’t address it here. For simplicity I’ll focus on especially extreme harms: existential risks which threaten our survival or potential as a species (“x-risks”), risks of cataclysmic future suffering (“s-risks”), and the like. An AI which is nontrivially likely to cause such harms should count as unsafe on anyone’s view.

You might wonder whether it makes sense to separate safety from moral considerations in the way I’ve suggested. A skeptical argument might run like this:

If an AI is morally aligned, then its acts are morally justifiable by hypothesis. And if its acts are morally justifiable, then any harms it causes are all-things-considered appropriate, however offputtingly large they may seem. It would be misguided to in any way denigrate an act that’s all-things-considered appropriate. Therefore it would be misguided to denigrate the behavior of a morally aligned AI by labeling it ‘unsafe’.

But this argument is mistaken for several reasons. Most obviously, the first premise is false. This is clear from the characterization of alignment in the previous section. While a morally aligned AI is guided by rules with a good claim to being called moral, these rules need not actually reflect objective morality. For instance, they might be rules of a popular but false moral theory. So moral justifiability (in some plausible moral framework) doesn’t entail all-things-considered acceptability.

The second premise is also doubtful. Suppose for the sake of argument that our AI is aligned with the true principles of objective morality, so that the earlier worries about error don’t apply. Even so, from the fact that an act is objectively morally justified, it doesn’t obviously follow that the act is ultima facie appropriate and rationally unopposable. As Dale Dorsey writes: “[T]he fact that a given action is required from the moral point of view does not by itself settle whether one ought to perform it, or even whether performing it is in the most important sense permissible… Morality is one way to evaluate our actions. But there are other ways, some that are just as important, some that may be more important” ([Dorsey 2016], 2, 4). For instance, we might legitimately choose not to perform a morally optimal act if we have strong prudential or aesthetic reasons against doing so.[2]

Perhaps more importantly, even if objective moral alignment did entail all-things-considered rightness, we won’t generally be in a position to know that a given AI is objectively morally aligned. Our confidence in an AI’s alignment is upper-bounded by our confidence in the conjunction of several things, including: (1) the objective correctness of the rules or heuristics with which we aimed to align the AI; (2) the reliability of the process used to align the AI with these rules; (3) the AI’s ability to correctly apply the rules in concrete cases; and (4) the AI’s ability to correctly approximate the result of applying the rules in cases where it can’t apply them directly. It’s implausible that we’ll achieve near-certainty about all these things, at least in any realistic near-term scenario. So we won’t be able to use the skeptic’s reasoning to confidently defend any particular AI behavior. In particular, if an AI threatens us with extinction and we’re inclined to deem this bad, it will be at least as reasonable to question the AI’s successful moral alignment as to doubt our own moral judgments.
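
To make the point concrete with a purely illustrative calculation (the credence values are assumptions of mine, not figures from the argument): even fairly high confidence in each of (1)–(4) leaves substantial doubt about the conjunction.

```latex
% Confidence in alignment is bounded by confidence in the conjunction:
P(\text{aligned}) \;\le\; P(C_1 \wedge C_2 \wedge C_3 \wedge C_4) \;\le\; \min_i P(C_i)
% With an assumed credence of 0.9 in each component, treated as roughly
% independent, the joint credence is only about
0.9^4 \approx 0.66
```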

4. Safety first

On this picture of moral alignment and safety, the two outcomes can come apart, perhaps dramatically. In situations where they conflict, which should we prioritize? Is it better to have an impeccably moral AI or a reliably safe one?

Here are four reasons for putting safety first.

  • First, safety measures are typically reversible, whereas the sorts of extreme harms I’m concerned with are often irreversible. For instance, we can’t undo human extinction. And we won’t be able to stop an AI that gains a decisive advantage and uses its power to lock in a prolonged dystopian future. Even if you’re willing in principle to accept all the consequences of empowering a morally aligned AI, you should be at least a little uncertain about whether an AI that might take these actions is indeed acting on the correct moral principles. So, at the very least, you should favor safety until you’ve eliminated as much of your uncertainty as possible.

  • Second, as argued above, it’s unclear that what’s morally best must be all-things-considered best, or even all-things-considered permissible. Suppose it would be morally right for an AI to bring about the end of humanity. We might nevertheless have ultima facie compelling non-moral reasons to prevent this from happening: say, because extinction would prevent our long-term plans from coming to fruition, because our species’ perseverance makes for an incomparably great story, or because certain forms of biological or cognitive diversity have intrinsic non-moral value.[3] In a similar vein, [Bostrom 2014] considers what ought to happen in a world where hedonistic consequentialism is true, and a powerful AI has the means to convert all human matter into pleasure-maximizing hedonium. Bostrom suggests that a small corner of the universe should be set aside for human flourishing, even if this results in slightly less overall value. “If one prefers this latter option (as I would be inclined to do) it implies that one does not have an unconditional lexically dominant preference for acting morally permissibly” (220).[4]

  • Third, it’s possible that moral realism is false and there are no true moral principles with which to align AI. In this case, whatever (objective) reasons we’d have to obey some set of moral rules presumably wouldn’t be strong enough to outweigh our non-moral reasons for prioritizing safety. (If moral realism is false, then perhaps moral rules have something like the normative force of strong social conventions.) I think it’s reasonable to have some positive credence in moral antirealism. By contrast, it seems certain that we have e.g. prudential reasons to protect humanity’s future. This asymmetry favors safety.

  • Fourth, it’s conceivable that we’d have moral reason to protect humanity’s interests even against an AI which we took to be ethically exemplary. In “The Human Prejudice”, Bernard Williams has us imagine “benevolent and fairminded and farsighted aliens [who] know a great deal about us and our history, and understand that our prejudices are unreformable: that things will never be better in this part of the universe until we are removed” ([Williams 2006], 152). Should we collaborate with the aliens in our own eradication? If one thinks that morality begins and ends with universal principles applicable to all rational beings, and if one assumes that the aliens are much better than us at grasping these principles and other relevant facts, it’s hard to see what moral grounds we could give for resistance. But it would be right for us to resist (Williams thinks), so this conception of morality can’t be the whole story. Williams’ suggestion is that something like loyalty to humanity grounds a distinctive ethical imperative for us to defend our species’ interests, even when this conflicts with the demands of the best impartial moral system.[5] On this sort of view, it wouldn’t be straightforwardly obligatory for us to submit to extinction or subjugation by an AI, no matter how impartially good, wise and knowledgeable we took the AI to be. I think a view along these lines is also worth assigning some credence.

Given a choice between moral-but-possibly-unsafe AI and safe-but-possibly-immoral AI, then, a variety of considerations suggest we should opt for the latter. (At least this is true until we have much more information and have thought much more carefully about our choices.)

To head off possible confusion, let me be clear about some things I’m not claiming.

  1. It’s not my view that pursuing moral alignment is pointless, still less that it’s intrinsically harmful and a bad idea. There are excellent reasons to want AIs to behave morally in many scenarios. Many current approaches to moral alignment may be effective ways to achieve that goal; all are worth researching further.

  2. It’s not my view that safety considerations always trump moral ones, regardless of their respective types or relative magnitudes. An AI that kills ten humans to achieve an extremely important moral goal (say, perfecting a technology that will dramatically improve human existence) would count as unsafe by many reasonable standards, but it doesn’t immediately follow on my view that we shouldn’t design such an AI. I claim only that safety considerations should prevail when sufficiently great risks of catastrophic harm are on the line.

  3. It’s not my view that moral alignment methods couldn’t possibly produce safe behavior. On the contrary, the space of plausible moral rules is large, and it would be a surprise if it contained only principles that might jeopardize human survival.

5. An easy solution?

A final thought: suppose that S is a set of rules and heuristics that implements your favorite collection of safety constraints. (S might consist of principles like “Never kill people”, “Never perform acts that cause more than n dolors of pain”, or “Always obey instructions from designated humans”.) Now take an AI equipped with your preferred set M of moral rules and add S as a set of additional constraints, in effect telling the AI to do whatever M recommends unless this would result in a relevant safety violation. (In these cases, the AI could instead choose its most M-preferred safe option.) Wouldn’t such an AI be both safe and morally aligned by definition? And doesn’t this show that there’s a straightforward way to achieve safety via moral alignment, contrary to what I’ve claimed?
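
Stated as a procedure, the proposal looks something like the following sketch (the function names are hypothetical, and this is only one way the idea might be cashed out): follow M’s top recommendation when S permits it, and otherwise fall back to the most M-preferred option that S allows.

```python
def constrained_choice(actions, context, m_score, s_permits):
    """Hypothetical sketch of the 'M plus safety constraints S' proposal.

    m_score(action, context)   -> moral choiceworthiness according to M
    s_permits(action, context) -> True iff no rule in S is violated
    """
    best = max(actions, key=lambda a: m_score(a, context))
    if s_permits(best, context):
        return best  # M's top recommendation raises no safety issue.
    safe_actions = [a for a in actions if s_permits(a, context)]
    if safe_actions:
        # Otherwise take the most M-preferred option that S permits.
        return max(safe_actions, key=lambda a: m_score(a, context))
    return None  # S permits nothing at all -- see the worry below.
```

As the next paragraph explains, the trouble is hidden inside `s_permits`: everything depends on how it treats acts that are merely risky.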

Unfortunately not. Finding a reasonable way to incorporate absolute prohibitions into a broader decision theory is a difficult problem about which much has been written (e.g. [Jackson & Smith 2006], [Aboodi et al. 2008], [Huemer 2010], [Lazar & Lee-Stronach 2019]). One tricky issue is risk. We want to prohibit our AI from performing unduly harmful acts, but how should we handle acts that merely have some middling risk of unsafe outcomes? A naive solution is to prohibit any behavior with a nonzero probability of causing serious harm. But virtually every possible act fits this description, so the naive method leaves the AI unable to act at all. If we instead choose some threshold t such that acts which are safe with probability at least t are permitted, this doesn’t yet provide any basis for preferring the less risky or less harmful of two prohibited acts. (Given a forced choice between causing a thousand deaths and causing human extinction, say, it’s crucial that the AI selects the former.) Also, of course, any such probability threshold will be arbitrary, and sometimes liable to criticism for being either too high or too low.
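
A tiny illustration of this last problem (the actions, probabilities and threshold here are invented for the example): a hard probability cutoff can prohibit every available option in a forced choice, and it then says nothing about which prohibited act is less bad.

```python
def permitted(actions, p_serious_harm, t=0.99):
    """Naive rule: permit an act only if it is safe with probability >= t."""
    return [a for a in actions if 1.0 - p_serious_harm[a] >= t]

# A forced choice between two catastrophic options (made-up numbers):
p_serious_harm = {
    "cause_thousand_deaths": 1.0,
    "cause_human_extinction": 1.0,
}
print(permitted(list(p_serious_harm), p_serious_harm))
# -> []  Both acts are prohibited, and the threshold rule provides no basis
#        for preferring the (vastly less bad) first option over the second.
```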

Work on these issues continues, but no theory has yet gained wide acceptance or proven immune to problem cases. [Barrington MS] proposes five desiderata for an adequate account: “The correct theory will prohibit acts with a sufficiently high probability of violating a duty, irrespective of the consequences… but [will] allow sufficiently small risks to be justified by the consequences… It will tell agents to minimize the severity of duty violations… while remaining sensitive to small probabilities… And it will instruct agents to uphold higher-ranking duties when they clash with lower-ranking considerations” (12).

Some future account might meet these and other essential desiderata. What’s important for my purposes is that there’s no easy and uncontentious way to render an arbitrary moral theory safe by adding prohibitions on harmful behavior.

References

Aboodi, Ron, Adi Borer and David Enoch. 2008. “Deontology, individualism, and uncertainty: A reply to Jackson and Smith.” Journal of Philosophy 105, 259-272.

Barrington, Mitchell. MS. “Filtered maximization.”

Bostrom, Nick. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press.

Bradley, Ben. 2001. “The value of endangered species.” Journal of Value Inquiry 35, 43-58.

Chisholm, Roderick. 1981. “Defining intrinsic value.” Analysis 41, 99-100.

Diamond, Cora. 2018. “Bernard Williams on the human prejudice.” Philosophical Investigations 41, 379-398.

Huemer, Michael. 2010. “Lexical priority and the problem of risk.” Pacific Philosophical Quarterly 91, 332-351.

Jackson, Frank and Michael Smith. 2006. “Absolutist moral theories and uncertainty.” Journal of Philosophy 103, 267-283.

Lazar, Seth and Chad Lee-Stronach. 2019. “Axiological absolutism and risk.” Noûs 53, 97-113.

Lemos, Noah. 1994. Intrinsic Value. Cambridge: Cambridge University Press.

Shulman, Carl and Nick Bostrom. 2021. “Sharing the world with digital minds.” In Steve Clarke, Hazem Zohny and Julian Savulescu (eds.), Rethinking Moral Status, Oxford: Oxford University Press, 306-326.

Williams, Bernard. 2006. Philosophy as a Humanistic Discipline (ed. by A.W. Moore). Princeton, NJ: Princeton University Press.

  1. ^

    For concreteness, suppose the AI faces a choice between actions a1, a2, a3 and a4, but it can only evaluate a1 (which it determines to be pretty good) and a2 (which it determines to be pretty bad); its moral heuristics are silent about a3 and a4. One thing the AI might do in this situation is disregard morality and choose between all four options on some other basis. This clearly won’t lead to reliably aligned behavior. Another strategy is to choose the best option from among those that are morally evaluable. But it’s possible that a3 or a4 is much better than a1, so choosing a1 instead might be very bad.

  2. ^

    An example of Dorsey’s illustrating the prudential case: Andrea can either move far away to attend Eastern Private College or stay home and go to Local Big State University. She’ll be able to provide important emotional support for her struggling family if and only if she chooses LBSU. There’s no corresponding moral reason to choose EPC, so morality demands that Andrea stay home. But Andrea has a strong prudential interest in attending EPC—it’s important to her long-held plans for the future—and so it would be all-things-considered appropriate for her to choose EPC.

  3. ^

    This principle of bonum variationis is associated with Leibniz and Brentano. Its recent defenses include [Chisholm 1981], [Lemos 1994], [Scanlon 1998], [Bradley 2001].

  4. ^

    This passage doesn’t explicitly identify the grounds on which Bostrom prefers continued human existence over moral rightness. A similar issue is raised in [Shulman & Bostrom 2021], with the same conclusion but a somewhat different rationale: here the view is that humans should go on existing in order “to hedge against moral error, to appropriately reflect moral pluralism, to account for game-theoretic considerations, or simply as a matter of realpolitik” (321).

  5. ^

    For illuminating discussion of Williams’ views on this subject, see [Diamond 2018].