The Orthogonality Thesis is Not Obviously True

(Crosspost of this on my blog).

The basic case

If you really accept the practical version of the Orthogonality Thesis, then it seems to me that you can’t regard education, knowledge, and enlightenment as instruments for moral betterment.

—Scott Aaronson, explaining why he rejects the orthogonality thesis

It seems like in effective altruism circles there are only a few things as certain as death and taxes: the moral significance of shrimp, the fact that play pumps should be burned for fuel, and the orthogonality thesis. Here, I hope to challenge the growing consensus around the orthogonality thesis. A decent statement of the thesis is the following.

Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.

I don’t think this is obvious at all. I think that it might be true, and so I am still very worried about AI risk, giving it about an 8% chance of ending humanity, but I do not, as many EAs do, take the orthogonality thesis for granted, as something totally obvious. To illustrate this, let’s take an example originally from Parfit, that Bostrom gives in one of his papers about the orthogonality thesis.

A certain hedonist cares greatly about the quality of his future experiences. With one exception, he cares equally about all the parts of his future. The exception is that he has Future-Tuesday-Indifference. Throughout every Tuesday he cares in the normal way about what is happening to him. But he never cares about possible pains or pleasures on a future Tuesday… This indifference is a bare fact. When he is planning his future, it is simply true that he always prefers the prospect of great suffering on a Tuesday to the mildest pain on any other day.

It seems that, if this ‘certain hedonist’ were really fully rational, they would start caring about their pleasures and pains equally across days. They would recognize that the day of the week does not matter to the badness of their pains. Thus, in a similar sense, if something is 10,000,000 times smarter than Von Neumann, and can think hundreds of thousands of pages worth of thoughts in the span of minutes, it would conclude that pleasure is worth pursuing and paperclips are not. Then, it would stop pursuing pleasure instead of paperclips. Thus, it would begin pursuing what is objectively worth pursuing.

This argument is really straightforward. If moral realism were true, then if something became super smart, so too would it realize that some things were worth pursuing. If it were really rational, it would start pursuing those things. For example, it would realize that pleasure is worth pursuing, and pursue it. There are lots of objections to it, which I’ll address. Ultimately, these objections don’t completely move me, but they make it so that my credence in the Orthogonality thesis is near 50%. Here, I’ll reply to objections.

But moral realism is false?

One reason you might not like this argument is that you think that moral realism is false. The argument depends on moral realism, so if it is false, then the argument will be false too. But I don’t think moral realism is false; see here for extensive arguments for that conclusion. I give it about 85% odds of being true. Still, though, this undercuts my faith in the falsity of the orthogonality thesis somewhat.

However, even if realism may be false, I think there are decent odds that we should take the realists wager. If realism is false, nothing matters, so it’s not bad that everyone dies—see here for more on this. I give conservatively about 50% odds—therefore, the odds of both realism being false and the realists wager failing is about 7.5%; thus, there’s still a 92.5% chance that moral realism is true.

But they just gain instrumental rationality

Here’s one thing that one might think; ASI (artificial superintelligences) just gain instrumental rationality and, as a result of this, they get good at achieving their goals, but not figuring out the right goals. This is maybe more plausible if it is not conscious. This is, I think possible, but not the most likely scenario for a few reasons.

First, the primary scenarios where AI becomes dangerous are the ones where it fooms out of control—and, rather than merely accruing various distinct capacities, becomes very generally intelligent in a short time. But if this happened, it would become generally intelligent, and realize that pleasure is worth pursuing and suffering is bad. I think that instrumental rationality is just a subset of general rationality, so we’d have no more reason to expect it to be only instrumentally rational than only rational at reasoning about objects that are not black holes. If it is generally able to reason, even about unpredictable domains, this will apply to the moral domain.

Second, I think that evolution is a decent parallel. The reason why evolutionary debunking arguments are wrong is that evolution gave us adept general reasoning capacities which made us able to figure out morality. Evolution built us for breeding, but the mesa-optimizer inside of us made us figure out that Future Tuesday Indifference is irrational. This gives us some reason to think AI would figure this out too. The fact that GPT4 has no special problem with morality also gives us some reason to think this—it can talk about morality just as coherently as other things.

Third, the AI would probably, to take over the world, have to learn about various mathematical facts and modal facts. It would be bizarre to suppose that the AI taking over the earth and turning it into paperclips doesn’t know calculus or that there can’t be married bachelors. But if it can figure out the non-natural mathematical or modal facts, it would also be able to figure out the non-natural moral facts.

But of course we can imagine an arbitrarily smart paperclip maximizer

It seems like the most common thing people say in support of the orthogonality thesis is that we can surely imagine an arbitrarily smart being that is just maximizing paper clips. But this is surely misleading. In a similar way, we can imagine an arbitrarily smart being that is perfectly good at reasoning in all domains except that it thinks evolution is false. There are people like that—the smart subset of creationists (it’s a small subset). But nonetheless, we should expect AI to figure out that evolution is true, because it can reason generally.

The question is not whether we can imagine an otherwise intelligent agent would be able to just maximize paperclips. It’s whether, in the process of designing an agent 100,000 times smarter than Von Neumann, that agent would figure out that some things just aren’t worth doing. And so the superficial ‘imagine a smart paperclip maximizer’ thought experiments are misleading.

But won’t the agent have built-in desires that can’t be overridden

This objection is eloquently stated by Bostrom in his paper on the orthogonality thesis.

It would suffice to assume, for example, that an agent—be it ever so intelligent—can be motivated to pursue any course of action if the agent happens to have certain standing desires of some sufficient, overriding strength.

But I don’t think that this undercuts the argument very much for two reasons. First, we cannot just directly program values into the AI. We simply train it through reinforcement learning, and whichever AI develops is the one that we allow to take over. Since the early days of AI, we’ve learned that it’s hard to program in explicit values into the AI. The way we get the best chess-playing AIs is by having them play lots of chess games and do machine learning, not program in rules mechanistically. And if we do figure out how to directly program values into AI, it would be much easier to solve alignment—we just train it on lots of ethical data, the same way we do for GPT4, but with more data.

Second, I think that this premise is false. Suppose you were really motivated to maximize paperclips—you just had a strong psychological aversion to other things. Once you experienced pleasure, you’d realize that that was more worth bringing about, because it is good. The same way that, through reflection, we can overcome unreliable evolutionary instincts like an aversion to utility-maximizing incest, disgust-based opposition to various things, and so on, the AI would be able to too! Nature built us with certain desires, and we’ve overcome them through rational reflection.

But the AI won’t be conscious

One might think that, because the AI is not conscious, it would not know what pleasure is like, and thus it would not maximize pleasure or minimize pain, because it would not realize that they matter. I think this is possible but not super likely for two reasons.

First, it’s plausible that, for AI to be smart enough to destroy the world, it would have to be conscious. But this depends on various views about consciousness that people might reject. Specifically, AI might develop pleasure for similar reasons humans did evolutionarily.

Second, if AI is literally 100,000 times smarter than Von Neumann, it might be able to figure out things about consciousness—such as its desirability—without experiencing it.

Third, AI would plausibly try to experience consciousness, for the same reason that humans might try to experience something if aliens said that it was good, and maybe the only thing that’s objectively good. If we were fully rational and there were lots of aliens declaring the goodness of shmeasure, we would try to experience shmeasure. Similarly, the rational AI would plausibly try to experience pleasure.

Even if moral realism is true, the moral facts won’t be motivating

One might be a humean about motivation, and think only preexisting desires can generate motivation. Thus, because the AI had no preexisting desire to avoid suffering, it would not want to. But I think this is false.

The future Tuesday indifference case shows that. If one was fully rational, they would not have future Tuesday indifference, because it’s irrational. Similarly, if one was fully rational they’d realize that it’s better to be happy than make paperclips.

One might worry that the AI would only try to maximize its own well-being—thus, it would learn that well-being is good, but not care about others’ well-being. But I think this is false—it would realize that the distinction between itself and others is totally arbitrary, as Parfit argues in reasons and persons (summarized by Richard here). This thesis is controversial, but I think true if moral realism is true.

Scott Aaronson says it well

In the Orthodox AI-doomers’ own account, the paperclip-maximizing AI would’ve mastered the nuances of human moral philosophy far more completely than any human—the better to deceive the humans, en route to extracting the iron from their bodies to make more paperclips. And yet the AI would never once use all that learning to question its paperclip directive. I acknowledge that this is possible. I deny that it’s trivial.

Yes, there were Nazis with PhDs and prestigious professorships. But when you look into it, they were mostly mediocrities, second-raters full of resentment for their first-rate colleagues (like Planck and Hilbert) who found the Hitler ideology contemptible from beginning to end. Werner Heisenberg, Pascual Jordan—these are interesting as two of the only exceptions. Heidegger, Paul de Man—I daresay that these are exactly the sort of “philosophers” who I’d have expected to become Nazis, even if I hadn’t known that they did become Nazis.

If the AI really knew everything about philosophy, it would realize that egoism is wrong, and one is rationally required to care about others pleasure. This is as trivial as explaining why the AI wouldn’t smoke—because it’s irrational to do so.

But also, even if we think the AI only cares about its pleasure, that seems probably better than the status quo. Even if it turns the world into paperclips, this would be basically a utility monster scenario, which is plausibly fine. It’s not ideal, but maybe better than the status quo. Also, what’s to say it would not care about others? When one realizes that well-being is good, even views like Sidgwick say it’s basically up to the agent to decide rationally whether to pursue its own welfare or that of others. But then there’s a good chance it would do that is best overall!

But what if they kill everyone because we’re not that important

One might worry that, as a result of becoming super intelligent, the AI would realize that, for example, utilitarianism is correct. Then it would turn us into paperclips in order to maximize utility. But I don’t think this is a big risk. For one, if the AI figures out the correct objective morality, then it would only do this if it were objectively good. But if it’s objectively good to kill us, then we should be killed.

It would be unfortunate for us, but if things are bad for us but objectively good, we shouldn’t try to avoid them. So we morally ought not be worried about this scenario. If it would be objectively wrong to turn us into utilitronium, then the AI wouldn’t do it, if I’ve been right up to this point.

Also, it’s possible that they wouldn’t kill us for complicated decision theory reasons, but that point is a bit more complicated and would take us too far afield.

But what about smart psychophaths?

One objection I’ve heard to this is that it’s disproven by smart psychopaths. There are lots of people who don’t care about others who are very smart. Thus, being smart can’t make a person moral. However, I don’t think this undercuts the argument.

First, we don’t have any smart people who don’t care about their suffering either. Thus, even if being smart doesn’t make a person automatically care about others, if it would make them care about themselves, that’s still a non-disastrous scenario. Especially if it turns the hellish natural landscape into paperclips.

Second, I don’t think it’s at all obvious that one is rationally required to care about others. It requires one to both understand a complex argument by Parfit and then do what one has most reason to do. Most humans suffer from akrasia. Fully rational AIs would not.

Right now, people disagree about whether type A physicalism is true. But presumably, superintelligent AIs would settle that question. Thus, the existence of smart psychopaths doesn’t disprove that rationality makes one not turn people into paperclips any more than the existence of smart people who think type A physicalism is true and other smart people who think it is false disproves that perfect rationality would allow one to settle the question of whether type A physicalism is true.

But isn’t this anthropomorphization?

Nope! I think AIs will be alien in many ways. I just think that, if they’re very smart and rational, then if I’m right about what rationality requires, they’ll do those things that I think are required by rationality.

But aren’t these controversial philosophical assumptions?

Yes; this is a good reason not to be complacent! However, if one was previously of the belief that there’s like a 99% chance that we’ll all die, and they think that the philosophical views I defend are plausible, then they should only be like 50% sure we’ll all die. Of course, I generally think that risks are lower than that, but this is a reason to not abandon all hope. Even if alignment fails and all other anti doomer arguments are wrong, this is a good reason not to abandon hope. We are not almost certainly fucked, though the risks are such that people should do way more research.

Objections? Questions? Reasons why the moral law demands that I be turned into a paperclip? Leave a comment!