When is unaligned AI morally valuable?

Sup­pose that AI sys­tems built by hu­mans spread through­out the uni­verse and achieve their goals. I see two quite differ­ent rea­sons this out­come could be good:

  1. Those AI sys­tems are al­igned with hu­mans; their prefer­ences are our prefer­ences.

  2. Those AI sys­tems flour­ish on their own terms, and we are happy for them even though they have differ­ent prefer­ences.

I spend most of my time think­ing about op­tion #1. But I think op­tion #2 is a plau­si­ble plan B.

Un­der­stand­ing how happy we should be with an un­al­igned AI flour­ish­ing on its own terms, and es­pe­cially which un­al­igned AIs we should be happy about, seems like a very im­por­tant moral ques­tion.

I cur­rently feel very un­cer­tain about this ques­tion; if you forced me to guess, I’d es­ti­mate that op­tion #2 al­lows us to re­cover 25% of the ex­pected value that we lose by build­ing un­al­igned AI. But af­ter more think­ing, that num­ber could go down to 0% or up to >90%.


In this post I’ll say that an AI is a good successor if I believe that building such an AI and “handing it the keys” is a reasonable thing to do with the universe. Concretely, I’ll say an AI is a good successor if I’d prefer to give it control of the world rather than accept a gamble where we have a 10% chance of extinction and a 90% chance of building an aligned AI.

In this post I’ll think mostly about what hap­pens with the rest of the uni­verse, rather than what hap­pens to us here on Earth. I’m won­der­ing whether we would ap­pre­ci­ate what our suc­ces­sors do with all of the other stars and galax­ies — will we be happy with how they use the uni­verse’s re­sources?

Note that a com­pe­tent al­igned AI is a good suc­ces­sor, be­cause “hand­ing it the keys” doesn’t ac­tu­ally amount to giv­ing up any con­trol over the uni­verse. In this post I’m won­der­ing which un­al­igned AIs are good suc­ces­sors.
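The definition above can be read as a simple expected-value comparison. The following is only a sketch under strong assumptions: that preferences over outcomes can be summarized by one number each, on a scale where an aligned AI is worth 1 and extinction is worth 0. The function name, the normalization, and the example values are illustrative and do not come from the definition itself.

```python
def is_good_successor(successor_value, aligned_value=1.0,
                      extinction_value=0.0, p_extinction=0.1):
    """Return True if handing over control beats the reference gamble.

    The gamble is the one in the definition: a p_extinction chance of
    extinction, otherwise an aligned AI. All values are hypothetical
    utilities on an assumed common scale.
    """
    gamble_value = (p_extinction * extinction_value
                    + (1.0 - p_extinction) * aligned_value)
    return successor_value > gamble_value


if __name__ == "__main__":
    for value in (0.5, 0.95):
        print(value, is_good_successor(value))
```

Under this assumed normalization the gamble is worth 0.9, so a successor counts as good only if its value exceeds 0.9 of what an aligned AI would deliver.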

Pre­face: in fa­vor of alignment

I be­lieve that build­ing an al­igned AI is by far the most likely way to achieve a good out­come. An al­igned AI al­lows us to con­tinue re­fin­ing our own views about what kind of life we want to ex­ist and what kind of world we want to cre­ate — there is no in­di­ca­tion that we are go­ing to have satis­fac­tory an­swers to these ques­tions prior to the time when we build AI.

I don’t think this is parochial. Once we un­der­stand what makes life worth liv­ing, we can fill the uni­verse with an as­tro­nom­i­cal di­ver­sity of awe­some ex­pe­riences. To the ex­tent that’s the right an­swer, it’s some­thing I ex­pect us to em­brace much more as we be­come wiser.

And I think that fur­ther re­flec­tion is a re­ally good idea. There is no law that the uni­verse tends to­wards uni­ver­sal love and good­ness, that greater in­tel­li­gence im­plies greater moral value. Good­ness is some­thing we have to work for. It might be that the AI we would have built any­way will be good, or it might not be, and it’s our re­spon­si­bil­ity to figure it out.

I am a bit scared of this topic be­cause it seems to give peo­ple a li­cense to hope for the best with­out any real jus­tifi­ca­tion. Be­cause we only get to build AI once, re­al­ity isn’t go­ing to have an op­por­tu­nity to in­ter­vene on peo­ple’s happy hopes.

Clar­ifi­ca­tion: Be­ing good vs. want­ing good

We should dis­t­in­guish two prop­er­ties an AI might have:

  • Hav­ing prefer­ences whose satis­fac­tion we re­gard as morally de­sir­able.

  • Be­ing a moral pa­tient, e.g. be­ing able to suffer in a morally rele­vant way.

Th­ese are not the same. They may be re­lated, but they are re­lated in an ex­tremely com­plex and sub­tle way. From the per­spec­tive of the long-run fu­ture, we mostly care about the first prop­erty.

As com­pas­sion­ate peo­ple, we don’t want to mis­treat a con­scious AI. I’m wor­ried that com­pas­sion­ate peo­ple will con­fuse the two is­sues — in ar­gu­ing en­thu­si­as­ti­cally for the claim “we should care about the welfare of AI” they will also im­plic­itly ar­gue for the claim “we should be happy with what­ever the AI chooses to do.” Those aren’t the same.

It’s also worth clarifying that both sides of this discussion can want the universe to be filled with morally valuable AI eventually; this isn’t a matter of carbon chauvinists vs. AI sympathizers. The question is just about how we choose what kind of AI we build: do we hand things off to whatever kind of AI we can build today, or do we retain the option to reflect?

Do all AIs de­serve our sym­pa­thy?

In­tu­itions and an analogy

Many peo­ple have a strong in­tu­ition that we should be happy for our AI de­scen­dants, what­ever they choose to do. They grant the pos­si­bil­ity of patholog­i­cal prefer­ences like pa­per­clip-max­i­miza­tion, and agree that turn­ing over the uni­verse to a pa­per­clip-max­i­mizer would be a prob­lem, but don’t be­lieve it’s re­al­is­tic for an AI to have such un­in­ter­est­ing prefer­ences.

I dis­agree. I think this in­tu­ition comes from analo­giz­ing AI to the chil­dren we raise, but that it would be just as ac­cu­rate to com­pare AI to the cor­po­ra­tions we cre­ate. Op­ti­mists imag­ine our au­to­mated chil­dren spread­ing through­out the uni­verse and do­ing their weird-AI-ana­log of art; but it’s just as re­al­is­tic to imag­ine au­to­mated Pep­siCo spread­ing through­out the uni­verse and do­ing its weird-AI-ana­log of max­i­miz­ing profit.

It might be the case that Pep­siCo max­i­miz­ing profit (or some in­scrutable lost-pur­pose ana­log of profit) is in­trin­si­cally morally valuable. But it’s cer­tainly not ob­vi­ous.

Or it might be the case that we would never pro­duce an AI like a cor­po­ra­tion in or­der to do use­ful work. But look­ing at the world around us to­day that’s cer­tainly not ob­vi­ous.

Nei­ther of those analo­gies is re­motely ac­cu­rate. Whether we should be happy about AI “flour­ish­ing” is a re­ally com­pli­cated ques­tion about AI and about moral­ity, and we can’t re­solve it with a one-line poli­ti­cal slo­gan or crude anal­ogy.

On risks of sympathy

I think that too much sympathy for AI is a real risk. This problem is going to be made particularly serious because we will (soon?) be able to make AI systems which are optimized to be sympathetic. If we are indiscriminately sympathetic towards whatever kind of AI is able to look sympathetic, then we can’t steer towards the kind of AI that actually deserves our sympathy. It’s very easy to imagine the world where we’ve built a PepsiCo-like AI, but one which is much better than humans at seeming human, and where people who suggest otherwise look like moral monsters.

I acknowledge that the reverse is also a risk: humans are entirely able to be terrible to creatures that do deserve our sympathy. I believe the solution to that problem is to actually think about the nature of the AI we build, and especially to behave compassionately in light of uncertainty about the suffering we might cause and whether or not it is morally relevant. The solution is not to take an indiscriminate pro-AI stand that hands the universe over to the automated PepsiCo.

Do any AIs de­serve our sym­pa­thy?

(Warn­ing: lots of weird stuff.)

In the AI al­ign­ment com­mu­nity, I of­ten en­counter the re­verse view: that no un­al­igned AI is a good suc­ces­sor.

In this sec­tion I’ll ar­gue that there are at least some un­al­igned AIs that would be good suc­ces­sors. If we ac­cept that there are any good suc­ces­sors, I think that there are prob­a­bly lots of good suc­ces­sors, and figur­ing out the bound­ary is an im­por­tant prob­lem.

(To repeat: I think we should try to avoid handing off the universe to any unaligned AI, even if we think it is probably good, because we’d prefer to retain the ability to think more about the decision and figure out what we really want. See the conclusion.)

Com­mon­sense moral­ity and the golden rule

I find the golden rule very com­pel­ling. This isn’t just be­cause of re­peated in­ter­ac­tion and game the­ory: I’m strongly in­clined to alle­vi­ate suffer­ing even if the benefi­cia­ries live in ab­ject poverty (or fac­tory farms) and have lit­tle to offer me in re­turn. I’m mo­ti­vated to help largely be­cause that’s what I would have wanted them to do if our situ­a­tions were re­versed.

Per­son­ally, I have similar in­tu­itions about aliens (though I rarely have the op­por­tu­nity to help aliens). I’d be hes­i­tant about the peo­ple of Earth screw­ing over the peo­ple of Alpha Cen­tauri for many of the same rea­sons I’d be un­com­fortable with the peo­ple of one coun­try screw­ing over the peo­ple of an­other. While the situ­a­tion is quite con­fus­ing I feel like com­pas­sion for aliens is a plau­si­ble “com­mon­sense” po­si­tion.

If it is difficult to align AI, then our relationship with an unaligned AI may be similar to our relationship with aliens. In some sense we have all of the power, because we got here first. But if we try to leverage that power, by not building any unaligned AI, then we might run a significant risk of extinction or of building an AI that no one would be happy with. A “good cosmic citizen” might prefer to hand off control to an unaligned and utterly alien AI rather than gamble on the alternative.

If the situ­a­tion were to­tally sym­met­ri­cal — if we be­lieved the AI was from ex­actly the same dis­tri­bu­tion over pos­si­ble civ­i­liza­tions that we are from — then I would find this in­tu­itive ar­gu­ment ex­tremely com­pel­ling.

In re­al­ity, there are al­most cer­tainly differ­ences, so the situ­a­tion is very con­fus­ing.

A weirder ar­gu­ment with simulations

The last section gave a kind of common-sense argument for being nice to some aliens. The rest of this post is going to be pretty crazy.

Let’s con­sider a par­tic­u­lar (im­plau­si­ble) strat­egy for build­ing an AI:

  • Start with a simu­la­tion of Earth.

  • Keep wait­ing/​restart­ing un­til evolu­tion pro­duces hu­man-level in­tel­li­gence, civ­i­liza­tion, etc.

  • Once the civ­i­liza­tion is slightly be­low our stage of ma­tu­rity, show them the real world and hand them the keys.

  • (This only makes sense if the simu­lated civ­i­liza­tion is much more pow­er­ful than us, and faces lower ex­is­ten­tial risk. That seems likely to me. For ex­am­ple, the re­sult­ing AIs would likely think much faster than us, and have a much larger effec­tive pop­u­la­tion; they would be very ro­bust to ecolog­i­cal dis­aster, and would face a qual­i­ta­tively eas­ier ver­sion of the AI al­ign­ment prob­lem.)

Sup­pose that ev­ery civ­i­liza­tion fol­lowed this strat­egy. Then we’d sim­ply be do­ing a kind of in­ter­stel­lar shuffle, where each civ­i­liza­tion aban­dons their home and gets a new one in­side of some alien simu­la­tion. It seems much bet­ter for ev­ery­one to shuffle than to ac­cept a 10% chance of ex­tinc­tion.

In­cen­tiviz­ing cooperation

The ob­vi­ous prob­lem with this plan is that not ev­ery­one will fol­low it. So it’s not re­ally a shuffle: nice civ­i­liza­tions give up their planet, while mean civ­i­liza­tions keep their origi­nal planet and get a new one. So this strat­egy in­volves a net trans­fer of re­sources from nice peo­ple to mean peo­ple: some moral per­spec­tives would be OK with that, but many would not.

This ob­vi­ous prob­lem has an ob­vi­ous solu­tion: since you are simu­lat­ing the tar­get civ­i­liza­tion, you can run ex­ten­sive tests to see if they seem nice — i.e. if they are the kind of civ­i­liza­tion that is will­ing to give an alien simu­la­tion con­trol rather than risk ex­tinc­tion — and only let them take over if they are.

This guaran­tees that the nice civ­i­liza­tions shuffle around be­tween wor­lds, while the mean civ­i­liza­tions take their chances on their own, which seems great.
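The resource accounting behind this argument can be sketched in a few lines. This is a toy model, not something taken from the argument itself: it assumes every civilization starts with one planet, that nice civilizations hand theirs to a simulated civilization, and that an unscreened recipient is drawn from the same population mix. It ignores extinction risk entirely, and the function and parameter names are hypothetical.

```python
def planets_after_shuffle(n_civs, nice_fraction, screened):
    """Expected planets held by nice and mean civilizations.

    Toy assumptions: one planet per civilization to start, nice
    civilizations hand theirs over, mean civilizations keep theirs.
    """
    nice = n_civs * nice_fraction
    mean = n_civs - nice
    handed_over = nice  # one planet from each nice civilization
    # Unscreened recipients come from the whole population;
    # screened recipients are always nice.
    share_to_nice = 1.0 if screened else nice_fraction
    nice_total = handed_over * share_to_nice
    mean_total = mean + handed_over * (1.0 - share_to_nice)
    return nice_total, mean_total


if __name__ == "__main__":
    print(planets_after_shuffle(100, 0.5, screened=False))
    print(planets_after_shuffle(100, 0.5, screened=True))
```

In this sketch the unscreened shuffle moves planets from nice civilizations to mean ones, while the screened version leaves each group with as many planets as it started with.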

More caveats and details

This procedure might look really expensive: you need to simulate a whole civilization, nearly as large as your own civilization, with computers nearly as large as your computers. But in fact it doesn’t require literally simulating the civilization up until the moment when they are building AI; you could use cheaper mechanisms to try to guess whether they were going to be nice a little bit in advance, e.g. by simulating large numbers of individuals or groups making particularly relevant decisions. If you were simulating humans, you could imagine predicting what the modern world would do without ever actually running a population of >100,000.

If only 10% of in­tel­li­gent civ­i­liza­tions de­cide to ac­cept this trade, then run­ning the simu­la­tion is 10x as ex­pen­sive (since you need to try 10 times). Other than that, I think that the calcu­la­tion doesn’t ac­tu­ally de­pend very much on what frac­tion of civ­i­liza­tions take this kind of deal.
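The “10x” figure is the mean of a geometric distribution: if each simulated civilization independently accepts with probability p, the expected number of simulations until one accepts is 1/p. The sketch below checks this numerically. It assumes independent attempts with a fixed acceptance probability, which is a simplification, and the function names are hypothetical.

```python
import random


def expected_attempts(accept_probability):
    """Mean number of independent simulations until one accepts."""
    if not 0.0 < accept_probability <= 1.0:
        raise ValueError("accept_probability must be in (0, 1]")
    return 1.0 / accept_probability


def simulate_attempts(accept_probability, trials=100_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        attempts = 1
        # Keep drawing until a simulated civilization accepts.
        while rng.random() >= accept_probability:
            attempts += 1
        total += attempts
    return total / trials


if __name__ == "__main__":
    print(expected_attempts(0.1))
    print(simulate_attempts(0.1))
```

The closed form and the simulation should agree closely; how closely depends on the number of trials.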

Another problem is that people may prefer to continue existing in their own universe rather than in some weird alien simulation, so the “shuffle” may itself be a moral catastrophe that we should try to avoid. I’m pretty skeptical of this:

  • You could always later perform an acausal trade to “go home,” i.e. to swap back with the aliens who took over your civ­i­liza­tion (by simu­lat­ing each other and pass­ing con­trol back to the origi­nal civ­i­liza­tion if their simu­lated copy does like­wise).

  • In prac­tice the uni­verse is very big, and the part of our prefer­ences that cares about “home” seems eas­ily sa­tiable. There is no real need for the new res­i­dents of our world to kill us, and I think that we’d be perfectly happy to get just one galaxy while the new res­i­dents get ev­ery­thing else. (Given that we are get­ting a whole uni­verse worth of re­sources some­where else.)

Another prob­lem is that this is a hideously in­tractable way to make an AI. More on that two sec­tions from now.

Another problem is that this is completely insane. I don’t really have any defense; if you aren’t tolerant of insanity you should probably just turn back now.

De­ci­sion theory

The above ar­gu­ment about trade /​ swap­ping places makes sense from a UDT per­spec­tive. But I think a similar ar­gu­ment should be per­sua­sive even to a causal de­ci­sion the­o­rist.

Roughly speak­ing, you don’t have much rea­son to think that you are on the out­side, con­sid­er­ing whether to in­stan­ti­ate some aliens, rather than on the in­side, be­ing eval­u­ated for kind­ness. If you are on the out­side, in­stan­ti­at­ing aliens may be ex­pen­sive. But if you are on the in­side, try­ing to in­stan­ti­ate aliens lets you es­cape the simu­la­tion.

So the cost-benefit anal­y­sis for be­ing nice is ac­tu­ally pretty at­trac­tive, and is likely to be a bet­ter deal than a 10% risk of ex­tinc­tion.

(Though this ar­gu­ment de­pends on how ac­cu­rately the simu­la­tors are able to gauge our in­ten­tions, and whether it is pos­si­ble to look nice but ul­ti­mately defect.)
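One way to make this cost-benefit comparison concrete is the toy calculation below. Every payoff in it is an assumption introduced for illustration: one universe is worth 1, a nice civilization inside a simulation is always released, a civilization that goes it alone inside a simulation is never released, and the simulators read intentions perfectly. The names p_inside and outside_cost are hypothetical parameters, not quantities the argument specifies.

```python
def expected_values(p_inside, outside_cost, p_extinction=0.1):
    """Expected payoff of being nice vs. going it alone.

    Hypothetical payoffs, on a scale where one universe is 1:
      nice, inside   -> 1 (released from the simulation)
      nice, outside  -> 1 - outside_cost
      alone, inside  -> 0 (assumed: never released)
      alone, outside -> 1 - p_extinction
    """
    p_outside = 1.0 - p_inside
    nice = p_inside * 1.0 + p_outside * (1.0 - outside_cost)
    alone = p_outside * (1.0 - p_extinction)
    return nice, alone


if __name__ == "__main__":
    print(expected_values(p_inside=0.5, outside_cost=1.0))
    print(expected_values(p_inside=0.0, outside_cost=0.5))
```

Under these assumptions, even if being nice on the outside costs the whole universe (outside_cost = 1.0), being nice comes out ahead once p_inside is 0.5. If you are certain you are on the outside, the comparison reverses whenever outside_cost exceeds the extinction risk. Relaxing the perfect-screening assumption would weaken the case for niceness, as the caveat above notes.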

How sen­si­tive is moral value to the de­tails of the aliens?

If an AI is from ex­actly the same dis­tri­bu­tion that we are, I think it’s par­tic­u­larly likely that they are a good suc­ces­sor.

In­tu­itively, I feel like good­ness prob­a­bly doesn’t de­pend on in­cred­ibly de­tailed facts about our civ­i­liza­tion. For ex­am­ple, sup­pose that the planets in a simu­la­tion are 10% smaller, on av­er­age, than the planets in the real world. Does that de­crease the moral value of life from that simu­la­tion? What if they are 10% larger?

What if we can’t af­ford to wait un­til evolu­tion pro­duces in­tel­li­gence by chance, so we choose some of the “ran­dom­ness” to be par­tic­u­larly con­ducive to life? Does that make all the differ­ence? What if we simu­late a smaller pop­u­la­tion than evolu­tion over a larger num­ber of gen­er­a­tions?

Over­all I don’t have very strong in­tu­itions about these ques­tions and the do­main is con­fus­ing. But my weak in­tu­ition is that none of these things should make a big moral differ­ence.

One caveat is that in or­der to as­sess whether a civ­i­liza­tion is “nice,” you need to see what they would do un­der re­al­is­tic con­di­tions, i.e. con­di­tions from the same dis­tri­bu­tion that the “base­ment” civ­i­liza­tions are op­er­at­ing un­der. This doesn’t nec­es­sar­ily mean that they need to evolve in a phys­i­cally plau­si­ble way though, just that they think they evolved nat­u­rally. To test nice­ness we could evolve life, then put it down in a world like ours (with a plau­si­ble-look­ing evolu­tion­ary record, a plau­si­ble sky, etc.)

The de­ci­sion-the­o­retic /​ simu­la­tion ar­gu­ment seems more sen­si­tive to de­tails than the com­mon­sense moral­ity ar­gu­ment. But even for the de­ci­sion-the­o­retic ar­gu­ment, as long as we cre­ate a his­tor­i­cal record con­vinc­ing enough to fool the simu­lated peo­ple, the same ba­sic anal­y­sis seems to ap­ply. After all, how do we know that our his­tory and sky aren’t fake? Over­all the de­ci­sion-the­o­retic anal­y­sis gets re­ally weird and com­pli­cated and I’m very un­sure what the right an­swer is.

(Note that this ar­gu­ment is very fun­da­men­tally differ­ent from us­ing de­ci­sion the­ory to con­strain the be­hav­ior of an AI — this is us­ing de­ci­sion the­ory to guide our own be­hav­ior.)

Conclusion


Even if we knew how to build an un­al­igned AI that is prob­a­bly a good suc­ces­sor, I still think we should strongly pre­fer to build al­igned AGI. The ba­sic rea­son is op­tion value: if we build an al­igned AGI, we keep all of our op­tions open, and can spend more time think­ing be­fore mak­ing any ir­re­versible de­ci­sion.

So why even think about this stuff?

If build­ing al­igned AI turns out to be difficult, I think that build­ing an un­al­igned good suc­ces­sor is a plau­si­ble Plan B. The to­tal amount of effort that has been in­vested in un­der­stand­ing which AIs make good suc­ces­sors is very small, even rel­a­tive to the amount of effort that has been in­vested in un­der­stand­ing al­ign­ment. More­over, it’s a sep­a­rate prob­lem that may in­de­pen­dently turn out to be much eas­ier or harder.

I cur­rently be­lieve:

  • There are definitely some AIs that aren’t good successors. It’s probably the case that many AIs aren’t good successors (but are instead like PepsiCo).

  • There are very likely to be some AIs that are good successors but are very hard to build (like the detailed simulation of a world-just-like-Earth).

  • It’s plau­si­ble that there are good suc­ces­sors that are easy to build.

  • We’d likely have a much bet­ter un­der­stand­ing of this is­sue if we put some qual­ity time into think­ing about it. Such un­der­stand­ing has a re­ally high ex­pected value.

Overall, I think the question “which AIs are good successors?” is both neglected and time-sensitive, and is my best guess for the highest-impact question in moral philosophy right now.
