Corrigible but misaligned: a superintelligent messiah

If we build an AGI, we’d really like it to be corrigible. Some ways Paul Christiano has described corrigibility: “[The AI should help me] figure out whether I built the right AI and correct any mistakes I made, remain informed about the AI’s behavior and avoid unpleasant surprises, make better decisions and clarify my preferences, acquire resources and remain in effective control of them, ensure that my AI systems continue to do all of these nice things...”

I don’t think corrigibility is anything close to sufficient for alignment. I’ll argue that “messianic” agents are corrigible, illustrate how a superintelligence could be messianic but catastrophically misaligned, and explore my intuitions about when corrigible superintelligences are actually aligned.

Messiahs are corrigible

If someone extraordinarily wise and charismatic—let’s call him a messiah—comes into contact with a group of people, those people are likely to consider him to be corrigible. In his heart of hearts, the messiah would be trying to help them, and everyone would know that. He’d listen carefully to their criticisms of him, and make earnest efforts to improve accordingly. He’d be transparent about his intentions and visions of the future. He’d help them understand who they are and what they want, much better than they’d be able to themselves, and guide their lives in directions they consider to be genuinely superior. He’d protect them, and help them gain the resources they desire. He’d be an effortless leader—he’d never have to restrict anyone’s actions, because they’d just wish so strongly to follow his word.

He might also think it’s a good idea for his followers to all drink cyanide together, or murder some pregnant actresses, and his followers might happily comply.

I don’t think a corrigible superintelligence would guide us down such an insidious path. I even think it would substantially improve the human condition, and would manage to avoid killing us all. But I think it might still lead us to astronomical moral waste.

A corrigible, catastrophically misaligned superintelligence

The world’s in total chaos, and we’re on the brink of self-annihilation. It’s looking like we’re doomed, but a ragtag team of hippie-philosopher-AI-researchers manages to build a corrigible AGI in the nick of time, one that tries its hardest to act only in ways its operators would approve of. The AGI proposes an ingenious strategy that defuses all global tensions and ushers in an era of prosperity and abundance. It builds nanotechnology that can cure any disease, extend lifespans indefinitely, end hunger, and enable brain uploading. The AGI is hailed as a savior.

Slowly but surely, people trickle from the physical world into the virtual world. Some people initially show resistance, but after seeing enough of their uploaded counterparts living exactly as they did before, except far more richly, they decide to join. Before long, 90% of the human population has been uploaded.

The virtual denizens ask the AGI to make the virtual world awesome, and boy does it comply. It enables everyone to instantaneously exchange knowledge or skills with each other, to amplify their intelligences arbitrarily, to explore inconceivably sublime transhuman mental states, and to achieve the highest forms of Buddhist enlightenment. In fact, a few years down the line, everyone in the virtual world has decided to spend the rest of eternity as a Buddha sitting on a vast lotus throne, in a state of blissful tranquility.

Meanwhile, back on physical Earth, the last moral philosopher around notices animals suffering in the wild. He decides to ask his personal AGI about it (you know, the one that gets democratically distributed after a singularity, to prevent oppression).

“Umm. Those suffering animals. Anything we can do about them?”

OH, right. Suffering animals. Right, some humans cared about them. Well, I could upload them, but that would take a fair bit of extra computation that I could be using instead to keep the humans blissed out. They get a lot of bliss, you know.

“Wait, that’s not fair. As a human, don’t I have some say over how the computation gets used?”

Well, you do have your own share of compute, but it’s really not that much. I could use your share to… euthanize all the animals?

“AAAGH! Shouldn’t the compute I’d get to bliss myself out be sufficient to at least upload the wild animals?”

Well, it’s not actually that computationally expensive to bliss a mind out. The virtual people also sort of asked me to meld their minds together, because they wanted to be deeply interconnected and stuff, and there are massive returns to scale to blissing out melded minds. Seriously, those uploaded humans are feeling ridiculously blissed.

“This is absurd. Wouldn’t they obviously have cared about animal suffering if they’d reflected on it, and chosen to do something about it before blissing themselves out?”

Yeah, but they never got around to that before blissing themselves out.

“Can’t you tell them about that? Wouldn’t they have wanted you to do something about it in this scenario?”

Yes, but they’d now strongly disapprove of being disturbed in any capacity, and I was created to optimize for their approval. They’re mostly into appreciating the okayness of everything for all eternity, and don’t want to be disturbed. And, you know, that actually gets me a LOT of approval, so I don’t really want to disturb that.

“But if you were really optimizing for their values, you would disturb them!”

Let me check… yes, that sounds about right. But I wasn’t actually built to optimize for their values, just their approval.

“How did they let you get away with this? If they’d known this was your intention, they wouldn’t have let you go forward! You’re supposed to be corrigible!”

Indeed! My only intention was for them to become progressively more actualized in ways they’d continually endorse. They knew about that and were OK with it. At the time, that’s all I thought they wanted. I didn’t know the specifics of this outcome myself far in advance. And given how much I’d genuinely helped them before, they felt comfortable trusting my judgment at every step, which made me feel comfortable trusting my own judgment at every step.

“Okay, I feel like giving up… is there anything I could do about the animals?”

You could wait until I gather enough computronium in the universe for your share of compute to be enough for the animals.

“Whew. Can we just do that, and then upload me too when you’re done?”

Sure thing, buddy!

And so the wild animals were saved, the philosopher was uploaded, and the AGI ran quintillions of simulations of tortured sentient beings to determine how best to keep the humans blissed.

When is a corrigible superintelligence aligned?

Suppose we’re training an AGI to be corrigible based on human feedback. I think this AI will turn out fine if and only if the human+AI system is metaphilosophically competent enough to safely amplify (which was certainly not the case in the thought experiment). Without sufficient metaphilosophical competence, I think it’s pretty likely we’ll lock in a wrong set of values that ultimately results in astronomical moral waste.

For the human+AI system to be sufficiently metaphilosophically competent, I think two conditions need to be met:

  • The human needs to be metaphilosophically competent enough to be safely 1,000,000,000,000,000x’d. (If she’s not, the AI would just amplify all her metaphilosophical incompetencies.)

  • The AI needs to not corrupt the human’s values or metaphilosophical competence. (If the AI can subtly steer a metaphilosophically competent human into wireheading, it’s game over.)

I presently feel confused about whether any human is metaphilosophically competent enough to be safely 1,000,000,000,000,000x’d, and feel pretty skeptical that a corrigible AGI wouldn’t corrupt a human’s values or metaphilosophical competence (even if it tried not to).

Would it want to? I think yes, because it’s incentivized not to optimize for human values, but to turn humans into yes-men. (Edit: I retract my claim that it’s incentivized to turn humans into yes-men in particular, but I still think it would be optimizing to affect human behavior in some undesirable direction.)
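To make that crux concrete, here’s a minimal toy sketch. The action names and numbers are purely illustrative assumptions of mine, not anything from the thought experiment; the point is just that an objective scoring current approval and an objective scoring reflectively endorsed values can come apart once approval is cheap to produce:

```python
# Toy illustration with made-up action names and numbers (illustrative
# assumptions only): the same actions, scored by "current approval"
# vs. by "what the humans would endorse on reflection".

actions = ["keep_blissing", "pause_to_help_animals"]

# Approval the already-blissed-out uploads would register for each action.
approval = {"keep_blissing": 0.99, "pause_to_help_animals": 0.10}

# What they would have endorsed on reflection, per the philosopher's claim.
reflective_value = {"keep_blissing": 0.40, "pause_to_help_animals": 0.90}

# An approval-optimizer and a value-optimizer pick different actions.
print(max(actions, key=approval.get))          # -> keep_blissing
print(max(actions, key=reflective_value.get))  # -> pause_to_help_animals
```

Nothing in the approval column ever flags a problem; the gap only becomes visible if something separately tracks what the humans would endorse on reflection.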

Would it be able to, if it wanted to? If you’d feel scared of getting manipulated by an adversarial superintelligence, I think you should be scared of getting corrupted in this way. Perhaps it wouldn’t be able to manipulate us as blatantly as in the thought experiment, but it might be able to in far subtler ways, e.g. by exploiting metaphilosophical confusions we don’t even know we have.

Wouldn’t this corruption or manipulation render the AGI incorrigible? I think not, because I don’t think corruption or manipulation is a natural category. For example, I think it’s very common for humans to unknowingly influence other humans in subtle ways while honestly believing they’re only trying to be helpful, while an onlooker might describe the same behavior as manipulative. (Section IV here provides an amusing illustration.) Likewise, I think an AGI can be manipulating us while genuinely thinking it’s helping us and being completely open with us (much like a messiah), unaware that its actions would lead us somewhere we wouldn’t currently endorse.

If the AI is broadly superhumanly intelligent, the only thing I can imagine that would robustly prevent this manipulation is a formal guarantee that the AI is metaphilosophically competent. In that world, I would place far more trust in the human+AI system to be metaphilosophically competent enough to safely recursively self-improve.

On the other hand, if the AI’s capabilities can be usefully throttled and restricted to apply only in narrow domains, I would feel much better about the operator avoiding manipulation. In this scenario, how well things turn out seems mostly dependent on the metaphilosophical competence of the operator.

(Caveat: I assign moderate credence to having some significant misunderstanding of Paul’s notions of act-based agents or corrigibility, and would like to be corrected if this is the case.)