Reframing misaligned AGIs: well-intentioned non-neurotypical assistants

I think when people imagine misaligned AGIs, they tend to imagine a superintelligent agent optimizing for something other than human values (e.g. paperclips, or a generic reward signal), and mentally picture it as adversarial or malevolent. I think this visualization isn't as applicable for AGIs trained to optimize for human approval, like act-based agents, and I'd like to present one that is.

If you've ever employed someone or had a personal assistant, you might know that the following two things can both be true at once:

  • The employee or assistant is genuinely trying their hardest to optimize for your values. They're trying to understand what you want as much as they can, asking you for help when things are unclear, not taking action until they feel like their understanding is adequate, etc.

  • They follow your instructions literally, under a sensible-to-them-seeming interpretation completely different from your own, and screw up the task entirely.

Suppose you were considering hiring a personal assistant, and you knew a few things about it:

  • Your assistant was raised in a culture completely different from your own.

  • Your assistant is extremely non-neurotypical. It doesn't have an innate sense of pain or empathy or love, it's a savant at abstract reasoning, and it learned everything it knows about the world (including human values) from Wikipedia.

  • Your assistant is in a position where it has access to enormous amounts of resources, and could easily fool you or overpower you if it decided to.

You might consider hiring this assistant and trying really, really hard to communicate to it exactly what you want. It seems like a way better idea to just not hire this assistant. Actually, you'd probably want to run for the hills if you were forced to hire it. Some specific failure modes you might envision:

  • Your assistant's understanding of your values will be weird and off, perhaps in ways that are hard to communicate or even pin down.

  • Your assistant might reason in a way that looks convoluted and obviously wrong to you, while looking natural and obviously correct to it, leading it to happily take actions you'd consider catastrophic.

As an illustration of the above, imagine giving an eager, brilliant, extremely non-neurotypical friend free rein to help you find a romantic partner (e.g. helping you write your OKCupid profile and setting you up on dates). As another illustration, imagine telling an entrepreneur friend that superintelligences can kill us all, and then watching him take drastic actions that clearly indicate he's missing important nuances, all while he misunderstands and dismisses the concerns you raise to him. Now reimagine these scenarios with friends who are drastically more powerful than you.

This is my picture of what happens by default if we construct a recursively self-improving superintelligence by having it learn from human approval. The superintelligence would not be malevolent in the way a paperclip maximizer would be, but for all intents and purposes it might as well be.