Thoughts on “Human-Compatible”

The pur­pose of this book is to ex­plain why [su­per­in­tel­li­gence] might be the last event in hu­man his­tory and how to make sure that it is not… The book is in­tended for a gen­eral au­di­ence but will, I hope, be of value in con­vinc­ing spe­cial­ists in ar­tifi­cial in­tel­li­gence to re­think their fun­da­men­tal as­sump­tions.

Yes­ter­day, I ea­gerly opened my copy of Stu­art Rus­sell’s Hu­man Com­pat­i­ble (mir­ror­ing his Cen­ter for Hu­man-Com­pat­i­ble AI, where I’ve worked the past two sum­mers). I’ve been cu­ri­ous about Rus­sell’s re­search agenda, and also how Rus­sell ar­gued the case so con­vinc­ingly as to gar­ner the fol­low­ing ac­cla­ma­tions from two Tur­ing Award win­ners:

Hu­man Com­pat­i­ble made me a con­vert to Rus­sell’s con­cerns with our abil­ity to con­trol our up­com­ing cre­ation—su­per-in­tel­li­gent ma­chines. Un­like out­side alarmists and fu­tur­ists, Rus­sell is a lead­ing au­thor­ity on AI. His new book will ed­u­cate the pub­lic about AI more than any book I can think of, and is a delight­ful and up­lift­ing read.—Judea Pearl

This beau­tifully writ­ten book ad­dresses a fun­da­men­tal challenge for hu­man­ity: in­creas­ingly in­tel­li­gent ma­chines that do what we ask but not what we re­ally in­tend. Essen­tial read­ing if you care about our fu­ture. —Yoshua Bengio

Ben­gio even re­cently lent a rea­soned voice to a de­bate on in­stru­men­tal con­ver­gence!

Bring­ing the AI com­mu­nity up-to-speed

I think the book will greatly help AI pro­fes­sion­als un­der­stand key ar­gu­ments, avoid clas­sic mis­steps, and ap­pre­ci­ate the se­ri­ous challenge hu­man­ity faces. Rus­sell straight­for­wardly de­bunks com­mon ob­jec­tions, writ­ing with both can­dor and charm.

I must ad­mit, it’s great to see such a promi­nent de­bunk­ing; I still re­mem­ber, early in my con­cern about al­ign­ment, hear­ing one pro­fes­sional re­spond to the en­tire idea of be­ing con­cerned about AGI with a lazy ad hominem dis­mis­sal. Like, hello? This is our fu­ture we’re talk­ing about!

But Russell realizes that most people don’t intentionally argue in bad faith; he structures his arguments with the understanding and charity required to ease the difficulty of changing one’s mind. (Although I wish he’d be a little less sassy with LeCun, understandable as his frustration may be.)

More im­por­tant than hav­ing fish, how­ever, is know­ing how to fish; Rus­sell helps train the right men­tal mo­tions in his read­ers:

With a bit of prac­tice, you can learn to iden­tify ways in which the achieve­ment of more or less any fixed ob­jec­tive can re­sult in ar­bi­trar­ily bad out­comes. [Rus­sell goes on to de­scribe spe­cific ex­am­ples and strate­gies] (p139)

He somehow explains the difference between the Platonic assumptions of RL and the reality of a human-level reasoner, while also introducing wireheading. He covers the utility-reward gap, explaining that our understanding of real-world agency is so crude that we can’t even coherently talk about the “purpose” of, e.g., AlphaGo. He explains instrumental subgoals. These bits are so, so good.

Now for the main course, for those already fa­mil­iar with the ba­sic ar­gu­ments:

The agenda

Please re­al­ize that I’m re­ply­ing to my un­der­stand­ing of Rus­sell’s agenda as com­mu­ni­cated in a non­tech­ni­cal book for the gen­eral pub­lic; I also don’t have a men­tal model of Rus­sell per­son­ally. Still, I’m work­ing with what I’ve got.

Here’s my summary: the robot maintains reward uncertainty through some extension of a CIRL-like (cooperative inverse reinforcement learning) setup; accounts for human irrationality using our scientific knowledge; does aggregate preference utilitarianism over all of the humans on the planet; discounts people by how well their beliefs map to reality; and perhaps downweights motivations such as envy (to mitigate the problem of everyone wanting positional goods). One open challenge is deciding which preference-shaping situations the robot should guide us towards (maybe we need meta-preference learning?). Russell also has a vision of many agents, each working to reasonably pursue the wishes of its owner (while being considerate of others).

I’m go­ing to sim­plify the situ­a­tion and just ex­press my con­cerns about the case of one ir­ra­tional hu­man, one robot.

There’s fully up­dated defer­ence:

One pos­si­ble scheme in AI al­ign­ment is to give the AI a state of moral un­cer­tainty im­ply­ing that we know more than the AI does about its own util­ity func­tion, as the AI’s meta-util­ity func­tion defines its ideal tar­get. Then we could tell the AI, “You should let us shut you down be­cause we know some­thing about your ideal tar­get that you don’t, and we es­ti­mate that we can op­ti­mize your ideal tar­get bet­ter with­out you.”

The ob­sta­cle to this scheme is that be­lief states of this type also tend to im­ply that an even bet­ter op­tion for the AI would be to learn its ideal tar­get by ob­serv­ing us. Then, hav­ing ‘fully up­dated’, the AI would have no fur­ther rea­son to ‘defer’ to us, and could pro­ceed to di­rectly op­ti­mize its ideal tar­get.

Russell partially addresses this by advocating for realizability, and by avoiding feature misspecification through (somehow) allowing previously unknown features to be added dynamically (see also Incorrigibility in the CIRL Framework). But supposing we don’t have this kind of model misspecification, I don’t see how the “AI simply fully computes the human’s policy, updates, and then no longer lets us correct it” issue is addressed. If you’re really confident that computing the human policy lets you just extract the true preferences under the realizability assumptions, maybe this is fine? I suspect Russell has more to say here that didn’t make it onto the printed page.
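To make the “fully updated” worry concrete, here is a toy calculation. It is my own sketch, not something from the book: it assumes a single action with unknown utility u, a robot holding a discrete belief over u, and a perfectly rational human who knows u and vetoes the action exactly when u is negative. All names below are hypothetical.

```python
from typing import Dict

# A belief maps each candidate utility value u to its probability.
Belief = Dict[float, float]


def value_of_acting(belief: Belief) -> float:
    """Robot decides alone: act if the expected utility is positive, else do nothing."""
    expected = sum(u * p for u, p in belief.items())
    return max(expected, 0.0)


def value_of_deferring(belief: Belief) -> float:
    """Robot defers to a human who (by assumption) knows u and blocks the action when u < 0."""
    return sum(max(u, 0.0) * p for u, p in belief.items())


def deference_incentive(belief: Belief) -> float:
    """How much the robot expects to gain by letting the human decide."""
    return value_of_deferring(belief) - value_of_acting(belief)


uncertain: Belief = {-1.0: 0.5, 2.0: 0.5}  # robot is unsure whether the action is good
fully_updated: Belief = {2.0: 1.0}         # robot has learned u from observation

print(deference_incentive(uncertain))      # 0.5
print(deference_incentive(fully_updated))  # 0.0
```

In this toy model the incentive to defer comes entirely from the robot's remaining uncertainty; once the belief collapses to a point, deferring and acting have the same value, and nothing in the setup gives the robot a further reason to accept correction.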

There’s also the is­sue of get­ting a good enough hu­man mis­take model, and figur­ing out peo­ple’s be­liefs, all while at­tempt­ing to learn their prefer­ences (see the value learn­ing se­quence).

Now, it would be pretty silly to reply to an outlined research agenda with “but specific problems X, Y, and Z!”, because the whole point of further research is to solve problems. However, my concerns are more structural. Certain AI designs are more robust to things going wrong (in specification or in training), or simply rest on fewer assumptions. It seems to me that the uncertainty-based approach demands getting component after component “right enough”.

Let me give you an ex­am­ple of some­thing which is in­tu­itively “more ro­bust” to me: ap­proval-di­rected agency.

Con­sider a hu­man Hugh, and an agent Arthur who uses the fol­low­ing pro­ce­dure to choose each ac­tion:

Es­ti­mate the ex­pected rat­ing Hugh would give each ac­tion if he con­sid­ered it at length. Take the ac­tion with the high­est ex­pected rat­ing.

Here, the ap­proval-policy does what a pre­dic­tor says to do at each time step, which is differ­ent from max­i­miz­ing a sig­nal. Its shape feels differ­ent to me; the policy isn’t shaped to max­i­mize some re­ward sig­nal (and pur­sue in­stru­men­tal sub­goals). Er­rors in pre­dic­tion al­most cer­tainly don’t pro­duce a policy ad­ver­sar­ial to hu­man in­ter­ests.
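The procedure quoted above can be sketched in a few lines. This is only an illustration under loud assumptions: the action names and the ratings table are made up, and the table stands in for whatever learned predictor of Hugh's considered judgment Arthur would actually use.

```python
from typing import Callable, Dict, Sequence


def approval_directed_action(
    actions: Sequence[str],
    predict_rating: Callable[[str], float],
) -> str:
    """Pick the action with the highest predicted rating from Hugh.

    The choice is made one step at a time from predicted approval;
    there is no long-horizon reward being maximized here.
    """
    return max(actions, key=predict_rating)


# Hypothetical ratings standing in for a learned predictor of Hugh's judgment.
toy_ratings: Dict[str, float] = {
    "fetch coffee": 0.9,
    "do nothing": 0.5,
    "disable off switch": 0.0,
}

chosen = approval_directed_action(list(toy_ratings), toy_ratings.__getitem__)
print(chosen)  # fetch coffee
```

If the predictor is somewhat wrong, Arthur picks a somewhat worse-rated action at that step; the sketch has no objective that would reward planning around Hugh.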

How does this com­pare with the un­cer­tainty ap­proach? Let’s con­sider one thing it seems we need to get right:

Where in the world is the hu­man?

How will the agent ro­bustly lo­cate the hu­man whose prefer­ences it’s learn­ing, and why do we need to worry about this?

Well, a novice might worry, “what if the AI doesn’t properly cleave reality at its joints, relying on a bad representation of the world?” But good predictive accuracy is instrumentally useful for maximizing the reward signal, so we can expect its implicit representation of the world to continually improve (i.e., it comes to find a nice efficient encoding). We don’t have to worry about this: the AI is incentivized to get this right.

How­ever, if the AI is meant to de­duce and fur­ther the prefer­ences of that sin­gle hu­man, it has to find that hu­man. But, be­fore the AI is op­er­a­tional, how do we point to our con­cept of “this per­son” in a yet-un­formed model whose en­cod­ing prob­a­bly doesn’t cleave re­al­ity along those same lines? Even if we fix the struc­ture of the AI’s model so we can point to that hu­man, it might then have in­stru­men­tal in­cen­tives to mod­ify the model so it can make bet­ter pre­dic­tions.

Why does it matter so much that we point exactly to the human? Well, otherwise we’re extrapolating the “preferences” of something that is not the person (or a person?); the predicted human policy in this case seems highly sensitive to the details of the person or entity being pointed to. This seems like it could easily end in tragedy, and (strong belief, weakly held) doesn’t seem like the kind of problem that has a clean solution. This sort of thing seems to happen quite often for proposals which hinge on things-in-ontologies.

Hu­man ac­tion mod­els, mis­take mod­els, etc. are also difficult in this way, and we have to get them right. I’m not nec­es­sar­ily wor­ried about the difficul­ties them­selves, but that the frame­work seems so sen­si­tive to them.


This book is most definitely an im­por­tant read for both the gen­eral pub­lic and AI spe­cial­ists, pre­sent­ing a thought-pro­vok­ing agenda with worth­while in­sights (even if I don’t see how it all ends up fit­ting to­gether). To me, this seems like a key tool for out­reach.

Just think: in how many wor­lds does al­ign­ment re­search benefit from the ad­vo­cacy of one of the most dis­t­in­guished AI re­searchers ever?