Can few-shot learning teach AI right from wrong?

0: Introduction

In a post a few months ago on pointing to environmental goals, Abram Demski reminded me of the appeal of defining good behavior by extensional examples. He uses the example of building bridges. The AI does a bunch of unsupervised learning to explore the simulation environment, so that when humans show it just a few labeled examples of good bridges, it will have pre-learned some high-level concepts that let it easily classify good bridges.

Unfortunately, this doesn’t work, as Abram explains. But it seems like it should, in some sense: it seems like we have a basic approach that would work if only we understood some confusing details better. Maybe that’s not so, but I think it’s worth some effort.

One way of looking at this issue is that we’re trying to understand concept learning: how the AI can emulate a human understanding of the world. Another way is as implementing an understanding of reference: giving the AI examples is an attempt to point at some “thing,” and we want the AI to take this as a cue to find the concept being pointed at, not just look at the “finger” doing the pointing.

Over the last couple of months I’ve been reading and thinking on and off about reference, and I’ve got about three posts’ worth of thoughts. This post will try to communicate what kind of value learning scheme I’m even talking about, point out some flaws, and provide a little background. The second post will start speculating about ways to get around some of these flaws, and will probably be the most applicable to AI; the third post will be about humans and the philosophy of reference.

1: Second Introduction

The goal, broadly, is to build an AI that satisfies human values. But no AI is going to know what human values are, or what it means to satisfy them, unless we can communicate those things and it can learn them.

The impossible way to do this would be to write down what it means to satisfy human values as a long list of program instructions. Most relevant to this post, it’s impossible because nobody can write down human values by hand: we embody them, but we can’t operationalize them any more than we can write down the frequency spectrum of the sounds we hear.

If we can’t duplicate human values by hand, the only remaining option seems to be machine learning. The human has some complicated definition of “the right thing,” and we just need to use [insert your favorite method] to teach this concept to the AI. The only trouble is that we’re still a little bit fuzzy on how to define “human,” “has,” “definition,” and “teach” in that sentence.

Still, value learning intuitively seems promising. It’s like how, if you don’t speak the same language as someone, you can still communicate by pointing. Given an AI with a comprehensive model of the world, it seems like we should be able to give it examples of human values being satisfied and say, somehow, “do that stuff.”

To be more concrete, we might imagine a specific AI. Not something at the pinnacle of capability, just a toy model. This AI is made of three parts:

  • An unsupervised learning algorithm that learns a model of the world and rules for predicting the future state of the model.

  • A supervised algorithm that takes some labeled sensory examples of good behavior, plus the model of the world learned by the unsupervised algorithm, and tries to classify which sequences of states of the model are good.

  • An action-selection rule: the AI just follows strategies that result in strongly classified-good states of its predictive model.

We’re still many breakthroughs away from knowing how to build those parts, but if we assume they’ll work, we can get a picture of an AI that has a complicated predictive model of the world, then tries to find the commonalities of the training examples and push the world in that direction. What could go wrong?
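To make the shape of that design concrete, here’s a minimal sketch in Python. Everything in it is a stand-in of my own (the class names, the nearest-prototype classifier, the one-dimensional “dynamics”); the point is how the three parts wire together, not how any of them would actually be learned.

```python
import numpy as np


class WorldModel:
    """Unsupervised part (stand-in): turns raw observations into model states
    and predicts the next model state given an action."""

    def encode(self, observation):
        # Placeholder for learned high-level features.
        return np.asarray(observation, dtype=float)

    def predict_next(self, state, action):
        # Placeholder dynamics; the real thing would be learned rules
        # for predicting the future state of the model.
        return state + action


class GoodnessClassifier:
    """Supervised part (stand-in): fit on a few labeled examples expressed in
    the world model's state space, then score how good a state looks."""

    def fit(self, states, labels):
        good = np.array([s for s, y in zip(states, labels) if y == 1])
        self.prototype = good.mean(axis=0)  # crude nearest-prototype rule

    def score(self, state):
        return -float(np.linalg.norm(state - self.prototype))


def choose_action(model, classifier, state, actions):
    """Action selection: take the action whose predicted next state the
    classifier rates most strongly good."""
    return max(actions, key=lambda a: classifier.score(model.predict_next(state, a)))


if __name__ == "__main__":
    model = WorldModel()
    classifier = GoodnessClassifier()
    # A handful of labeled "sensory" examples, pushed through the world model.
    examples, labels = [[5.0], [4.0], [0.0]], [1, 1, 0]
    classifier.fit([model.encode(x) for x in examples], labels)
    print(choose_action(model, classifier, model.encode([2.0]), actions=[-1, 0, 1, 2]))
```

The interesting part is the wiring: the classifier never sees the raw data on its own terms, it sees whatever states the unsupervised model has learned, and the agent acts by searching for predicted states the classifier rates as strongly good.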

2: What could go wrong?

I have a big ol’ soft spot for that AI design. But it will immediately, deeply fail. The thing it learns to classify is simply not going to be what we wanted it to learn. We’re going to show it examples that, from our perspective, are an extensional definition of satisfying human values. But the concept we’re trying to communicate is a very small target to hit, and there are many other hypotheses that match the data about as well.

Just as an image classifier may learn to recognize the texture of fur or the shape of a dog’s eye without ever learning the silhouette of the entire dog, our classifier can do well on the training examples without learning all the features we associate with human value. And just as an image recognizer can come to think that the grass in the background is an important part of being a dog, our classifier will learn things from the examples that we think of as spurious.
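Here’s a toy version of the “many hypotheses match the data about as well” problem, with made-up features standing in for the dog’s silhouette and the grass in the background (this assumes scikit-learn, and is only an illustration, not anything from the design above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Column 0: the feature we care about ("dog silhouette present").
# Column 1: a spurious correlate ("grass in the background") that happens to
# line up with the labels perfectly in a tiny training set.
X_train = np.array([
    [1, 1],
    [1, 1],
    [0, 0],
    [0, 0],
])
y_train = np.array([1, 1, 0, 0])

# Two different hypotheses, each using only one feature, both fit the
# training data perfectly; the data alone cannot tell them apart.
real = LogisticRegression().fit(X_train[:, [0]], y_train)
spurious = LogisticRegression().fit(X_train[:, [1]], y_train)
print(real.score(X_train[:, [0]], y_train))       # 1.0
print(spurious.score(X_train[:, [1]], y_train))   # 1.0

# A classifier given both features spreads its weight across them, so at
# deployment it may track the grass as much as the dog.
both = LogisticRegression().fit(X_train, y_train)
print(both.coef_)  # roughly equal weight on the real and spurious features
```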

The AI building its own world-model from unlabeled observations will help with these problems in the same way that providing more data would, but it doesn’t provide a principled solution. There will still be no exact analogue of the human concept we want to communicate, because of missing or spurious features. Or the AI might use a different level of abstraction than we expected: humans view the world through a particular way of chunking atoms into larger objects and a particular way of modeling other humans. Our examples might be more similar when considered in terms of features we didn’t even think of.

Even worse, in some sense we are hoping that the AI isn’t smart enough to learn the true explanation for the training examples, which is that humans picked them. We’re trying to communicate goodness, not “the sort of thing humans select for the training set.” To the extent that humans are not secure systems, there are adversarial examples that would get us to include them in the training set even though they aren’t good. We might imagine “marketing examples” optimized for persuasiveness at the cost of goodness, or a series of flashing lights that would have caused you to hit the button to include it in the training set. This failure mode is the AI design being coded to look at the pointing finger, not the object pointed at.
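One way to see that failure in miniature: suppose what the labels really track is “a human chose to include this,” and that choice can be swayed by a persuasiveness feature that has nothing to do with goodness. The linear model and the numbers below are my own stand-ins, but they show the basic problem: optimizing the learned score rewards persuasiveness about as much as goodness.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
goodness = rng.normal(size=n)
persuasiveness = rng.normal(size=n)

# Hypothetical selection rule: the human includes an example when it *seems*
# good to them, which persuasiveness can partly fake.
selected = (goodness + persuasiveness > 0).astype(float)

# Fit a linear score predicting selection from the two features.
X = np.column_stack([goodness, persuasiveness])
w, *_ = np.linalg.lstsq(X, selected, rcond=None)
print("learned weights (goodness, persuasiveness):", np.round(w, 2))

# "Acting": search candidate outcomes for the highest learned score.
candidates = rng.normal(size=(10_000, 2)) * 3.0
best = candidates[np.argmax(candidates @ w)]
print("chosen outcome (goodness, persuasiveness):", np.round(best, 2))
# The winner scores high partly by maxing persuasiveness: the learned score
# is about the pointing finger, not the object pointed at.
```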

All of these problems show up across many agent designs, implying that we are doing something wrong and don’t know how to do it right. What’s missing is the ability to do reference: to go from referring speech-acts to the thing being referred to. In order to figure out what humans mean, the AI should really reason about human intentions and human categories (Dennett’s intentional stance), and we have to understand the AI’s reasoning well enough to connect it to the motivational system before turning the AI on.

3: Related ideas

The same lack of understanding that keeps us from just telling an AI “Do what I mean!” also appears in miniature whenever we try to teach concepts to an AI. MIRI uses the example of a diamond-maximizing AI as something that seems simple but requires communicating a concept (“diamond”) to the AI, particularly in a way that’s robust to ontological shifts. Abram Demski uses the example of teaching an AI to build good bridges, something that’s easy to approximate with current machine learning methods but that may fail badly if we hook that approximation up to a powerful agent. On the more applied end, a recent highlight is IBM training a recommendation system to learn guidelines from examples.

All those examples might be thought of as “stuff”: diamond, good bridges, or age-appropriate movies. But we also want the AI to be able to learn about processes. This is related to Dylan Hadfield-Menell et al.’s work on cooperative inverse reinforcement learning (CIRL), which uses the example of motion on a grid (as is common for toy problems in reinforcement learning; see also DeepMind’s AI safety gridworlds).

There are also broad concepts, like “love,” which seem important to us but which don’t seem to be stuff or processes per se. We might imagine cashing out such abstractions in terms of natural language processing and verbal reasoning, or as variables that help predict stuff and processes. These will come up later, because it does seem reasonable that “human flourishing” might be this sort of concept.

4: Philosophy! *shakes fist*

This reference issue is clearly within the field of philosophy. So it would be really wonderful if we could just go to the philosophy literature and find a recipe for how an AI needs to behave if it’s to learn human referents from human references. Or at least the literature might have some important insights that would help with developing such a recipe. I thought it was worth a look.

Long story short, it wasn’t. The philosophy literature on reference is largely focused on reference as a thing that inheres in sentences and other communications. Here’s how silly it can get: it is considered a serious problem (by some) how, if there are multiple people named Vanya Ivanova, your spoken sentence about Vanya Ivanova can pick out which one it should really be referring to, so that it can have the right reference-essence.

Since computers can’t perceive reference-essence, what I was looking for was some sort of functional account of how the listener interprets references. And there are certainly people who’ve been thinking more in this direction: Gricean implicature and so on. But even here, people like Kent Bach, sharp people who seem to be going in the necessary direction, aren’t producing work that looks to be of use to AI. The standards of the field just don’t require you to be that precise or that functionalist.

5: What this sequence isn’t

This post has been all about setting the stage and pointing out problems. We started with this dream of an AI design that learns to classify strategies as human-friendly based on a small number of examples of human-friendly actions or states, plus a powerful world-model. And then we immediately got into trouble.

My purpose is not to defend or fix this specific design-dream. It’s to work on the deeper problem that lies behind many of the individual problems with this design. And by that I mean our ignorance and confusion about how an AI should implement an understanding of reference.

In fact, our example AI probably isn’t stably self-improving, or corrigible in Eliezer’s sense of fully updated deference, or human-legible, or fail-safe-ish if we tell it the wrong thing. And that’s fine, because that’s not what this sequence is about. The question at hand is how to tell the AI anything at all, and have it understand what we meant, as we meant it.