Can few-shot learning teach AI right from wrong?

0: Introduction

In a post a few months ago on pointing to environmental goals, Abram Demski reminded me of the appeal of defining good behavior by extensional examples. He uses the example of building bridges. The AI does a bunch of unsupervised learning to explore the simulation environment, so that when humans show it just a few labeled examples of good bridges, it will have pre-learned some high-level concepts that let it easily classify good bridges.

Unfortunately, this doesn’t work, as Abram explains. But it seems like it should, in some sense—it seems like we have a basic approach that would work if only we understood some confusing details better. Maybe that’s not so, but I think it’s worth some effort.

One way of looking at this issue is that we’re trying to understand concept learning—how the AI can emulate a human understanding of the world. Another way is as implementing an understanding of reference—giving the AI examples is an attempt to point at some “thing,” and we want the AI to take this as a cue to find the concept being pointed at, not just look at the “finger” doing the pointing.

Over the last couple months I’ve been reading and thinking on and off about reference, and I’ve got about three posts’ worth of thoughts. This post will try to communicate what kind of value learning scheme I’m even talking about, point out some flaws, and provide a little background. The second post will start speculating about ways to get around some of these flaws and will probably be the most applicable to AI, and the third post will be about humans and the philosophy of reference.

1: Second Introduction

The goal, broadly, is to build an AI that satisfies human values. But no AI is going to know what human values are, or what it means to satisfy them, unless we can communicate those things, and it can learn them.

The impossible way to do this would be to write down what it means to satisfy human values as a long list of program instructions. Most relevant to this post, it’s impossible because nobody can write down human values by hand—we embody them, but we can’t operationalize them any more than we can write down the frequency spectrum of the sounds we hear.

If we can’t duplicate human values by hand, the only remaining option seems to be machine learning. The human has some complicated definition of “the right thing,” and we just need to use [insert your favorite method] to teach this concept to the AI. The only trouble is that we’re still a little bit fuzzy on how to define “human,” “has,” “definition,” and “teach” in that sentence.

Still, value learning intuitively seems promising. It’s like how, if you don’t speak the same language as someone, you can still communicate by pointing. Given an AI with a comprehensive model of the world, it seems like we should be able to give it examples of human values being satisfied and say, somehow, “do that stuff.”

To be more concrete, we might imagine a specific AI. Not something at the pinnacle of capability, just a toy model. This AI is made of three parts (a toy sketch in code follows the list):

  • An unsupervised learning algorithm that learns a model of the world and rules for predicting the future state of the model.

  • A supervised algorithm that takes some labeled sensory examples of good behavior, plus the model of the world learned by the unsupervised algorithm, and tries to classify which sequences of states of the model are good.

  • An action-selection rule: the AI just follows strategies that result in strongly classified-good states of its predictive model.
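
To make the shape of that design concrete, here is a minimal sketch in code. Everything in it is a made-up stand-in: the random linear “world model,” the nearest-centroid “classifier,” and the one-step action chooser are placeholders for learning algorithms we don’t actually know how to build, wired together in the way the list describes.

```python
# Hypothetical sketch of the three-part toy AI; every component here is a
# trivial stand-in for a learning algorithm we don't actually know how to build.
import numpy as np

rng = np.random.default_rng(0)


class WorldModel:
    """Unsupervised part: compress raw observations into latent features
    and predict the next latent state given an action."""

    def __init__(self, obs_dim=16, n_features=8):
        # Stand-ins for a learned encoder and learned dynamics: random linear maps.
        self.encoder = rng.normal(size=(obs_dim, n_features))
        self.dynamics = rng.normal(size=(n_features + 1, n_features))

    def encode(self, observations):
        return observations @ self.encoder

    def predict(self, state, action):
        return np.append(state, action) @ self.dynamics


class GoodnessClassifier:
    """Supervised part: fit on a handful of latent states labeled good / not good."""

    def fit(self, states, labels):
        states, labels = np.asarray(states), np.asarray(labels)
        # Nearest-centroid rule as a stand-in for "find the commonalities
        # of the training examples."
        self.good = states[labels == 1].mean(axis=0)
        self.bad = states[labels == 0].mean(axis=0)
        return self

    def score(self, state):
        return np.linalg.norm(state - self.bad) - np.linalg.norm(state - self.good)


def choose_action(model, classifier, state, candidate_actions):
    """Acting part: follow whatever action leads to the most strongly
    classified-good predicted state."""
    return max(candidate_actions,
               key=lambda a: classifier.score(model.predict(state, a)))


# A few labeled examples stand in for the human pointing at good behavior.
model = WorldModel()
latents = model.encode(rng.normal(size=(6, 16)))
labels = np.array([1, 1, 1, 0, 0, 0])
clf = GoodnessClassifier().fit(latents, labels)
print(choose_action(model, clf, latents[0], candidate_actions=[-1.0, 0.0, 1.0]))
```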

We’re still many breakthroughs away from knowing how to build those parts, but if we assume they’ll work, we can get a picture of an AI that has a complicated predictive model of the world, then tries to find the commonalities of the training examples and push the world in that direction. What could go wrong?

2: What could go wrong?

I have a big ol’ soft spot for that AI design. But it will immediately, deeply fail. The thing it learns to classify is simply not going to be what we wanted it to learn. We’re going to show it examples that, from our perspective, are an extensional definition of satisfying human values. But the concept we’re trying to communicate is a very small target to hit, and there are many other hypotheses that match the data about as well.

Just as a deep image classifier will learn to recognize the texture of fur or the shape of a dog’s eye, but might not learn the silhouette of the entire dog, the classifier can do well on training examples without needing to learn all the features we associate with human value. And just as an image-recognizer will think that the grass in the background is an important part of being a dog, the classifier will learn things from examples that we think of as spurious.
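
Here’s a toy illustration of that failure mode, with the feature names and data invented for the example: a logistic-regression “dog classifier” sees a spurious feature (“grass in the background”) that perfectly predicts the label in training, leans on it, and then does much worse once that correlation breaks.

```python
# Toy illustration of shortcut learning; the features and data are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)

# "dog_shape": a noisy version of the feature we actually care about.
dog_shape = labels + rng.normal(scale=1.5, size=n)
# "grass": spuriously perfect predictor within the training set.
grass = labels.astype(float)

X_train = np.column_stack([dog_shape, grass])
clf = LogisticRegression(max_iter=1000).fit(X_train, labels)
print("train accuracy:", clf.score(X_train, labels))   # high: the shortcut works here

# At deployment the spurious correlation is gone: grass is now unrelated to dogness.
grass_test = rng.integers(0, 2, size=n).astype(float)
X_test = np.column_stack([dog_shape, grass_test])
print("test accuracy:", clf.score(X_test, labels))      # much worse
```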

The AI building its own world-model from unlabeled observations will help with these problems the same way that providing more data would, but it doesn’t provide a principled solution. There will still be no exact analogue of the human concept we want to communicate, because of missing or spurious features. Or the AI might use a different level of abstraction than we expected—humans view the world through a particular way of chunking atoms into larger objects and a particular way of modeling other humans. Our examples might be more similar when considered in terms of features we didn’t even think of.

Even worse, in some sense we are hoping that the AI isn’t smart enough to learn the true explanation for the training examples, which is that humans picked them. We’re trying to communicate goodness, not “the sort of thing humans select for the training set.” To the extent that humans are not secure systems, there are adversarial examples that would get us to include them in the training set without being good. We might imagine “marketing examples” optimized for persuasiveness at the cost of goodness, or a series of flashing lights that would have caused you to hit the button to include it in the training set. This failure is the AI design being coded to look at the pointing finger, not the object pointed at.

All of these problems show up across many agent designs, implying that we are doing something wrong and don’t know how to do it right. What’s missing is the ability to do reference—to go from referring speech-acts to the thing being referred to. In order to figure out what humans mean, the AI should really reason about human intention and human categories (Dennett’s intentional stance), and we have to understand the AI’s reasoning well enough to connect it to the motivational system before turning the AI on.

3: Related ideas

The same lack of understanding that stands in the way of just telling an AI “Do what I mean!” also appears in miniature whenever we’re trying to teach concepts to an AI. MIRI uses the example of a diamond-maximizing AI as something that seems simple but requires communicating a concept (“diamond”) to the AI, particularly in a way that’s robust to ontological shifts. Abram Demski uses the example of teaching an AI to build good bridges, something that’s easy to approximate with current machine learning methods, but may fail badly if we hook that approximation up to a powerful agent. On the more applied end, a recent highlight is IBM training a recommendation system to learn guidelines from examples.

All those examples might be thought of as “stuff”—diamond, or good bridges, or age-appropriate movies. But we also want the AI to be able to learn about processes. This is related to Dylan Hadfield-Menell et al.’s work on cooperative inverse reinforcement learning (CIRL), which uses the example of motion on a grid (as is common for toy problems in reinforcement learning—see also DeepMind’s AI safety gridworlds).

There are also broad concepts, like “love,” which seem important to us but which don’t seem to be stuff or processes per se. We might imagine cashing out such abstractions in terms of natural language processing and verbal reasoning, or as variables that help predict stuff and processes. These will come up later, because it does seem reasonable that “human flourishing” might be this sort of concept.

4: Philosophy! *shakes fist*

This reference issue is clearly within the field of philosophy. So it would be really wonderful if we could just go to the philosophy literature and find a recipe for how an AI needs to behave if it’s to learn human referents from human references. Or at least it might have some important insights that would help with developing such a recipe. I thought it was worth a look.

Long story short, it wasn’t. The philosophy literature on reference is largely focused on reference as a thing that inheres in sentences and other communications. Here’s how silly it can get: it is considered a serious problem (by some) how, if there are multiple people named Vanya Ivanova, your spoken sentence about Vanya Ivanova can figure out which one it should really be referring to, so that it can have the right reference-essence.

Since computers can’t perceive reference-essence, what I was looking for was some sort of functional account of how the listener interprets references. And there are certainly people who’ve been thinking more in this direction: Gricean implicature and so on. But even here, people like Kent Bach, sharp people who seem to be going in the necessary direction, aren’t producing work that looks to be of use to AI. The standards of the field just don’t require you to be that precise or that functionalist.

5: What this sequence isn’t

This post has been all about setting the stage and pointing out problems. We started with this dream of an AI design that learns to classify strategies as human-friendly based on a small number of examples of human-friendly actions or states, plus a powerful world-model. And then we immediately got into trouble.

My purpose is not to defend or fix this specific design-dream. It’s to work on the deeper problem that lies behind many individual problems with this design. And by that I mean our ignorance and confusion about how an AI should implement the understanding of reference.

In fact our example AI probably isn’t stably self-improving, or corrigible in Eliezer’s sense of fully updated deference, or human-legible, or fail-safe-ish if we tell the AI the wrong thing. And that’s fine, because that’s not what the sequence is about. The question at hand is how to tell the AI anything at all, and have it understand what we meant, as we meant it.