The easy goal inference problem is still hard

Posted as part of the AI Alignment Forum sequence on Value Learning.

Rohin’s note: In this post (original here), Paul Christiano analyzes the ambitious value learning approach. He considers a more general view of ambitious value learning, where the inferred preferences are not necessarily in the form of a utility function and you can ask the user about their preferences, but it’s fine to imagine that you infer a utility function from data and then optimize it. The key takeaway is that in order to infer preferences that can lead to superhuman performance, it is necessary to understand how humans are biased, which seems very hard to do even with infinite data.

One approach to the AI control problem goes like this:

  1. Observe what the user of the system says and does.

  2. Infer the user’s preferences.

  3. Try to make the world better according to the user’s preferences, perhaps while working alongside the user and asking clarifying questions.

This approach has the major advantage that we can begin empirical work today — we can actually build systems which observe user behavior, try to figure out what the user wants, and then help with that. There are many applications that people care about already, and we can set to work on making rich toy models.

It seems great to develop these capabilities in parallel with other AI progress, and to address whatever difficulties actually arise, as they arise. That is, in each domain where AI can act effectively, we’d like to ensure that AI can also act effectively in the service of goals inferred from users (and that this inference is good enough to support foreseeable applications).

This approach gives us a nice, concrete model of each difficulty we are trying to address. It also provides a relatively clear indicator of whether our ability to control AI lags behind our ability to build it. And by being technically interesting and economically meaningful now, it can help actually integrate AI control with AI practice.

Overall I think that this is a particularly promising angle on the AI safety problem.

Modeling imperfection

That said, I think that this approach rests on an optimistic assumption: that it’s possible to model a human as an imperfect rational agent, and to extract the real values which the human is imperfectly optimizing. Without this assumption, it seems like some additional ideas are necessary.

To isolate this challenge, we can consider a vast simplification of the goal inference problem:

The easy goal inference problem: Given no algorithmic limitations and access to the complete human policy — a lookup table of what a human would do after making any sequence of observations — find any reasonable representation of any reasonable approximation to what that human wants.
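To make the statement above a bit more concrete, here is a minimal sketch in Python of the objects involved. All names, types, and the explicit notion of an "error model" are illustrative choices of mine; the post states the problem purely in prose.

```python
# A minimal formalization sketch of the easy goal inference problem.
# Everything here (names, types, the "error model" decomposition) is an
# illustrative assumption, not something specified by the post itself.
from typing import Callable, Dict, Tuple

Observation = str
Action = str
History = Tuple[Observation, ...]

# What we are given: the complete human policy, i.e. a lookup table from
# every possible observation history to a distribution over actions.
HumanPolicy = Callable[[History], Dict[Action, float]]

# What we want: some representation of "what the human wants" ...
Values = Callable[[History, Action], float]

# ... together with an error model describing how the human's actual
# behavior departs from perfect pursuit of those values.
ErrorModel = Callable[[Values, History], Dict[Action, float]]

def explains(policy: HumanPolicy, values: Values, errors: ErrorModel,
             history: History, tol: float = 1e-6) -> bool:
    """Check whether (values, errors) reproduces the human's behavior.

    The hard part is not this check: many (values, errors) pairs fit the
    same policy equally well, and nothing in the data alone picks out the
    decomposition that matches what the human actually wants.
    """
    predicted, actual = errors(values, history), policy(history)
    return all(abs(predicted.get(a, 0.0) - actual.get(a, 0.0)) <= tol
               for a in set(predicted) | set(actual))
```

Even in this idealized form, the checking step is trivial; the entire difficulty lies in choosing among the many decompositions that pass it.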

I think that this problem remains wide open, and that we’ve made very little headway on the general case. We can make the problem even easier by considering a human in a simple toy universe making relatively simple decisions, but even this leaves us with a very tough problem.

It’s not clear to me whether or exactly how progress in AI will make this problem easier. I can certainly see how enough progress in cognitive science might yield an answer, but it seems much more likely that it will instead tell us “Your question wasn’t well defined.” What do we do then?

I am especially interested in this problem because I think that “business as usual” progress in AI will probably lead to the ability to predict human behavior relatively well, and to emulate the performance of experts. So I really care about the residual — what do we need to know to address AI control, beyond what we need to know to build AI?

Narrow domains

We can solve the very easy goal inference problem in sufficiently narrow domains, where humans can behave approximately rationally and a simple error model is approximately right. So far this has been good enough.

But in the long run, humans make many decisions whose consequences aren’t confined to a simple domain. This approach can work for driving from point A to point B, but probably can’t work for designing a city, running a company, or setting good policies.

There may be an approach which uses inverse reinforcement learning in simple domains as a building block in order to solve the whole AI control problem. Maybe it’s not even a terribly complicated approach. But it’s not a trivial problem, and I don’t think it can be dismissed easily without some new ideas.

Modeling “mistakes” is fundamental

If we want to perform a task as well as an expert, inverse reinforcement learning is clearly a powerful approach.

But in the long term, many important applications require AIs to make decisions which are better than those of available human experts. This is part of the promise of AI, and it is the scenario in which AI control becomes most challenging.

In this context, we can’t use the usual paradigm — “more accurate models are better.” A perfectly accurate model will take us exactly to human mimicry and no farther.

The possible extra oomph of inverse reinforcement learning comes from an explicit model of the human’s mistakes or bounded rationality. It’s what specifies what the AI should do differently in order to be “smarter,” what parts of the human’s policy it should throw out. So it implicitly specifies which of the human behaviors the AI should keep. The error model isn’t an afterthought — it’s the main affair.
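As a toy illustration of that point (the numbers and the specific bias model below are invented purely for this example), the same observed behavior leads to different “improved” behavior depending on which error model we assume:

```python
# Toy example: identical observed choice frequencies, two different error
# models, two different conclusions about what the AI should do. All
# numbers and the bias term are made up for illustration.
import numpy as np

beta = 5.0                              # assumed degree of noisy rationality
observed = np.array([0.6, 0.3, 0.1])    # how often the human picks each of 3 actions

# Error model A: the human is a softmax-noisy optimizer of their true values,
# i.e. P(a) is proportional to exp(beta * value(a)). Inverting this, inferred
# values track observed frequencies, so the "improved" agent simply keeps the
# human's most frequent choice.
values_a = np.log(observed) / beta

# Error model B: the human additionally has a systematic aversion to action 2
# (a "mistake" worth 2 value units) that the AI should not inherit, so the
# inferred value of action 2 is revised upward once the bias is subtracted.
bias = np.array([0.0, 0.0, -2.0])
values_b = np.log(observed) / beta - bias

print("improved action under model A:", int(np.argmax(values_a)))  # 0: keep the human's choice
print("improved action under model B:", int(np.argmax(values_b)))  # 2: override the "mistake"
```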

Modeling “mistakes” is hard

Existing error models for inverse reinforcement learning tend to be very simple, ranging from Gaussian noise in observations of the expert’s behavior or sensor readings, to the assumption that the expert’s choices are randomized with a bias towards better actions.
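The second of these, “randomized with a bias towards better actions,” is usually formalized as Boltzmann (softmax) rationality. A minimal sketch of that error model, using illustrative names and a single-decision setup, might look like this:

```python
# A sketch of the Boltzmann-rationality error model commonly assumed in
# inverse reinforcement learning. Names and the single-decision setup are
# illustrative assumptions, not taken from any particular library.
import numpy as np

def boltzmann_policy(action_values: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """P(a) proportional to exp(beta * value(a)).

    beta -> infinity recovers a perfectly rational chooser; beta -> 0 gives
    uniformly random behavior. Human error is assumed to be exactly this
    one-parameter deviation from optimality.
    """
    z = beta * (action_values - action_values.max())  # subtract max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

def log_likelihood(action_values: np.ndarray, chosen: np.ndarray, beta: float = 1.0) -> float:
    """Log-likelihood of the expert's observed choices under the model.

    IRL methods typically search for the values (or reward function) that
    maximize this, which is where the error model does all of its work.
    """
    probs = boltzmann_policy(action_values, beta)
    return float(np.log(probs[chosen]).sum())
```

Simple as it is, this one-parameter model already determines exactly which deviations from optimality the learner will treat as noise to be discarded.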

In fact humans are not rational agents with some noise on top. Our decisions are the product of a complicated mess of interacting processes, optimized by evolution for the reproduction of our children’s children. It’s not clear there is any good answer to what a “perfect” human would do. If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.

I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.

We can’t use normal AI techniques to learn this kind of model, either — what is it that makes a model good or bad? The standard view — “more accurate models are better” — is fine as long as your goal is just to emulate human performance. But this view doesn’t provide guidance about how to separate the “good” part of human decisions from the “bad” part.

So what?

It’s reasonable to take the attitude “Well, we’ll deal with that problem when it comes up.” But I think that there are a few things that we can do productively in advance.

  • Inverse reinforcement learning / goal inference research motivated by applications to AI control should probably pay particular attention to the issue of modeling mistakes, and to the challenges that arise when trying to find a policy better than the one you are learning from.

  • It’s worth doing more theoretical research to understand this kind of difficulty and how to address it. This research can help identify other practical approaches to AI control, which can then be explored empirically.