The easy goal inference problem is still hard

Posted as part of the AI Alignment Forum sequence on Value Learning.

Rohin’s note: In this post (original here), Paul Christiano analyzes the ambitious value learning approach. He considers a more general view of ambitious value learning where you infer preferences more generally (i.e. not necessarily in the form of a utility function), and you can ask the user about their preferences, but it’s fine to imagine that you infer a utility function from data and then optimize it. The key takeaway is that in order to infer preferences that can lead to superhuman performance, it is necessary to understand how humans are biased, which seems very hard to do even with infinite data.

One approach to the AI control problem goes like this:

  1. Observe what the user of the system says and does.

  2. Infer the user’s preferences.

  3. Try to make the world better according to the user’s preferences, perhaps while working alongside the user and asking clarifying questions.

This approach has the major advantage that we can begin empirical work today — we can actually build systems which observe user behavior, try to figure out what the user wants, and then help with that. There are many applications that people care about already, and we can set to work on making rich toy models.

It seems great to develop these capabilities in parallel with other AI progress, and to address whatever difficulties actually arise, as they arise. That is, in each domain where AI can act effectively, we’d like to ensure that AI can also act effectively in the service of goals inferred from users (and that this inference is good enough to support foreseeable applications).

This approach gives us a nice, concrete model of each difficulty we are trying to address. It also provides a relatively clear indicator of whether our ability to control AI lags behind our ability to build it. And by being technically interesting and economically meaningful now, it can help actually integrate AI control with AI practice.

Overall I think that this is a particularly promising angle on the AI safety problem.

Modeling imperfection

That said, I think that this approach rests on an optimistic assumption: that it’s possible to model a human as an imperfect rational agent, and to extract the real values which the human is imperfectly optimizing. Without this assumption, it seems like some additional ideas are necessary.

To isolate this challenge, we can consider a vast simplification of the goal inference problem:

The easy goal inference problem: Given no algorithmic limitations and access to the complete human policy — a lookup table of what a human would do after making any sequence of observations — find any reasonable representation of any reasonable approximation to what that human wants.

I think that this problem remains wide open, and that we’ve made very little headway on the general case. We can make the problem even easier, by considering a human in a simple toy universe making relatively simple decisions, but it still leaves us with a very tough problem.
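To make the setup concrete, here is a minimal toy instance of the easy goal inference problem. This is an illustrative sketch, not from the post: the states, the ring dynamics, the candidate rewards, and the Boltzmann-rational error model are all assumptions supplied by hand. The "complete human policy" is literally a lookup table, and we score candidate reward functions by how well they explain it.

```python
import math

# Illustrative sketch (all names and dynamics are hypothetical).
# The "complete human policy" is given to us as a lookup table.
STATES = ["a", "b", "c"]
ACTIONS = ["left", "right"]
human_policy = {"a": "right", "b": "right", "c": "left"}

def next_state(state, action):
    """Trivial ring dynamics: 'right' steps forward, 'left' steps back."""
    i = STATES.index(state)
    return STATES[(i + 1) % 3 if action == "right" else (i - 1) % 3]

def boltzmann_loglik(reward, policy, beta=5.0):
    """Log-likelihood of the observed policy under an assumed
    Boltzmann-rational error model: each action is chosen with
    probability proportional to exp(beta * reward of the next state)."""
    total = 0.0
    for s in STATES:
        qs = {a: beta * reward[next_state(s, a)] for a in ACTIONS}
        log_z = math.log(sum(math.exp(q) for q in qs.values()))
        total += qs[policy[s]] - log_z
    return total

# Candidate "values": the human wants to reach exactly one goal state.
candidates = [{s: float(s == goal) for s in STATES} for goal in STATES]
best = max(candidates, key=lambda r: boltzmann_loglik(r, human_policy))
# Under this error model, the policy is best explained by goal state "b".
```

Even in this trivial setting, the difficulty the post describes is visible: the answer depends entirely on the error model and the hypothesis class over rewards, both of which we chose by hand rather than inferred.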

It’s not clear to me whether or exactly how progress in AI will make this problem easier. I can certainly see how enough progress in cognitive science might yield an answer, but it seems much more likely that it will instead tell us “Your question wasn’t well defined.” What do we do then?

I am especially interested in this problem because I think that “business as usual” progress in AI will probably lead to the ability to predict human behavior relatively well, and to emulate the performance of experts. So I really care about the residual — what do we need to know to address AI control, beyond what we need to know to build AI?

Narrow domains

We can solve the very easy goal inference problem in sufficiently narrow domains, where humans can behave approximately rationally and a simple error model is approximately right. So far this has been good enough.

But in the long run, humans make many decisions whose consequences aren’t confined to a simple domain. This approach can work for driving from point A to point B, but probably can’t work for designing a city, running a company, or setting good policies.

There may be an approach which uses inverse reinforcement learning in simple domains as a building block in order to solve the whole AI control problem. Maybe it’s not even a terribly complicated approach. But it’s not a trivial problem, and I don’t think it can be dismissed easily without some new ideas.

Modeling “mistakes” is fundamental

If we want to perform a task as well as an expert, inverse reinforcement learning is clearly a powerful approach.

But in the long term, many important applications require AIs to make decisions which are better than those of available human experts. This is part of the promise of AI, and it is the scenario in which AI control becomes most challenging.

In this context, we can’t use the usual paradigm — “more accurate models are better.” A perfectly accurate model will take us exactly to human mimicry and no farther.

The possible extra oomph of inverse reinforcement learning comes from an explicit model of the human’s mistakes or bounded rationality. It’s what specifies what the AI should do differently in order to be “smarter,” what parts of the human’s policy it should throw out. So it implicitly specifies which of the human behaviors the AI should keep. The error model isn’t an afterthought — it’s the main affair.

Modeling “mistakes” is hard

Existing error models for inverse reinforcement learning tend to be very simple, ranging from Gaussian noise in observations of the expert’s behavior or sensor readings, to the assumption that the expert’s choices are randomized with a bias towards better actions.
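The second kind of error model mentioned above, choices “randomized with a bias towards better actions,” is typically formalized as Boltzmann rationality. A minimal sketch (the Q-values and the rationality parameter beta are illustrative assumptions):

```python
import math

def boltzmann_policy(q_values, beta):
    """Action probabilities under a Boltzmann-rational error model:
    P(action) is proportional to exp(beta * Q(action)).
    As beta -> 0 the agent acts uniformly at random; as beta grows
    it approaches a perfect optimizer."""
    weights = [math.exp(beta * q) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

# Two actions, the second slightly better.
qs = [1.0, 1.2]
nearly_random = boltzmann_policy(qs, beta=0.1)    # roughly [0.5, 0.5]
nearly_optimal = boltzmann_policy(qs, beta=50.0)  # nearly all mass on action 1
```

The single scalar beta is doing all the work of saying how "mistaken" the human is, which is exactly the kind of simplicity the next paragraph objects to.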

In fact humans are not rational agents with some noise on top. Our decisions are the product of a complicated mess of interacting processes, optimized by evolution for the reproduction of our children’s children. It’s not clear there is any good answer to what a “perfect” human would do. If you were to find any principled answer to “what is the human brain optimizing?” the single most likely bet is probably something like “reproductive success.” But this isn’t the answer we are looking for.

I don’t think that writing down a model of human imperfections, which describes how humans depart from the rational pursuit of fixed goals, is likely to be any easier than writing down a complete model of human behavior.

We can’t use normal AI techniques to learn this kind of model, either — what is it that makes a model good or bad? The standard view — “more accurate models are better” — is fine as long as your goal is just to emulate human performance. But this view doesn’t provide guidance about how to separate the “good” part of human decisions from the “bad” part.
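One way to see why predictive accuracy alone can't settle the question: different (reward, error-model) pairs can induce exactly the same behavior, so no amount of behavioral data distinguishes them. A toy illustration with hypothetical numbers, again using a Boltzmann-rational error model:

```python
import math

def boltzmann(q_values, beta):
    """P(action) proportional to exp(beta * Q(action))."""
    weights = [math.exp(beta * q) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

qs = [0.0, 1.0]
# Hypothesis A: modest reward gap, fairly rational human (beta = 2).
p_a = boltzmann(qs, beta=2.0)
# Hypothesis B: twice the reward gap, half the rationality (beta = 1).
p_b = boltzmann([2.0 * q for q in qs], beta=1.0)
# The two hypotheses predict identical behavior, but they disagree about
# how much the human values action 1 -- and an AI trying to outperform
# the human would act differently depending on which one it believes.
```

Both hypotheses are equally "accurate" as models of the data, so something other than accuracy has to pick between them.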

So what?

It’s reasonable to take the attitude “Well, we’ll deal with that problem when it comes up.” But I think that there are a few things that we can do productively in advance.

  • Inverse reinforcement learning / goal inference research motivated by applications to AI control should probably pay particular attention to the issue of modeling mistakes, and to the challenges that arise when trying to find a policy better than the one you are learning from.

  • It’s worth doing more theoretical research to understand this kind of difficulty and how to address it. This research can help identify other practical approaches to AI control, which can then be explored empirically.

The next post in the Value Learning sequence will be ‘Humans can be assigned any values whatsoever…’ by Stuart Armstrong, and will post on Monday 5th November.

Tomorrow’s AI Alignment Forum sequences post will be ‘Robust Delegation’, in the Embedded Agency sequence.