Model Mis-specification and Inverse Reinforcement Learning

Posted as part of the AI Alignment Forum sequence on Value Learning.

Rohin’s note: While I motivated the last post with an example of using a specific model for human biases, in this post (original here), Jacob Steinhardt and Owain Evans point out that model mis-specification can arise in other parts of inverse reinforcement learning as well. The arguments here consider some more practical concerns (for example, the worries about getting only short-term data for each human would not be a problem if you had the entire human policy).

In my previous post, “Latent Variables and Model Mis-specification”, I argued that while machine learning is good at optimizing accuracy on observed signals, it has less to say about correctly inferring the values of unobserved variables in a model. In this post I’d like to focus on a specific context for this: inverse reinforcement learning (Ng et al. 2000, Abbeel et al. 2004, Ziebart et al. 2008, Ho et al. 2016), where one observes the actions of an agent and wants to infer the preferences and beliefs that led to those actions. For this post, I am pleased to be joined by Owain Evans, who is an active researcher in this area and has co-authored an online book about building models of agents (see here in particular for a tutorial on inverse reinforcement learning and inverse planning).

Owain and I are particularly interested in inverse reinforcement learning (IRL) because it has been proposed (most notably by Stuart Russell) as a method for learning human values in the context of AI safety; among other things, this would eventually involve learning and correctly implementing human values by artificial agents that are much more powerful, and act with much broader scope, than any humans alive today. While we think that overall IRL is a promising route to consider, we believe that there are also a number of non-obvious pitfalls related to performing IRL with a mis-specified model. The role of IRL in AI safety is to infer human values, which are represented by a reward function or utility function. But crucially, human values (or human reward functions) are never directly observed.

Below, we elaborate on these issues. We hope that by being more aware of these issues, researchers working on inverse reinforcement learning can anticipate and address the resulting failure modes. In addition, we think that considering issues caused by model mis-specification in a particular concrete context can better elucidate the general issues pointed to in the previous post on model mis-specification.

Specific Pitfalls for Inverse Reinforcement Learning

In “Latent Variables and Model Mis-specification”, Jacob talked about model mis-specification, where the “true” model does not lie in the model family being considered. We encourage readers to read that post first, though we’ve also tried to make the below readable independently.

In the context of inverse reinforcement learning, one can see some specific problems that might arise due to model mis-specification. For instance, the following are things we could misunderstand about an agent, which would cause us to make incorrect inferences about the agent’s values:

  • The actions of the agent. If we believe that an agent is capable of taking a certain action, but in reality they are not, we might make strange inferences about their values (for instance, that they highly value not taking that action). Furthermore, if our data is e.g. videos of human behavior, we have an additional inference problem of recognizing actions from the frames.

  • The information available to the agent. If an agent has access to more information than we think it does, then a plan that seems irrational to us (from the perspective of a given reward function) might actually be optimal for reasons that we fail to appreciate. In the other direction, if an agent has less information than we think, then we might incorrectly believe that they don’t value some outcome A, even though they really only failed to obtain A due to lack of information.

  • The long-term plans of the agent. An agent might take many actions that are useful in accomplishing some long-term goal, but not necessarily over the time horizon that we observe the agent. Inferring correct values thus also requires inferring such long-term goals. In addition, long time horizons can make models more brittle, thereby exacerbating model mis-specification issues.

There are likely other sources of error as well. The general point is that, given a mis-specified model of the agent, it is easy to make incorrect inferences about an agent’s values if the optimization pressure on the learning algorithm is only towards predicting actions correctly in-sample.

In the remainder of this post, we will cover each of the above aspects — actions, information, and plans — in turn, giving both quantitative models and qualitative arguments for why model mis-specification for that aspect of the agent can lead to perverse beliefs and behavior. First, though, we will briefly review the definition of inverse reinforcement learning and introduce relevant notation.

Inverse Reinforcement Learning: Definition and Notation

In inverse reinforcement learning, we want to model an agent taking actions in a given environment. We therefore suppose that we have a state space S (the set of states the agent and environment can be in), an action space A (the set of actions the agent can take), and a transition function T(s′ | s, a), which gives the probability of moving from state s to state s′ when taking action a. For instance, for an AI learning to control a car, the state space would be the possible locations and orientations of the car, the action space would be the set of control signals that the AI could send to the car, and the transition function would be the dynamics model for the car. The tuple (S, A, T) is called an MDP∖R, which is a Markov Decision Process without a reward function. (The MDP∖R will either have a known horizon or a discount rate, but we’ll leave these out for simplicity.)
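As a concrete illustration, the MDP∖R tuple can be sketched as a small data structure. This is a minimal sketch with illustrative names (not taken from any particular IRL library):

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# A minimal MDP\R: states, actions, and a transition function T(s' | s, a),
# but deliberately no reward function. All names are illustrative.
@dataclass
class MDPNoReward:
    states: list
    actions: list
    # transition[(s, a)] maps each next state s' to its probability
    transition: Dict[Tuple[str, str], Dict[str, float]]

    def next_state_probs(self, s: str, a: str) -> Dict[str, float]:
        return self.transition[(s, a)]

# Tiny example: a two-state world where "go" moves deterministically.
mdp = MDPNoReward(
    states=["home", "cafe"],
    actions=["stay", "go"],
    transition={
        ("home", "stay"): {"home": 1.0},
        ("home", "go"): {"cafe": 1.0},
        ("cafe", "stay"): {"cafe": 1.0},
        ("cafe", "go"): {"home": 1.0},
    },
)
print(mdp.next_state_probs("home", "go"))  # {'cafe': 1.0}
```

The point of leaving the reward out of the structure is that it is exactly the piece IRL must infer.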

Figure 1: Diagram showing how IRL and RL are related. (Credit: Pieter Abbeel’s slides on IRL)

The inference problem for IRL is to infer a reward function R given an optimal policy π* for the MDP∖R (see Figure 1). We learn about the policy π* from samples (sᵢ, aᵢ) of states and the corresponding actions according to π* (which may be random). Typically, these samples come from a trajectory, which records the full history of the agent’s states and actions in a single episode:

(s₀, a₀), (s₁, a₁), …, (sₙ, aₙ)

In the car example, this would correspond to the actions taken by an expert human driver who is demonstrating desired driving behaviour (where the actions would be recorded as the signals to the steering wheel, brake, etc.).

Given the MDP∖R and the observed trajectory, the goal is to infer the reward function R. In a Bayesian framework, if we specify a prior on R, we have:

P(R | s₀:ₙ, a₀:ₙ) ∝ P(a₀:ₙ | s₀:ₙ, R) · P(R)

The likelihood P(a₀:ₙ | s₀:ₙ, R) is just ∏ᵢ π_R(aᵢ | sᵢ), where π_R is the optimal policy under the reward function R. Note that computing the optimal policy given the reward is in general non-trivial; except in simple cases, we typically approximate the policy using reinforcement learning (see Figure 1). Policies are usually assumed to be noisy (e.g. using a softmax instead of deterministically taking the best action). Due to the challenges of specifying priors, computing optimal policies, and integrating over reward functions, most work in IRL uses some kind of approximation to the Bayesian objective (see the references in the introduction for some examples).
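To make the Bayesian objective concrete, here is a minimal sketch of the computation for a one-state problem with a discrete set of candidate reward functions and a softmax-noisy policy. The setup, prior, and all names are illustrative, not from any existing IRL implementation:

```python
import math

def softmax_policy(reward, beta=1.0):
    """P(action | reward) under a softmax over one-step rewards."""
    exps = {a: math.exp(beta * r) for a, r in reward.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def posterior(candidates, prior, observed_actions, beta=1.0):
    """P(R | actions) ∝ P(actions | R) P(R), normalized over the candidates."""
    unnorm = []
    for reward, p in zip(candidates, prior):
        pi = softmax_policy(reward, beta)
        lik = math.prod(pi[a] for a in observed_actions)
        unnorm.append(p * lik)
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Two hypotheses: the agent values work (R = 1) or disvalues it (R = -1).
candidates = [{"work": 1.0, "relax": 0.0}, {"work": -1.0, "relax": 0.0}]
prior = [0.5, 0.5]
post = posterior(candidates, prior, ["work", "work", "work"])
print(post)  # most posterior mass lands on the "values work" hypothesis
```

Real IRL problems replace the one-step softmax with a (usually approximate) optimal policy for the full MDP, which is where most of the computational difficulty lives.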

Recognizing Human Actions in Data

IRL is a promising approach to learning human values in part because of the easy availability of data. For supervised learning, humans need to produce many labeled instances specialized for a task. IRL, by contrast, is an unsupervised/semi-supervised approach where any record of human behavior is a potential data source. Facebook’s logs of user behavior provide trillions of data-points. YouTube videos, history books, and literature are a trove of data on human behavior in both actual and imagined scenarios. However, while there is lots of existing data that is informative about human preferences, we argue that exploiting this data for IRL will be a difficult, complex task with current techniques.

Inferring Reward Functions from Video Frames

As we noted above, applications of IRL typically infer the reward function R from observed samples of the human policy π. Formally, the environment is a known MDP∖R and the observations are state-action pairs (sᵢ, aᵢ). This assumes that (a) the environment’s dynamics are given as part of the IRL problem, and (b) the observations are structured as “state-action” pairs. When the data comes from a human expert parking a car, these assumptions are reasonable. The states and actions of the driver can be recorded, and a car simulator can be used for the transition function T. For data from YouTube videos or history books, the assumptions fail. The data is a sequence of partial observations: the transition function is unknown and the data does not separate out state and action. Indeed, it’s a challenging ML problem to infer human actions from text or videos.

Movie still: What actions are being performed in this situation? (Source)

As a concrete example, suppose the data is a video of two co-pilots flying a plane. The successive frames provide only limited information about the state of the world at each time step, and the frames often jump forward in time. So it’s more like a POMDP with a complex observation model. Moreover, the actions of each pilot need to be inferred. This is a challenging inference problem, because actions can be subtle (e.g. when a pilot nudges the controls or nods to his co-pilot).

To infer actions from observations, some model relating the true state-action pair to the observed video frame must be used. But choosing any model makes substantive assumptions about how human values relate to their behavior. For example, suppose someone attacks one of the pilots and (as a reflex) he defends himself by hitting back. Is this reflexive or instinctive response (hitting the attacker) an action that is informative about the pilot’s values? Philosophers and neuroscientists might investigate this by considering the mental processes that occur before the pilot hits back. If an IRL algorithm uses an off-the-shelf action classifier, it will lock in some (contentious) assumptions about these mental processes. At the same time, an IRL algorithm cannot learn such a model because it never directly observes the mental processes that relate rewards to actions.

Inferring Policies From Video Frames

When learning a reward function via IRL, the ultimate goal is to use the reward function to guide an artificial agent’s behavior (e.g. to perform useful tasks for humans). This goal can be formalized directly, without including IRL as an intermediate step. For example, in Apprenticeship Learning, the goal is to learn a “good” policy for the MDP∖R from samples of the human’s policy π (where π is assumed to approximately optimize an unknown reward function). In Imitation Learning, the goal is simply to learn a policy that is similar to the human’s policy.

Like IRL, policy search techniques need to recognize an agent’s actions to infer their policy. So they have the same challenges as IRL in learning from videos or history books. Unlike IRL, policy search does not explicitly model the reward function that underlies an agent’s behavior. This leads to an additional challenge. Humans and AI systems face vastly different tasks and have different action spaces. Most actions in videos and books would never be performed by a software agent. Even when tasks are similar (e.g. humans driving in the 1930s vs. a self-driving car in 2016), it is a difficult transfer learning problem to use human policies in one task to improve AI policies in another.

IRL Needs Curated Data

We argued that records of human behaviour in books and videos are difficult for IRL algorithms to exploit. Data from Facebook seems more promising: we can store the state (e.g. the HTML or pixels displayed to the human) and each human action (clicks and scrolling). This extends beyond Facebook to any task that can be performed on a computer. While this covers a broad range of tasks, there are obvious limitations. Many people in the world have a limited ability to use a computer: we can’t learn about their values in this way. Moreover, some kinds of human preferences (e.g. preferences over physical activities) seem hard to learn about from behaviour on a computer.

Information and Biases

Human actions depend on both their preferences and their beliefs. The beliefs, like the preferences, are never directly observed. For narrow tasks (e.g. people choosing their favorite photos from a display), we can model humans as having full knowledge of the state (as in an MDP). But for most real-world tasks, humans have limited information and their information changes over time (as in a POMDP or RL problem). If IRL assumes the human has full information, then the model is mis-specified, and generalizations about what the human would prefer in other scenarios can be mistaken. Here are some examples:

  • Someone travels from their house to a cafe, which has already closed. If they are assumed to have full knowledge, then IRL would infer an alternative preference (e.g. going for a walk) rather than a preference to get a drink at the cafe.

  • Someone takes a drug that is widely known to be ineffective. This could be because they have a false belief that the drug is effective, or because they picked up the wrong pill, or because they take the drug for its side-effects. Each possible explanation could lead to different conclusions about preferences.

  • Suppose an IRL algorithm is inferring a person’s goals from key-presses on their laptop. The person repeatedly forgets their login passwords and has to reset them. This behavior is hard to capture with a POMDP-style model: humans forget some strings of characters and not others. IRL might infer that the person intends to repeatedly reset their passwords.

The above arises from humans forgetting information — even if the information is only a short string of characters. This is one way in which humans systematically deviate from rational Bayesian agents. The field of psychology has documented many other deviations. Below we discuss one such deviation — time-inconsistency — which has been used to explain temptation, addiction and procrastination.

Time-inconsistency and Procrastination

An IRL algorithm is inferring Alice’s preferences. In particular, the goal is to infer Alice’s preference for completing a somewhat tedious task (e.g. writing a paper) as opposed to relaxing. Alice has T days in which she could complete the task, and IRL observes her working or relaxing on each successive day.

Figure 2. MDP graph for choosing whether to “work” or “wait” (relax) on a task.

Formally, let R be the preference/reward Alice assigns to completing the task. Each day, Alice can “work” (receiving cost w for doing tedious work) or “wait” (cost 0). If she works, she later receives the reward R minus a tiny, linearly increasing cost (because it’s better to submit a paper earlier). Beyond the deadline at day T, Alice cannot get the reward R. For IRL, we fix w and T and infer R.

Suppose Alice chooses “wait” on Day 1. If she were fully rational, it follows that R (the preference for completing the task) is small compared to w (the psychological cost of doing the tedious work). In other words, Alice doesn’t care much about completing the task. Rational agents will do the task on Day 1 or never do it. Yet humans often care deeply about tasks yet leave them until the last minute (when finishing early would be optimal). Here we imagine that Alice has T = 9 days to complete the task and waits until the last possible day.

Figure 3: Graph showing IRL inferences for the Optimal model (which is mis-specified) and the Possibly Discounting model (which includes hyperbolic discounting). On each day (x-axis) the model gets another observation of Alice’s choice. The y-axis shows the posterior mean for R (the reward for the task), with the cost w of the tedious work held fixed.

Figure 3 shows results from running IRL on this problem. There is an “Optimal” model, where the agent is optimal up to an unknown level of softmax random noise (a typical assumption for IRL). There is also a “Possibly Discounting” model, where the agent is either softmax-optimal or is a hyperbolic discounter (with an unknown level of discounting). We do joint Bayesian inference over the completion reward R, the softmax noise, and (for “Possibly Discounting”) how much the agent hyperbolically discounts. The work cost w is held fixed. Figure 3 shows that after 6 days of observing Alice procrastinate, the “Optimal” model is very confident that Alice does not care about the task. When Alice completes the task on the last possible day, the posterior mean on R is not much more than the prior mean. By contrast, the “Possibly Discounting” model never becomes confident that Alice doesn’t care about the task. (Note that the gap between the models would be bigger for larger R. The “Optimal” model’s posterior on R shoots back to its Day-0 prior because it explains the whole action sequence as due to high softmax noise — optimal agents without noise would either do the task immediately or not at all. Full details and code are here.)
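The mechanism that lets the Possibly Discounting model explain Alice’s behavior can be shown in a few lines of arithmetic: a hyperbolic discounter weights a payoff at delay t by 1/(1 + k·t), so even when R is much larger than w, working "tomorrow" can look better than working today — and it looks that way again every day. The parameter values below are illustrative, not those used in the actual experiment, and the tiny linearly increasing cost is omitted for simplicity:

```python
# Perceived value today of the plan "work after a delay of t days":
# pay cost w at delay t, receive reward R one step later, each weighted
# by the hyperbolic discount factor 1 / (1 + k * delay).
def plan_value(R, w, t, k):
    return -w / (1 + k * t) + R / (1 + k * (t + 1))

R, w, k = 3.0, 1.0, 2.0   # Alice genuinely values the task: R is 3x the cost w
today = plan_value(R, w, t=0, k=k)     # work now
tomorrow = plan_value(R, w, t=1, k=k)  # put it off one day
print(today, tomorrow)  # delaying looks strictly better, so Alice waits
```

Because the same comparison is made afresh each morning, the discounter keeps waiting until the deadline forces her hand, even though she cares about the task far more than she dislikes the work.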

Long-term Plans

Agents will often take long series of actions that generate negative utility for them in the moment in order to accomplish a long-term goal (for instance, studying every night in order to perform well on a test). Such long-term plans can make IRL more difficult for a few reasons. Here we focus on two: (1) IRL systems may not have access to the right type of data for learning about long-term goals, and (2) needing to predict long sequences of actions can make algorithms more fragile in the face of model mis-specification.

(1) Wrong type of data. To make inferences based on long-term plans, it would be helpful to have coherent data about a single agent’s actions over a long period of time (so that we can e.g. see the plan unfolding). But in practice we will likely have substantially more data consisting of short snapshots of a large number of different agents (e.g. because many internet services already record user interactions, but it is uncommon for a single person to be exhaustively tracked and recorded over an extended period of time even while they are offline).

The former type of data (about a single representative population measured over time) is called panel data, while the latter type of data (about different representative populations measured at each point in time) is called repeated cross-section data. The differences between these two types of data are well-studied in econometrics, and a general theme is the following: it is difficult to infer individual-level effects from cross-sectional data.

An easy and familiar example of this difference (albeit not in an IRL setting) can be given in terms of election campaigns. Most campaign polling is cross-sectional in nature: a different population of respondents is polled at each point in time. Suppose that Hillary Clinton gives a speech and her overall support according to cross-sectional polls increases by 2%; what can we conclude from this? Does it mean that 2% of people switched from Trump to Clinton? Or did 6% of people switch from Trump to Clinton while 4% switched from Clinton to Trump?
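The ambiguity is easy to see numerically: both switching patterns in the example produce exactly the same net cross-sectional shift. A trivial sketch with the numbers from the text:

```python
# Net change in Clinton's cross-sectional support, given the gross
# percentages of voters switching in each direction.
def net_shift(trump_to_clinton, clinton_to_trump):
    return trump_to_clinton - clinton_to_trump

scenario_a = net_shift(trump_to_clinton=2.0, clinton_to_trump=0.0)
scenario_b = net_shift(trump_to_clinton=6.0, clinton_to_trump=4.0)
print(scenario_a, scenario_b)  # 2.0 2.0 -- the poll cannot tell them apart
```

The cross-sectional poll observes only the net quantity, so the individual-level flows are unidentified without panel data or further assumptions.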

At a minimum, then, using cross-sectional data leads to a difficult disaggregation problem; for instance, different agents taking different actions at a given point in time could be due to being at different stages in the same plan, or due to having different plans, or some combination of these and other factors. Collecting demographic and other side data can help us (by allowing us to look at variation and shifts within each subpopulation), but it is unclear if this will be sufficient in general.

On the other hand, there are some services (such as Facebook or Google) that do have extensive data about individual users across a long period of time. However, this data has another issue: it is incomplete in a very systematic way (since it only tracks online behaviour). For instance, someone might go online most days to read course notes and Wikipedia for a class; this is data that would likely be recorded. However, it is less likely that one would have a record of that person taking the final exam, passing the class and then getting an internship based on their class performance. Of course, some pieces of this sequence would be inferable based on some people’s e-mail records, etc., but it would likely be under-represented in the data relative to the record of Wikipedia usage. In either case, some non-trivial degree of inference would be necessary to make sense of such data.

(2) Fragility to mis-specification. Above we discussed why observing only short sequences of actions from an agent can make it difficult to learn about their long-term plans (and hence to reason correctly about their values). Next we discuss another potential issue — fragility to model mis-specification.

Suppose someone spends 99 days doing a boring task to accomplish an important goal on day 100. A system that is only trying to correctly predict actions will be right 99% of the time if it predicts that the person inherently enjoys boring tasks. Of course, a system that understands the goal and how the tasks lead to the goal will be right 100% of the time, but even minor errors in its understanding could bring the accuracy back below 99%.
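A quick sketch of that accuracy comparison, using made-up action labels for the 99-day example above:

```python
# The person does the boring task for 99 days, then pursues the goal on day 100.
observed = ["boring"] * 99 + ["goal"]

# A mis-specified model: "they just enjoy boring tasks".
wrong_model = ["boring"] * 100
# A model that understands the long-term plan.
right_model = ["boring"] * 99 + ["goal"]

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

print(accuracy(wrong_model, observed))  # 0.99
print(accuracy(right_model, observed))  # 1.0
```

The one-percentage-point gap is all that separates a model that gets the person’s values completely wrong from one that gets them right, which is why small errors elsewhere can flip the comparison.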

The general issue is the following: large changes in the model of the agent might only lead to small changes in the predictive accuracy of the model, and the longer the time horizon on which a goal is realized, the more this might be the case. This means that even slight mis-specifications in the model could tip the scales back in favor of a (very) incorrect reward function. A potential way of dealing with this might be to identify “important” predictions that seem closely tied to the reward function, and focus particularly on getting those predictions right (see here for a paper exploring a similar idea in the context of approximate inference).

One might object that this is only a problem in this toy setting; for instance, in the real world, one might look at the particular way in which someone is studying or performing some other boring task to see that it coherently leads towards some goal (in a way that would be less likely were the person doing something boring purely for enjoyment). In other words, correctly understanding the agent’s goals might allow for more fine-grained accurate predictions, which would fare better under e.g. log-score than would an incorrect model.

This is a reasonable objection, but there are some historical examples of this going wrong that should give one pause. That is, there are historical instances where: (i) people expected a more complex model that seemed to get at some underlying mechanism to outperform a simpler model that ignored that mechanism, and (ii) they were wrong (the simpler model did better under log-score). The example we are most familiar with is n-gram models vs. parse trees for language modelling; the most successful language models (in terms of having the best log-score on predicting the next word given a sequence of previous words) essentially treat language as a high-order Markov chain or hidden Markov model, despite the fact that linguistic theory predicts that language should be tree-structured rather than linearly-structured. Indeed, NLP researchers have tried building language models that assume language is tree-structured, and these models perform worse, or at least do not seem to have been adopted in practice (this is true both for older discrete models and newer continuous models based on neural nets). It’s plausible that a similar issue will occur in inverse reinforcement learning, where correctly inferring plans is not enough to win out in predictive performance. The reason for the two issues might be quite similar: in language modelling, the tree structure only wins out in statistically uncommon corner cases involving long-term and/or nested dependencies, and hence getting that part of the prediction correct doesn’t help predictive accuracy much.

The overall point is: in the case of even slight model mis-specification, the “correct” model might actually perform worse under typical metrics such as predictive accuracy. Therefore, more careful methods of constructing a model might be necessary.

Learning Values != Robustly Predicting Human Behaviour

The problems with IRL described so far will result in poor performance for predicting human choices out-of-sample. For example, if someone is observed doing boring tasks for 99 days (where they only achieve the goal on Day 100), they’ll be predicted to continue doing boring tasks even when a short-cut to the goal becomes available. So even if the goal is simply to predict human behaviour (not to infer human values), mis-specification leads to bad predictions on realistic out-of-sample scenarios.

Let’s suppose that our goal is not to predict human behaviour but to create AI systems that promote and respect human values. These goals (predicting humans and building safe AI) are distinct. Here’s an example that illustrates the difference. Consider a long-term smoker, Bob, who would continue smoking even if there were (counterfactually) a universally effective anti-smoking treatment. Maybe Bob is in denial about the health effects of smoking, or Bob thinks he’ll inevitably go back to smoking whatever happens. If an AI system were assisting Bob, we might expect it to avoid promoting his smoking habit (e.g. by not offering him cigarettes at random moments). This is not paternalism, where the AI system imposes someone else’s values on Bob. The point is that even if Bob would continue smoking across many counterfactual scenarios, this doesn’t mean that he places value on smoking.

How do we choose between the theory that Bob values smoking and the theory that he does not (but smokes anyway because of the powerful addiction)? Humans choose between these theories based on our experience with addictive behaviours and our insights into people’s preferences and values. This kind of insight can’t easily be captured as formal assumptions about a model, or even as a criterion about counterfactual generalization. (The theory that Bob values smoking does make accurate predictions across a wide range of counterfactuals.) Because of this, learning human values from IRL involves a more profound kind of model mis-specification than the examples in Jacob’s previous post. Even in the limit of data generated from an infinite series of random counterfactual scenarios, standard IRL algorithms would not infer someone’s true values.

Predicting human actions is neither necessary nor sufficient for learning human values. In what ways, then, are the two related? One such way stems from the premise that if someone spends more resources making a decision, the resulting decision tends to be more in keeping with their true values. For instance, someone might spend lots of time thinking about the decision, they might consult experts, or they might try out the different options in a trial period before they make the real decision. Various authors have thus suggested that people’s choices under sufficient “reflection” act as a reliable indicator of their true values. Under this view, predicting a certain kind of behaviour (choices under reflection) is sufficient for learning human values. Paul Christiano has written about some proposals for doing this, though we will not discuss them here (the first link is for general AI systems while the second is for newsfeeds). In general, turning these ideas into algorithms that are tractable and learn safely remains a challenging problem.

Further reading

There is research on doing IRL for agents in POMDPs. Owain and collaborators explored the effects of limited information and cognitive biases on IRL: paper, paper, online book.

For many environments it will not be possible to identify the reward function from the observed trajectories. These identification problems are related to the mis-specification problems but are not the same thing. Active learning can help with identification (paper).

Paul Christiano raised many similar points about mis-specification in a post on his blog.

For a big-picture monograph on relations between human preferences, economic utility theory and welfare/well-being, see Hausman’s “Preference, Value, Choice and Welfare”.


Thanks to Sindy Li for reviewing a full draft of this post and providing many helpful comments. Thanks also to Michael Webb and Paul Christiano for doing the same on specific sections of the post.