# Learning biases and rewards simultaneously

I’ve finally uploaded to arXiv our work on inferring human biases alongside IRL, which was published at ICML 2019.

### Summary of the paper

#### The IRL Debate

Here’s a quick tour of the debate about inverse reinforcement learning (IRL) and cognitive biases, featuring many of the ideas from the first chapter of the Value Learning sequence:

I had the intuition that the impossibility theorem was like the other no-free-lunch theorems in ML: not actually relevant for what ML could do in practice. So we tried to learn and correct for systematic biases in IRL.

#### The idea behind the algorithms

The basic idea was to learn the planning algorithm by which the human produces demonstrations, and to try to ensure that the planning algorithm captured the appropriate systematic biases. We used a Value Iteration Network to give an inductive bias towards “planners” but otherwise did not assume anything about the form of the systematic bias. [1] Then, we could perform IRL by figuring out which reward would cause the planning algorithm to output the given demonstrations. The reward would be “debiased” because the effect of the biases on the policy would already be accounted for in the planning algorithm.

How could we learn the planning algorithm? Well, one baseline method is to assume that we have access to some tasks where the rewards are known, and use those tasks to learn what the planning algorithm is. Then, once that is learned, we can infer the rewards for new tasks that we haven’t seen before. This requires the planner to generalize across tasks.
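As a rough illustration of this two-phase recipe, here is a minimal sketch of my own (not the paper’s VIN-based implementation). It assumes the “planner” is just a Boltzmann policy over a bandit task with an unknown rationality parameter: phase 1 fits that parameter on a task with known rewards, and phase 2 freezes it and inverts the planner to recover rewards on a new task. All names and the bandit setting are illustrative assumptions.

```python
import numpy as np

def boltzmann_policy(reward, beta):
    """Stand-in 'planner': softmax over rewards with rationality beta."""
    z = beta * np.asarray(reward, dtype=float)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def neg_log_lik(policy, counts):
    """Negative log-likelihood of observed action counts under a policy."""
    return -np.sum(counts * np.log(policy + 1e-12))

rng = np.random.default_rng(0)
true_beta = 2.0  # the demonstrator's (hidden) degree of rationality

# Phase 1: a task with KNOWN rewards; fit the planner parameter by max likelihood.
known_reward = np.array([1.0, 0.0, 0.5])
demos = rng.multinomial(1000, boltzmann_policy(known_reward, true_beta))
grid = np.linspace(0.1, 5.0, 200)
beta_hat = grid[np.argmin([neg_log_lik(boltzmann_policy(known_reward, b), demos)
                           for b in grid])]

# Phase 2: a NEW task with unknown rewards; invert the frozen planner.
new_reward = np.array([0.0, 1.0, 0.2])
new_demos = rng.multinomial(1000, boltzmann_policy(new_reward, true_beta))
freqs = new_demos / new_demos.sum()
# For a softmax planner the inversion is closed-form, up to an additive constant.
reward_hat = np.log(freqs) / beta_hat
reward_hat -= reward_hat[0]  # pin the constant so rewards are comparable

print(beta_hat, reward_hat)
```

With this hypothetical Boltzmann planner the inversion step has a closed form; in the paper the planner is a learned network, so the corresponding step is gradient descent on the reward instead.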

However, it’s kind of cheating to assume access to ground-truth rewards, since we usually wouldn’t have them. What if we learned the planning algorithm and rewards simultaneously? Well, the no-free-lunch theorem bites here: maximizing the true reward and minimizing the negative of the true reward lead to the same policy, so you can’t distinguish between them, and the output of your IRL algorithm could be the true reward or its negation. It would be really bad if our IRL algorithm said exactly the opposite of what we want. But surely we can at least assume that humans are not expected utility minimizers in order to eliminate this possibility.
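To make the unidentifiability concrete, here is a tiny toy illustration of my own (not code from the paper): a Boltzmann “planner” with rationality β and reward r produces exactly the same policy as an anti-rational planner with rationality −β and reward −r, so demonstrations alone cannot distinguish the true reward from its negation.

```python
import numpy as np

def boltzmann_policy(reward, beta):
    """Softmax 'planner': action probabilities proportional to exp(beta * reward)."""
    z = beta * np.asarray(reward, dtype=float)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

reward = np.array([1.0, 0.0, 0.5])

maximizer = boltzmann_policy(reward, beta=2.0)    # rational planner, true reward
minimizer = boltzmann_policy(-reward, beta=-2.0)  # anti-rational planner, negated reward

# The two (planner, reward) pairs induce identical behavior.
print(np.allclose(maximizer, minimizer))  # True
```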

So, we make the assumption that the human is “near-optimal”. We initialize the planning algorithm to be optimal, and then optimize for a planning algorithm that is “near” the optimal planner (in gradient-descent space) and that, combined with the (learned) reward function, explains the demonstrations. You might think that a minimizer is in fact “near” a maximizer; empirically this didn’t turn out to be the case, but I don’t have a particularly compelling reason why that happened.
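Here is the flavor of this second approach in the same toy bandit setting (again my own sketch, with a scalar rationality parameter standing in for the paper’s differentiable planner): initialize the planner parameter at a near-optimal, high-rationality value, then jointly gradient-descend on the planner parameter and the rewards. Because the search starts in the “maximizer” basin, the recovered reward comes out with the right sign rather than negated.

```python
import numpy as np

rng = np.random.default_rng(1)
true_beta, true_reward = 2.0, np.array([1.0, 0.0, 0.5])

def policy(reward, beta):
    """Softmax 'planner' with rationality beta."""
    z = beta * reward
    e = np.exp(z - z.max())
    return e / e.sum()

counts = rng.multinomial(1000, policy(true_reward, true_beta))
n = counts.sum()

# Joint learning: planner parameter beta AND reward vector r are both unknown.
# Initialize beta at a large positive value, i.e. "the planner is near-optimal".
beta, r = 5.0, np.zeros(3)
lr = 1e-4
for _ in range(5000):
    p = policy(r, beta)
    # Analytic gradients of the negative log-likelihood of the demonstrations.
    grad_r = beta * (n * p - counts)
    grad_beta = n * (p @ r) - counts @ r
    r -= lr * grad_r
    beta -= lr * grad_beta

# Only the product beta * r is identified, but the near-optimal initialization
# keeps beta positive, so the ordering of the recovered rewards is correct.
print(beta > 0, np.argmax(r) == np.argmax(true_reward))
```

Initializing beta at a large negative value instead would land in the other basin and recover the negated reward, which is the failure mode the near-optimality assumption is meant to rule out.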

#### Results

Here’s the graph from our paper, showing the performance of various algorithms on some simulated human biases (higher = better). Both of our algorithms get access to the simulated human policies on multiple tasks. Algorithm 1 is the one that gets access to ground-truth rewards for some tasks, while Algorithm 2 is the one that instead tries to ensure that the learned planner is “near” the optimal planner. “Boltzmann” and “Optimal” mean that the algorithm assumes that the human is Boltzmann-rational or optimal, respectively.

Our algorithms work better on average, mostly by being robust to the specific kind of bias that the demonstrator had—they tend to perform on par with the better of the Boltzmann and Optimal baseline algorithms. Surprisingly (to me), the second algorithm sometimes outperforms the first, even though the first algorithm has access to more data (since it gets access to the ground-truth rewards in some tasks). This could be because it exploits the assumption that the demonstrator is near-optimal, which the first algorithm doesn’t do, even though the assumption is correct for most of the models we test. On the other hand, maybe it’s just random noise.

### Implications

#### Superintelligent AI alignment

The most obvious way that this is relevant to AI alignment is that it is progress on ambitious value learning, where we try to learn a utility function that encodes all of human values.

“But wait,” you say, “didn’t you argue that ambitious value learning is unlikely to work?”

Well, yes. At the time that I was doing this work, I believed that ambitious value learning was the only option, and that it seemed hard but not doomed. This was the obvious thing to do to try to advance it. But that was over a year ago; it’s only coming out now because it took a while to publish the paper. (In fact, it predates my state of the world work.) It’s true that now I’m not very hopeful about ambitious value learning, and so this paper’s contribution towards it doesn’t seem particularly valuable to me. However, a few others remain optimistic about ambitious value learning, and if they’re right, this research might be useful for that pathway to aligned AI.

I do think that the paper contributes to narrow value learning, and I still think that this very plausibly will be relevant to AI alignment. It’s a particularly direct attack on the specification problem, with the goal of inferring a specification that leads to a policy that would outperform the demonstrator. That said, I am no longer very optimistic about approaches that require a specific structure (in this case, world models fed into a differentiable planner with an inductive bias that then produces actions), and I am also less optimistic about using approaches that try to mimic expected value calculations, rather than trying to do something more like norm inference.

(However, I still expect that the impossibility result in preference learning will only be a problem in theory, not in practice. It’s just that this particular method of dealing with it doesn’t seem like it will work.)

#### Near-term AI issues

In the near term, we will need better ways than reward functions to specify the behavior that we want from an AI system. Inverse reinforcement learning is probably the leading example of how we could do this. However, since the specific algorithms require much better differentiable planners before they will perform on par with existing algorithms, it may be some time before they are useful. In addition, it’s probably better to use specific bias models in the near term. Overall, I think these methods or ideas are about as likely to be used in the near term as the average paper (which is to say, not very likely).

1. A Value Iteration Network is a fully differentiable neural network that embeds an approximate value iteration algorithm inside a feed-forward classification network. ↩︎

• Planned summary:

Typically, inverse reinforcement learning assumes that the demonstrator is optimal, or that any mistakes they make are caused by random noise. Without a model of how the demonstrator makes mistakes, we should expect that IRL would not be able to outperform the demonstrator. So, a natural question arises: can we learn the systematic mistakes that the demonstrator makes from data? While there is an impossibility result here, we might hope that it is only a problem in theory, not in practice.

In this paper, my coauthors and I propose that we learn the cognitive biases of the demonstrator, by learning their planning algorithm. The hope is that the cognitive biases are encoded in the learned planning algorithm. We can then perform bias-aware IRL by finding the reward function that, when passed into the planning algorithm, results in the observed policy. We have two algorithms which do this: one which assumes that we know the ground-truth rewards for some tasks, and one which tries to keep the learned planner “close to” the optimal planner. In a simple environment with simulated human biases, the algorithms perform better than the standard IRL assumptions of perfect optimality or Boltzmann rationality—but they lose a lot of performance by using an imperfect differentiable planner to learn the planning algorithm.

Planned opinion:

Although this only got published recently, it’s work I did over a year ago. I’m no longer very optimistic about ambitious value learning, and so I’m less excited about its impact on AI alignment now. In particular, it seems unlikely to me that we will need to infer all human values perfectly, without any edge cases or uncertainties, which we then optimize as far as possible. I would instead want to build AI systems that start with an adequate understanding of human preferences, and then learn more over time, in conjunction with optimizing for the preferences they know about. However, this paper is more along the former line of work, at least for long-term AI alignment.

I do think that this is a contribution to the field of inverse reinforcement learning—it shows that by using an appropriate inductive bias, you can become more robust to (cognitive) biases in your dataset. It’s not clear how far this will generalize, since it was tested on simulated biases in simple environments, but I’d expect it to have at least a small effect. In practice though, I expect that you’d get better results by providing more information, as in T-REX.

• I like this example of “works in practice but not in theory.” Would you associate “ambitious value learning vs. adequate value learning” with “works in theory vs. doesn’t work in theory but works in practice”?

One way that “almost rational” is much closer to optimal than “almost anti-anti-rational” is ye olde dot product, but a more accurate description of this case would involve dividing up the model space into basins of attraction. Different training procedures will divide up the space in different ways—this is actually sort of the reverse of a Monte Carlo simulation, where one of the properties you might look for is ergodicity (eventually visiting all points in the space).

• Would you associate “ambitious value learning vs. adequate value learning” with “works in theory vs. doesn’t work in theory but works in practice”?

Potentially. I think the main question is whether adequate value learning will work in practice.