Learning biases and rewards simultaneously

I’ve finally uploaded to arXiv our work on inferring human biases alongside IRL, which was published at ICML 2019.

Summary of the paper

The IRL Debate

Here’s a quick tour of the debate about inverse reinforcement learning (IRL) and cognitive biases, featuring many of the ideas from the first chapter of the Value Learning sequence:



I had the intuition that the impossibility theorem was like the other no-free-lunch theorems in ML: not actually relevant for what ML could do in practice. So we tried to learn and correct for systematic biases in IRL.

The idea behind the algorithms

The basic idea was to learn the planning algorithm by which the human produces demonstrations, and try to ensure that the planning algorithm captured the appropriate systematic biases. We used a Value Iteration Network to give an inductive bias towards “planners” but otherwise did not assume anything about the form of the systematic bias. [1] Then, we could perform IRL by figuring out which reward would cause the planning algorithm to output the given demonstrations. The reward would be “debiased” because the effect of the biases on the policy would already be accounted for in the planning algorithm.
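
To make this concrete, here is a minimal sketch of the setup, not the paper’s actual code: a VIN-style differentiable planner maps a reward function to a policy, and a behavioral-cloning loss lets gradients flow through the planner back to the reward. The gridworld shapes, layer sizes, hyperparameters, and names (`DifferentiablePlanner`, `bc_loss`) are illustrative assumptions of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiablePlanner(nn.Module):
    """Approximate value iteration with a *learned* backup step.

    Because the backup is a learned convolution rather than a hard-coded
    Bellman update, the module can in principle represent systematically
    biased planning, not just optimal planning.
    """
    def __init__(self, num_actions=4, num_iterations=10):
        super().__init__()
        self.num_iterations = num_iterations
        # Produces per-action Q-value maps from the [reward, value] stack.
        self.q_conv = nn.Conv2d(2, num_actions, kernel_size=3, padding=1)

    def forward(self, reward_map):
        # reward_map: (batch, 1, H, W)
        value = torch.zeros_like(reward_map)
        for _ in range(self.num_iterations):
            q = self.q_conv(torch.cat([reward_map, value], dim=1))
            value, _ = q.max(dim=1, keepdim=True)
        return q  # per-state action logits, shape (batch, num_actions, H, W)

def bc_loss(planner, reward_map, states, actions):
    """Cross-entropy between the planner's policy and the demonstrated actions.

    states: (N, 2) integer grid coordinates; actions: (N,) action indices.
    """
    logits = planner(reward_map)                              # (1, A, H, W)
    per_state = logits[0, :, states[:, 0], states[:, 1]].t()  # (N, A)
    return F.cross_entropy(per_state, actions)
```

The point of learning the planner’s weights from demonstrations is that the planner absorbs the demonstrator’s systematic biases; the reward then only has to explain what the demonstrator was trying to do, not how imperfectly they did it.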

How could we learn the planning algorithm? Well, one baseline method is to assume that we have access to some tasks where the rewards are known, and use those tasks to learn what the planning algorithm is. Then, once that is learned, we can infer the rewards for new tasks that we haven’t seen before. This requires the planner to generalize across tasks.
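
A sketch of what this baseline might look like, reusing the `DifferentiablePlanner` and `bc_loss` from the sketch above; the task format, function names, and hyperparameters are assumptions of mine, not the paper’s interface.

```python
import torch

def learn_planner_with_known_rewards(planner, training_tasks, epochs=50, lr=1e-3):
    """Phase 1: fit the planner on tasks whose ground-truth rewards are given.

    training_tasks: iterable of (reward_map, states, actions) tuples.
    """
    opt = torch.optim.Adam(planner.parameters(), lr=lr)
    for _ in range(epochs):
        for reward_map, states, actions in training_tasks:
            opt.zero_grad()
            loss = bc_loss(planner, reward_map, states, actions)
            loss.backward()
            opt.step()
    return planner

def infer_reward_with_frozen_planner(planner, states, actions, grid_shape,
                                     steps=500, lr=1e-2):
    """Phase 2: hold the learned planner fixed and optimize only the reward."""
    for p in planner.parameters():
        p.requires_grad_(False)
    reward_map = torch.zeros(1, 1, *grid_shape, requires_grad=True)
    opt = torch.optim.Adam([reward_map], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = bc_loss(planner, reward_map, states, actions)
        loss.backward()
        opt.step()
    return reward_map.detach()
```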

However, it’s kind of cheating to assume access to ground truth rewards, since we usually wouldn’t have them. What if we learned the planning algorithm and rewards simultaneously? Well, the no-free-lunch theorem gets us then: maximizing the true reward and minimizing its negation lead to exactly the same policy, so once the planner is also being learned, the demonstrations can’t distinguish the true reward (paired with a maximizing planner) from its negation (paired with a minimizing planner), and the output of your IRL algorithm could be either one. It would be really bad if our IRL algorithm said exactly the opposite of what we want. But surely we can at least assume that humans are not expected utility minimizers in order to eliminate this possibility.
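
To spell out the symmetry behind that worry (the notation here is mine, not the paper’s):

```latex
% Write P*(R) for the policy an optimal planner produces given reward R, and
% define the "anti-optimal" planner by P^-(R) := P*(-R). Then
\[
  P^{*}(R) \;=\; P^{-}(-R),
\]
% so the pairs (P*, R) and (P^-, -R) generate identical demonstrations, and
% demonstrations alone cannot tell the true reward apart from its negation
% once the planner is also being learned.
```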

So, we make the assumption that the human is “near-optimal”. We initialize the planning algorithm to be optimal, and then search, by gradient descent, for a planning algorithm “near” the optimal one that, combined with the (learned) reward function, explains the demonstrations. You might think that a minimizer is in fact “near” a maximizer; empirically this didn’t turn out to be the case, but I don’t have a particularly compelling reason why that happened.
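
Correspondingly, here is a hedged sketch of this second scheme, again reusing the planner and loss from above; the pretraining data, helper names, and hyperparameters are illustrative assumptions, meant only to show where the near-optimal initialization enters.

```python
import torch
import torch.nn.functional as F

def initialize_near_optimal(planner, synthetic_tasks, epochs=50, lr=1e-3):
    """Pretrain the planner to imitate optimal policies on tasks with random
    known rewards, so joint training starts from (approximately) optimal
    planning rather than from an arbitrary planner."""
    opt = torch.optim.Adam(planner.parameters(), lr=lr)
    for _ in range(epochs):
        for reward_map, optimal_actions in synthetic_tasks:
            # optimal_actions: (H, W) optimal action at every state, computed
            # with exact value iteration on the known reward.
            opt.zero_grad()
            logits = planner(reward_map)  # (1, A, H, W)
            loss = F.cross_entropy(logits, optimal_actions.unsqueeze(0))
            loss.backward()
            opt.step()
    return planner

def jointly_infer(planner, states, actions, grid_shape, steps=500, lr=1e-2):
    """Fine-tune the planner and the reward map together on demonstrations."""
    reward_map = torch.zeros(1, 1, *grid_shape, requires_grad=True)
    opt = torch.optim.Adam(list(planner.parameters()) + [reward_map], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = bc_loss(planner, reward_map, states, actions)
        loss.backward()
        opt.step()
    return planner, reward_map.detach()
```

Note that in this sketch the only thing pushing the learned planner towards “near-optimality” is the initialization plus gradient descent’s tendency to stay close to it; there is no explicit penalty for straying, which matches the informal argument above.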

Results

Here’s the graph from our paper, showing the performance of various algorithms on some simulated human biases (higher = better). Both of our algorithms get access to the simulated human policies on multiple tasks. Algorithm 1 is the one that gets access to ground-truth rewards for some tasks, while Algorithm 2 is the one that instead tries to ensure that the learned planner is “near” the optimal planner. “Boltzmann” and “Optimal” mean that the algorithm assumes that the human is Boltzmann rational and optimal respectively.



Our algorithms work better on average, mostly by being robust to the specific kind of bias that the demonstrator had: they tend to perform on par with the better of the Boltzmann and Optimal baseline algorithms. Surprisingly (to me), the second algorithm sometimes outperforms the first, even though the first algorithm has access to more data (since it gets access to the ground truth rewards in some tasks). This could be because it exploits the assumption that the demonstrator is near-optimal, which the first algorithm doesn’t do, even though the assumption is correct for most of the models we test. On the other hand, maybe it’s just random noise.

Implications

Superintelligent AI alignment

The most obvious way that this is relevant to AI alignment is that it is progress on ambitious value learning, where we try to learn a utility function that encodes all of human values.

“But wait,” you say, “didn’t you argue that ambitious value learning is unlikely to work?”

Well, yes. At the time that I was doing this work, I believed that ambitious value learning was the only option, and that it seemed hard but not doomed. This was the obvious thing to do to try and advance it. But this was over a year ago; the reason it’s only now coming out is that it took a while to publish the paper. (In fact, it predates my state of the world work.) But it’s true that now I’m not very hopeful about ambitious value learning, and so this paper’s contribution towards it doesn’t seem particularly valuable to me. However, a few others remain optimistic about ambitious value learning, and if they’re right, this research might be useful for that pathway to aligned AI.

I do think that the paper contributes to narrow value learning, and I still think that this very plausibly will be relevant to AI alignment. It’s a particularly direct attack on the specification problem, with the goal of inferring a specification that leads to a policy that would outperform the demonstrator. That said, I am no longer very optimistic about approaches that require a specific structure (in this case, world models fed into a differentiable planner with an inductive bias that then produces actions), and I am also less optimistic about using approaches that try to mimic expected value calculations, rather than trying to do something more like norm inference.

(However, I still expect that the impossibility result in preference learning will only be a problem in theory, not in practice. It’s just that this particular method of dealing with it doesn’t seem like it will work.)

Near-term AI issues

In the near term, we will need better ways than reward functions to specify to an AI system the behavior that we want. Inverse reinforcement learning is probably the leading example of how we could do this. However, since the specific algorithms require much better differentiable planners before they will perform on par with existing algorithms, it may be some time before they are useful. In addition, it’s probably better to use specific bias models in the near term. Overall, I think these methods or ideas are about as likely to be used in the near term as the average paper (which is to say, not very likely).


  1. A Value Iteration Network is a fully differentiable neural network that embeds an approximate value iteration algorithm inside a feed-forward classification network. ↩︎