[AN #110]: Learning features from human feedback to enable reward learning

Link post

Newsletter #110

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

Feature Expansive Reward Learning: Rethinking Human Input (Andreea Bobu, Marius Wiggert et al) (summarized by Rohin): One goal we might have with our algorithms is that after training, when the AI system is deployed with end users, the system would be personalized to those end users. You might hope that we could use deep inverse RL algorithms like AIRL (AN #17), but unfortunately they require a lot of data, which isn’t feasible for end users. You could use earlier IRL algorithms like MCEIRL (AN #12) that require you to specify what features of the environment you care about, but in practice you’ll never successfully write down all of these features. Can we somehow get the best of both worlds?

Past work (AN #28) made progress on this front, by allowing the agent to at least detect when it is missing some feature, by checking whether the human feedback is surprisingly inefficient given the existing features. But what do you do once you detect it? The key insight of this paper is that applying a deep IRL algorithm here would be inefficient because it has to implicitly learn the unknown feature, and we can do much better by explicitly querying the human for the unknown feature.

In particular, their method Feature Expansive Reward Learning (FERL) asks the human for a few feature traces: demonstrations in which the new feature’s value monotonically decreases. For example, suppose a robot arm carrying a cup of water gets too close to a laptop, but the arm doesn’t know the feature “close to a laptop”. Then a feature trace would start with the arm close to the laptop, and move it successively further away. Given a set of feature traces, we can convert this into a dataset of noisy comparisons, where earlier states are more likely to have higher feature values than later states, and use this to train a neural net to predict the feature value (similarly to the reward model in Deep RL from Human Preferences). We can then add this to our set of features, and learn rewards over the new set of features.
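
To make the pipeline concrete, here is a minimal sketch (not the authors’ implementation) of how feature traces could be converted into pairwise comparisons and used to fit a feature network with a preference-style loss; the state dimension, network architecture, and training settings are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of turning feature traces into noisy
# pairwise comparisons and fitting a feature network phi(s) with a
# Bradley-Terry style loss, as in Deep RL from Human Preferences.
# State dimension, network size, and optimizer settings are assumptions.

import torch
import torch.nn as nn

STATE_DIM = 7  # e.g. joint angles of a robot arm (assumed)

class FeatureNet(nn.Module):
    """Maps a raw state to a scalar feature value in [0, 1]."""
    def __init__(self, state_dim=STATE_DIM, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def comparisons_from_traces(traces):
    """Each trace is a list of state tensors along which the feature value
    monotonically decreases, so state i should score higher than state j > i."""
    pairs = []
    for trace in traces:
        for i in range(len(trace)):
            for j in range(i + 1, len(trace)):
                pairs.append((trace[i], trace[j]))  # (higher, lower)
    return pairs

def train_feature(traces, epochs=200, lr=1e-3):
    phi = FeatureNet()
    opt = torch.optim.Adam(phi.parameters(), lr=lr)
    pairs = comparisons_from_traces(traces)
    hi = torch.stack([p[0] for p in pairs])
    lo = torch.stack([p[1] for p in pairs])
    for _ in range(epochs):
        # Bradley-Terry: P(hi ranked above lo) = sigmoid(phi(hi) - phi(lo));
        # maximize its log-likelihood over all noisy comparisons.
        loss = -torch.log(torch.sigmoid(phi(hi) - phi(lo)) + 1e-8).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return phi  # add phi to the feature set used for reward learning
```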

They evaluate their method with a few human-robot interaction scenarios (though without a user study due to COVID), comparing it against deep MaxEnt IRL, and find that their method does better on a variety of metrics.

Rohin’s opinion: I really liked this paper—it seems like a far more efficient use of human feedback to figure out what features of the environment are important. This doesn’t need to be limited to reward learning: I expect that learning the right features to focus on would help with exploration in reinforcement learning, out-of-distribution generalization, etc. It also seems plausible that in more complex environments you could learn a set of features that was useful for all of these tasks, thus being somewhat general (though still specific to the environment).

It’s worth noting that in this setting you wouldn’t really want to use a vanilla deep IRL algorithm—you’d instead want to do something like meta-IRL.

TECHNICAL AI ALIGNMENT

ITERATED AMPLIFICATION

Parallels Between AI Safety by Debate and Evidence Law (Cullen O’Keefe) (summarized by Rohin): Debate (AN #86) requires us to provide a structure for a debate as well as rules for how the human judge should decide who wins. This post points out that we have an existing system that has been heavily optimized for this already: evidence law, which governs how court cases are run. A court case is high-stakes and involves two sides presenting opposing opinions; evidence law tells us how to structure these arguments and how to limit the kinds of arguments debaters can use. Evidence is generally admissible by default, but there are many exceptions, often based on the fallibility of fact-finders.

As a result, it may be fruitful to look to evidence law for how we might structure debates, and to see what types of arguments we should be looking for.

Rohin’s opinion: This seems eminently sensible to me. Of course, evidence law is going to be specialized to arguments about innocence or guilt of a crime, and may not generalize to what we would like to do with debate, but it still seems like we should be able to learn some generalizable lessons.

Weak HCH accesses EXP (Evan Hubinger) (summarized by Rohin): This followup to last week’s Alignment proposals and complexity classes (AN #109) shows that the amplification-based proposals can access EXP.

LEARNING HUMAN INTENT

Multi-Principal Assistance Games (Arnaud Fickinger et al) (summarized by Rohin): So far the work in the assistance games framework (AN #69) (previously called CIRL) has focused on the case where there is a single human and a single AI assistant. Once we have multiple humans (or principals, as the paper calls them), things get much trickier.

One problem is that we don’t know how to aggregate the values across different principals. Rather than taking a stance on the problem, this paper assumes that we have some mechanism that can combine reward functions in some reasonable way. It instead focuses on a second problem: while previously we could trust the human to report their preferences accurately (as the human and agent were aligned), when there are multiple principals whose preferences will be aggregated, the principals have an incentive to misrepresent their preferences (which we’ll call non-straightforward play).

Let’s consider the case where the principals provide demonstrations, and get reward for those demonstrations. For now our agent will assume that the principals are playing straightforwardly, and so the agent simply infers their preferences, aggregates them, and optimizes the results. In this setting, if the agent will act far more often than the principals provide demonstrations (so that the reward of the demonstrations is almost irrelevant), we can apply the Gibbard-Satterthwaite theorem to show that any non-trivial mechanism will be vulnerable to non-straightforward play. In contrast, if the principals provide lots of demonstrations, while the agent only acts for a short period of time, then optimal principals primarily want to ensure their demonstrations are good, and so will be straightforward most of the time (provably). In the middle, the fact that principals get rewarded for demonstrations does help reduce non-straightforward play, but does not eliminate it.

Now let’s consider the case where the agent can design a mechanism. Here, when the principals are providing demonstrations, the agent can override their action choice with one of its own (a setting considered previously (AN #70)). Roughly speaking, the algorithm only executes a proposed human action if it hasn’t executed it before. By doing so, it incentivizes the principals to report second-best actions, and so on, giving the agent more information about the principals’ utility functions. The mechanism incentivizes straightforward play, and is approximately efficient (i.e. there is a lower bound on the worst-case social welfare achieved).
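
The “override repeated actions” idea can be sketched roughly as follows (my own pseudocode illustrating the mechanism described above, not the paper’s algorithm verbatim; the per-state tracking, the fallback policy, and the hashable state representation are assumptions):

```python
# Rough sketch: a principal's proposed action is executed only if it has not
# been executed before in that state, so a principal who wants to keep
# influencing the agent must keep revealing their next-best actions, which is
# exactly the comparison information the agent needs for reward inference.

from collections import defaultdict

class OverridingMechanism:
    def __init__(self, fallback_policy):
        self.executed = defaultdict(set)            # state -> actions already executed
        self.fallback_policy = fallback_policy      # agent's own choice when it overrides (assumed)
        self.observed_rankings = defaultdict(list)  # state -> order in which actions were proposed

    def step(self, state, proposed_action):
        # Record the proposal: the order of proposals reveals a ranking over
        # alternatives, which feeds into inference of the principal's utility.
        self.observed_rankings[state].append(proposed_action)
        if proposed_action not in self.executed[state]:
            self.executed[state].add(proposed_action)
            return proposed_action               # execute the novel action
        # Action was already executed before: override with the agent's own choice.
        return self.fallback_policy(state)
```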

Rohin’s opinion: According to me, the main insight of this paper is that it is both necessary and difficult to design mechanisms that incentivize principals to report not just the best thing to do, but a comparison amongst different alternatives. Within the formalism of the paper, this is done by overriding a principal’s action unless it is a novel action, but I expect in practice we’ll do this in some other way (it seems rather unusual to imagine the agent overriding a human; I’d be surprised if that was how we ended up building our AI systems).

Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization (Paul Barde, Julien Roy, Wonseok Jeon et al) (summarized by Sudhanshu): This work aims to simplify algorithms for adversarial imitation learning by using a structured discriminator, which is parameterised by the current generator and a learned policy. They prove that if so formulated, the policy that yields the optimal discriminator is exactly the same as the policy that generated the expert data, which is also precisely what we hope the generator will learn. As long as the discriminator’s learned policy is parameterised correctly such that it can be sampled and evaluated, this eliminates the need for a reinforcement learning outer loop for policy improvement, as this learned policy can be substituted in for the generator’s policy in the next training iteration. They empirically show the competitiveness of their method with state-of-the-art algorithms across a small but increasingly complex suite of tasks.
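
As a rough sketch of the structured-discriminator idea (my paraphrase of the summary above; the paper’s exact parameterisation and training details may differ), the discriminator can be built directly from a learned policy and the current generator policy:

```python
# Sketch: the discriminator is D(s, a) = pi_theta(a|s) / (pi_theta(a|s) + pi_gen(a|s)),
# where pi_theta is a learned policy and pi_gen is the current generator policy.
# At the discriminator's optimum, pi_theta matches the expert policy, so it can be
# copied into the generator for the next iteration, with no RL inner loop.

import torch

def structured_discriminator(log_pi_theta, log_pi_gen):
    """D(s, a) from log-probabilities of the learned policy and the generator."""
    # sigmoid(log pi_theta - log pi_gen) = pi_theta / (pi_theta + pi_gen), computed stably.
    return torch.sigmoid(log_pi_theta - log_pi_gen)

def discriminator_loss(log_pi_theta_expert, log_pi_gen_expert,
                       log_pi_theta_gen, log_pi_gen_gen):
    """GAN-style binary cross-entropy: expert pairs labeled 1, generator pairs 0."""
    d_expert = structured_discriminator(log_pi_theta_expert, log_pi_gen_expert)
    d_gen = structured_discriminator(log_pi_theta_gen, log_pi_gen_gen)
    return -(torch.log(d_expert + 1e-8).mean()
             + torch.log(1.0 - d_gen + 1e-8).mean())

# After each discriminator update, the generator policy is set to pi_theta.
```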

Sudhanshu’s opinion: Since their theoretical results are only for optimal values, it’s unclear whether starting from random initial policies will necessarily converge to these optimal values—indeed, they make this point themselves, that they do not train to convergence as gradient descent cannot hope to find the global optimum for GAN-like non-convex loss functions. In light of that, it’s not evident why their algorithms outperform the competition. Additionally, they do not report computational speed-up or wall-clock comparisons, which to me felt like the broad motivation behind this work. Nonetheless, the work illuminates new territory in adversarial imitation learning, provides positive evidence for a novel technique, and raises interesting questions for future work, such as how to learn robust reward functions via this method, or what kind of convergence properties can be expected.

Explanation Augmented Feedback in Human-in-the-Loop Reinforcement Learning (Lin Guan, Mudit Verma et al) (summarized by Rohin): This paper starts from a similar position as the highlighted paper: that we can improve on algorithms by having humans provide different kinds of feedback that help with learning. They ask humans to provide “explanations” to improve sample efficiency in deep RL, which in this case means asking a human to segment parts of the image observation that are important (similar to a saliency map). They use this to define auxiliary losses that incentivize the agent to be invariant to augmentations of the irrelevant parts of the image. Their empirical evaluation shows improvements in sample efficiency relative to simple good/bad evaluative feedback.
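
An invariance-style auxiliary loss of this kind might look roughly like the following sketch (not the authors’ code; the Q-network interface and the noise model for perturbing the irrelevant pixels are assumptions):

```python
# Illustrative sketch: given a binary saliency mask from the human, perturb only
# the irrelevant pixels and penalize any change in the agent's Q-values, so the
# agent learns to ignore the regions the human marked as unimportant.

import torch

def irrelevance_invariance_loss(q_network, obs, saliency_mask, noise_scale=0.1):
    """obs: (B, C, H, W) image batch; saliency_mask: (B, 1, H, W), 1 = relevant."""
    noise = noise_scale * torch.randn_like(obs)
    # Perturb only the parts of the observation the human marked as irrelevant.
    augmented = obs + noise * (1.0 - saliency_mask)
    q_original = q_network(obs)
    q_augmented = q_network(augmented)
    # The agent's outputs should not depend on the irrelevant regions.
    return torch.mean((q_original - q_augmented) ** 2)

# total_loss = rl_loss + lambda_expl * irrelevance_invariance_loss(q_net, obs, mask)
```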

Rohin’s opinion: The idea is cool, but the empirical results are not great. On Taxi, training with the reward signal and binary good/bad evaluative feedback takes 180k environment steps, and adding in explanations for a quarter of the steps brings it down to 130k environment steps. However, this seems like it would increase the human effort required by an order of magnitude or more, which seems way too high for the benefit provided.

It does seem to me that saliency explanations could contain a fair amount of information, and so you should be able to do better—maybe a future algorithm will do so.

FORECASTING

Alignment As A Bottleneck To Usefulness Of GPT-3 (John S. Wentworth) (summarized by Rohin): Currently, many people are trying to figure out how to prompt GPT-3 into doing what they want—in other words, how to align GPT-3 with their desires. GPT-3 may be capable of the task, but that doesn’t mean it will do it (potential example). This suggests that alignment will soon be a bottleneck on our ability to get value from large language models.

Certainly GPT-3 isn’t perfectly capable yet. The author thinks that in the immediate future the major bottleneck will still be its capability, but we have a clear story for how to improve its capabilities: just scale up the model and data even more. Alignment, on the other hand, is much harder: we don’t know how to translate (AN #94) the tasks we want into a format that will cause GPT-3 to “try” to accomplish that task.

As a result, in the future we might expect a lot more work to go into prompt design (or whatever becomes the next way to direct language models at specific tasks). In addition, once GPT is better than humans (at least in some domains), alignment in those domains will be particularly difficult, as it is unclear how you would get a system trained to mimic humans to do better than humans (AN #31).

Rohin’s opinion: The general point of this post seems clearly correct and worth pointing out. I’m looking forward to the work we’ll see in the future figuring out how to apply these broad and general methods to real tasks in a reliable way.

MISCELLANEOUS (ALIGNMENT)

Generalizing the Power-Seeking Theorems (Alex Turner) (summarized by Rohin): Previously (AN #78) we’ve seen that if we take an MDP, and have a distribution over state-based reward functions, such that the reward for two different states is iid, then farsighted (i.e. no discount) optimal agents tend to seek “power”. This post relaxes some of these requirements, giving sufficient (but not necessary) criteria for determining instrumental convergence.

Some of these use a new kind of argument. Suppose that action A leads you to a part of the MDP modeled by a graph G1, and B leads you to a part of the MDP modeled by a graph G2. If there is a subgraph of G2 that is isomorphic to G1, then we know that whatever kinds of choices the agent would have by taking action A, the agent would also have those choices from action B, and so we know B is at least as likely to be optimal as A. This matches our intuitive reasoning—collecting resources is instrumentally convergent because you can do the same things that you could if you didn’t collect resources, as well as some additional things enabled by your new resources.
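
Here is a toy numerical illustration of that argument (my own example, not the post’s formal setup): if action B’s options contain a copy of action A’s options plus one more, then under iid rewards B is optimal at least as often as A.

```python
# Toy illustration: action A leads to 2 possible outcomes, action B leads to 3
# disjoint outcomes, two of which mirror A's. Drawing a reward iid per outcome,
# B should be optimal at least as often as A (here, 3/5 of the time).

import random

def simulate(num_trials=100_000, seed=0):
    rng = random.Random(seed)
    outcomes_A = ["a1", "a2"]
    outcomes_B = ["b1", "b2", "b3"]  # contains a copy of A's option set, plus one extra
    b_optimal = 0
    for _ in range(num_trials):
        reward = {s: rng.random() for s in outcomes_A + outcomes_B}  # iid rewards
        if max(reward[s] for s in outcomes_B) >= max(reward[s] for s in outcomes_A):
            b_optimal += 1
    return b_optimal / num_trials

print(simulate())  # roughly 0.6 = 3/5: the action with more options is optimal more often
```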

AI STRATEGY AND POLICY

AI Benefits (Cullen O’Keefe) (summarized by Rohin): This sequence of posts investigates AI Benefits: how a benefactor can leverage advanced AI systems to benefit humanity. It focuses on what can be done by a single benefactor, outside of what we might think of as the “norm”—in particular, the sequence ignores benefits that would be provided by default market incentives. This is relevant to OpenAI (where the author works) given their focus on ensuring AI is beneficial to humanity.

Note that AI Benefits is distinct from AI alignment. Sometimes AI alignment is defined broadly enough to encompass AI Benefits, but often it is not, e.g. if the notion of being “aligned” depends on an AI system being aligned with some principal, that would not be AI Benefits, since AI Benefits are meant to accrue to all of humanity. While AI Benefits is by default about maximizing well-being, it should also have secondary goals of equality, autonomy, democratization, and epistemic modesty.

The obvious approach to AI Benefits is the direct approach: figuring out how to apply advanced AI to directly generate benefits for humanity, e.g. by producing electricity more efficiently to mitigate climate change. However, it is important to also consider the indirect approach of making money using AI, and then donating the surplus to a different organization that can better produce benefits.

Given the massive number of potential ways to benefit humanity and our uncertainty about their efficacy, it is important to have a portfolio approach to AI Benefits, rather than scaling up a single intervention. In addition, since any given intervention will probably primarily benefit some subset of humanity, a portfolio approach should help lead to a more equal distribution of benefits.

There are many outstanding questions on how AI Benefits should be done in practice. Should the benefactor pursue a direct or indirect approach? To what extent should they explore potential approaches for generating benefits, relative to exploiting approaches that we know work? Should they generate benefits now, or invest in the ability to generate benefits later? Should they focus on global (supranational) approaches, or allocate resources to each nation that can be used in a manner specialized to their citizens?

There are many questions on the governance side as well. We will presumably want some Benefits Group involving external experts to help distribute benefits optimally. When should such a group get democratic input? How do we evaluate such a group to ensure they are actually benefiting humanity optimally? To what extent will we also need internal governance within the group and benefactor, and how can this be done?

Rohin’s opinion: AI Benefits is effectively asking how to do the most good in the future, and as such many of the considerations also come up in effective altruism, especially at the current high level of abstraction. Nonetheless, there are differences in the situation, which will matter: for example, the effective altruism community does not currently need to plan for the situation where they control a majority of the world’s resources; a sufficiently ambitious and optimistic AI company may need to. Such a situation vastly increases the importance of e.g. democratic input, portfolio approaches, and information value. I’m glad that these questions are being tackled now and look forward to seeing more details in the future.

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

An Optimistic Perspective on Offline Reinforcement Learning (Rishabh Agarwal et al) (summarized by Zach): Off-policy reinforcement learning (RL) from offline logged interactions is important for real-world applications. However, most RL algorithms assume that an agent interacts with an online environment or simulator and learns from its own collected experience. Moreover, the authors show that DQN trained offline on its own experience replay buffer has markedly decreased performance on most of the Atari suite. The authors attempt to address this discrepancy by introducing a robust Q-learning algorithm that randomly mixes estimates for particular Q-values. Specifically, by creating convex combinations from an underlying basis of Q-value estimates, the authors are able to create a much larger ensemble. This is similar in spirit to dropout in deep learning, where connections in the network are randomly turned off. The authors then go on to show that offline DQN is feasible: they train this algorithm and other related algorithms on the DQN Replay Dataset and show that they match, and occasionally even surpass, the original RL baselines. The DQN Replay Dataset is released at https://offline-rl.github.io/.
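
The random-mixture idea (called Random Ensemble Mixture, REM, in the paper) can be sketched as follows. This is a minimal sketch, not the paper’s implementation; the architecture sizes, the Dirichlet draw, and sharing a single mixture across the batch are simplifying assumptions.

```python
# Sketch of a random-ensemble-mixture Q-learning update: maintain K Q-value
# heads and, on every training step, train a random convex combination of them
# against the corresponding mixed bootstrap target.

import torch
import torch.nn as nn

class MultiHeadQ(nn.Module):
    def __init__(self, obs_dim, num_actions, num_heads=4, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, num_actions) for _ in range(num_heads)]
        )

    def forward(self, obs):
        h = self.trunk(obs)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, K, A)

def rem_loss(q_net, target_net, batch, gamma=0.99):
    obs, action, reward, next_obs, done = batch
    num_heads = len(q_net.heads)
    # Random convex combination over the K heads (shared across the batch here).
    alpha = torch.distributions.Dirichlet(torch.ones(num_heads)).sample()  # (K,)
    q_all = q_net(obs)                                   # (B, K, A)
    q_mix = (alpha.view(1, -1, 1) * q_all).sum(dim=1)    # (B, A)
    q_taken = q_mix.gather(1, action.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_all = target_net(next_obs)                  # (B, K, A)
        next_mix = (alpha.view(1, -1, 1) * next_all).sum(dim=1)
        target = reward + gamma * (1 - done) * next_mix.max(dim=1).values
    return nn.functional.smooth_l1_loss(q_taken, target)
```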

Zach’s opinion: What I learned from this paper is that estimating the mean Q-value is not always enough for robustness. By leveraging distributional information, via ensembles or quantiles, these methods can become quite effective at offline DQN. The release of the dataset is also impressive. I think the dataset will have broad applicability to researchers interested in offline RL as well as imitation learning.

FEEDBACK

I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.