Alignment Newsletter #42

Co­op­er­a­tive IRL as a defi­ni­tion of hu­man-AI group ra­tio­nal­ity, and an em­piri­cal eval­u­a­tion of the­ory of mind vs. model learn­ing in HRI

Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can sign up, or look through this spread­sheet of all sum­maries that have ever been in the newslet­ter.


AI Align­ment Pod­cast: Co­op­er­a­tive In­verse Re­in­force­ment Learn­ing (Lu­cas Perry and Dy­lan Had­field-Menell) (sum­ma­rized by Richard): Dy­lan puts for­ward his con­cep­tion of Co­op­er­a­tive In­verse Re­in­force­ment Learn­ing as a defi­ni­tion of what it means for a hu­man-AI sys­tem to be ra­tio­nal, given the in­for­ma­tion bot­tle­neck be­tween a hu­man’s prefer­ences and an AI’s ob­ser­va­tions. He notes that there are some clear mis­matches be­tween this prob­lem and re­al­ity, such as the CIRL as­sump­tion that hu­mans have static prefer­ences, and how fuzzy the ab­strac­tion of “ra­tio­nal agents with util­ity func­tions” be­comes in the con­text of agents with bounded ra­tio­nal­ity. Nev­er­the­less, he claims that this is a use­ful unify­ing frame­work for think­ing about AI safety.

Dy­lan ar­gues that the pro­cess by which a robot learns to ac­com­plish tasks is best de­scribed not just as max­imis­ing an ob­jec­tive func­tion but in­stead in a way which in­cludes the sys­tem de­signer who se­lects and mod­ifies the op­ti­mi­sa­tion al­gorithms, hy­per­pa­ram­e­ters, etc. In fact, he claims, it doesn’t make sense to talk about how well a sys­tem is do­ing with­out talk­ing about the way in which it was in­structed and the type of in­for­ma­tion it got. In CIRL, this is mod­eled via the com­bi­na­tion of a “teach­ing strat­egy” and a “learn­ing strat­egy”. The former can take many forms: pro­vid­ing rank­ings of op­tions, or demon­stra­tions, or bi­nary com­par­i­sons, etc. Dy­lan also men­tions an ex­ten­sion of this in which the teacher needs to learn their own val­ues over time. This is use­ful for us be­cause we don’t yet un­der­stand the nor­ma­tive pro­cesses by which hu­man so­cieties come to moral judge­ments, or how to in­te­grate ma­chines into that pro­cess.

On the Utility of Model Learn­ing in HRI (Ro­han Choud­hury, Gokul Swamy et al): In hu­man-robot in­ter­ac­tion (HRI), we of­ten re­quire a model of the hu­man that we can plan against. Should we use a spe­cific model of the hu­man (a so-called “the­ory of mind”, where the hu­man is ap­prox­i­mately op­ti­miz­ing some un­known re­ward), or should we sim­ply learn a model of the hu­man from data? This pa­per pre­sents em­piri­cal ev­i­dence com­par­ing three al­gorithms in an au­tonomous driv­ing do­main, where a robot must drive alongside a hu­man.

The first al­gorithm, called The­ory of Mind based learn­ing, mod­els the hu­man us­ing a the­ory of mind, in­fers a hu­man re­ward func­tion, and uses that to pre­dict what the hu­man will do, and plans around those ac­tions. The sec­ond al­gorithm, called Black box model-based learn­ing, trains a neu­ral net­work to di­rectly pre­dict the ac­tions the hu­man will take, and plans around those ac­tions. The third al­gorithm, model-free learn­ing, sim­ply ap­plies Prox­i­mal Policy Op­ti­miza­tion (PPO), a deep RL al­gorithm, to di­rectly pre­dict what ac­tion the robot should take, given the cur­rent state.

Quot­ing from the ab­stract, they “find that there is a sig­nifi­cant sam­ple com­plex­ity ad­van­tage to the­ory of mind meth­ods and that they are more ro­bust to co­vari­ate shift, but that when enough in­ter­ac­tion data is available, black box ap­proaches even­tu­ally dom­i­nate”. They also find that when the ToM as­sump­tions are sig­nifi­cantly vi­o­lated, then the black-box model-based al­gorithm will vastly sur­pass ToM. The model-free learn­ing al­gorithm did not work at all, prob­a­bly be­cause it can­not take ad­van­tage of knowl­edge of the dy­nam­ics of the sys­tem and so the learn­ing prob­lem is much harder.

Rohin’s opinion: I’m always happy to see an experimental paper that tests how algorithms perform; I think we need more of these.

You might be tempted to think of this as evidence that in deep RL, a model-based method should outperform a model-free one. This isn’t exactly right. The ToM and black box model-based algorithms use an exact model of the dynamics of the environment modulo the human, that is, they can exactly predict the next state given the current state, the robot action, and the human action. The model-free algorithm must learn this from scratch, so it isn’t an apples-to-apples comparison. (Typically in deep RL, both model-based and model-free algorithms have to learn the environment dynamics.) However, you can think of the ToM as a model-based method and the black box model-based algorithm as a model-free algorithm, where both algorithms have to learn the human model instead of the more traditional environment dynamics. With that analogy, you would conclude that model-based algorithms will be more sample efficient and more performant in low-data regimes, but will be outperformed by model-free algorithms with sufficient data, which agrees with my intuitions.

This kind of effect is a ma­jor rea­son for my po­si­tion that the first pow­er­ful AI sys­tems will be mod­u­lar (analo­gous to model-based sys­tems), but that they will even­tu­ally be re­placed by more in­te­grated, end-to-end sys­tems (analo­gous to model-free sys­tems). Ini­tially, we will be in a (rel­a­tively speak­ing) low-data regime, where mod­u­lar sys­tems ex­cel, but over time there will be more data and com­pute and we will tran­si­tion to regimes where end-to-end sys­tems will solve the same tasks bet­ter (though we may then have mod­u­lar sys­tems for more difficult tasks).

Tech­ni­cal AI alignment

Iter­ated am­plifi­ca­tion sequence

The re­ward en­g­ineer­ing prob­lem (Paul Chris­ti­ano): The re­ward en­g­ineer­ing prob­lem is the prob­lem of de­sign­ing some form of re­ward feed­back such that if the agent learns to get high re­ward, then we are happy with the re­sult­ing be­hav­ior. We as­sume that an over­seer H is pro­vid­ing re­ward for an agent A, with H smarter than A.

(Aside: Why are we as­sum­ing H is smarter, when we are typ­i­cally wor­ried about AI that is more in­tel­li­gent than we are? Be­cause we could hope to use AI tools to cre­ate this re­ward func­tion, so that the hu­man + AI team is smarter than the AI be­ing trained. In iter­ated am­plifi­ca­tion, this cor­re­sponds to the dis­til­la­tion step.)

We could imag­ine three ap­proaches: di­rect su­per­vi­sion (where H pro­vides a re­ward for ev­ery ac­tion A takes), imi­ta­tion learn­ing (where A tries to mimic H, per­haps by try­ing to fool a dis­crim­i­na­tor that dis­t­in­guishes be­tween A and H), and in­verse re­in­force­ment learn­ing (where we in­fer a re­ward func­tion by ob­serv­ing H’s be­hav­ior and use that to su­per­vise A).

With all of these ap­proaches, we might hope that since H is smarter than A, A will not be able to trick H with some be­hav­ior that gets high re­ward but is ac­tu­ally un­de­sir­able. How­ever, this ar­gu­ment has a ma­jor flaw. When A chooses an ac­tion, the pro­cess or com­pu­ta­tion by which it chooses that ac­tion is im­por­tant in­for­ma­tion, which by de­fault (es­pe­cially with neu­ral nets) only A has ac­cess to, and so this can al­low it to fool H. For ex­am­ple, if A is tasked with draw­ing origi­nal art­work, it might copy and slightly change ex­ist­ing art­work, which H may not re­al­ize if it can­not see how the art­work was made. We could hope to fix this prob­lem with trans­parency or ex­pla­na­tions, but this re­quires a lot more re­search.

Imi­ta­tion learn­ing and IRL have the prob­lem that A may not be ca­pa­ble of do­ing what H does. In that case, it will be off-dis­tri­bu­tion and may have weird be­hav­ior. Direct su­per­vi­sion doesn’t suffer from this prob­lem, but it is very time-in­effi­cient. This could po­ten­tially be fixed us­ing semi-su­per­vised learn­ing tech­niques.

Ro­hin’s opinion: The in­for­ma­tion asym­me­try prob­lem be­tween H and A seems like a ma­jor is­sue. For me, it’s the strongest ar­gu­ment for why trans­parency is a nec­es­sary in­gre­di­ent of a solu­tion to al­ign­ment. The ar­gu­ment against imi­ta­tion learn­ing and IRL is quite strong, in the sense that it seems like you can’t rely on ei­ther of them to cap­ture the right be­hav­ior. Th­ese are stronger than the ar­gu­ments against am­bi­tious value learn­ing (AN #31) be­cause here we as­sume that H is smarter than A, which we could not do with am­bi­tious value learn­ing. So it does seem to me that di­rect su­per­vi­sion (with semi-su­per­vised tech­niques and ro­bust­ness) is the most likely path for­ward to solv­ing the re­ward en­g­ineer­ing prob­lem.

There is also the ques­tion of whether it is nec­es­sary to solve the re­ward en­g­ineer­ing prob­lem. It cer­tainly seems nec­es­sary in or­der to im­ple­ment iter­ated am­plifi­ca­tion given cur­rent sys­tems (where the dis­til­la­tion step will be im­ple­mented with op­ti­miza­tion, which means that we need a re­ward sig­nal), but might not be nec­es­sary if we move away from op­ti­miza­tion or if we build sys­tems us­ing some tech­nique other than iter­ated am­plifi­ca­tion (though even then it seems very use­ful to have a good re­ward en­g­ineer­ing solu­tion).

Capability amplification (Paul Christiano): Capability amplification is the problem of taking some existing policy and producing a better policy, perhaps using much more time and compute. It is a particularly interesting problem to study because it could be used to define the goals of a powerful AI system, and it could be combined with reward engineering above to create a powerful aligned system. (Capability amplification and reward engineering are analogous to amplification and distillation respectively.) In addition, capability amplification seems simpler than the general problem of “build an AI that does the right thing”, because we get to start with a weak policy A rather than nothing, and we’re allowed to take lots of time and computation to implement the better policy. It would be useful to tell whether the “hard part” of value alignment is in capability amplification, or somewhere else.

We can eval­u­ate ca­pa­bil­ity am­plifi­ca­tion us­ing the con­cepts of reach­a­bil­ity and ob­struc­tions. A policy C is reach­able from an­other policy A if there is some chain of poli­cies from A to C, such that at each step ca­pa­bil­ity am­plifi­ca­tion takes you from the first policy to some­thing at least as good as the sec­ond policy. Ideally, all poli­cies would be reach­able from some very sim­ple policy. This is im­pos­si­ble if there ex­ists an ob­struc­tion, that is a par­ti­tion of poli­cies into two sets L and H, such that it is im­pos­si­ble to am­plify any policy in L to get a policy that is at least as good as some policy in H. In­tu­itively, an ob­struc­tion pre­vents us from get­ting to ar­bi­trar­ily good be­hav­ior, and means that all of the poli­cies in H are not reach­able from any policy in L.
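To make reachability and obstructions concrete, here is a minimal sketch that treats policies as nodes in a graph. Everything in it is an assumption for illustration: the policy names are made up, and the hypothetical `amplify` table collapses "amplification gets you something at least as good as this policy" into a plain edge, which is much cruder than the definition in the post.

```python
from collections import deque

def reachable_from(start, amplify):
    """Return every policy reachable from `start` by a chain of amplification steps.

    `amplify` is a hypothetical table: policy -> policies that one
    amplification step can match or exceed.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        policy = queue.popleft()
        for better in amplify.get(policy, ()):
            if better not in seen:
                seen.add(better)
                queue.append(better)
    return seen

def is_obstruction(low, high, amplify):
    """True if no single amplification step leads from a policy in `low` into `high`."""
    return all(nxt not in high for p in low for nxt in amplify.get(p, ()))

# Toy amplification table (made-up names, not from the post).
amplify = {
    "random": ["novice"],
    "novice": ["competent"],
    "competent": ["competent"],  # amplification stalls here
    "expert": ["superhuman"],
}

reached = reachable_from("random", amplify)
low = {"random", "novice", "competent"}
high = {"expert", "superhuman"}
blocked = is_obstruction(low, high, amplify)
```

In this toy table the partition into `low` and `high` is an obstruction, so nothing in `high` is reachable from the simple "random" policy, even though "superhuman" is reachable from "expert".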

We can do fur­ther work on ca­pa­bil­ity am­plifi­ca­tion. With the­ory, we can search for challeng­ing ob­struc­tions, and de­sign pro­ce­dures that over­come them. With ex­per­i­ment, we can study ca­pa­bil­ity am­plifi­ca­tion with hu­mans (some­thing which Ought is now do­ing).

Ro­hin’s opinion: There’s a clear rea­son for work on ca­pa­bil­ity am­plifi­ca­tion: it could be used as a part of an im­ple­men­ta­tion of iter­ated am­plifi­ca­tion. How­ever, this post also sug­gests an­other rea­son for such work—it may help us de­ter­mine where the “hard part” of AI safety lies. Does it help to as­sume that you have lots of time and com­pute, and that you have ac­cess to a weaker policy?

Cer­tainly if you just have ac­cess to a weaker policy, this doesn’t make the prob­lem any eas­ier. If you could take a weak policy and am­plify it into a stronger policy effi­ciently, then you could just re­peat­edly ap­ply this policy-im­prove­ment op­er­a­tor to some very weak base policy (say, a neu­ral net with ran­dom weights) to solve the full prob­lem. (In other var­i­ants, you have a much stronger al­igned base policy, eg. the hu­man policy with short in­puts and over a short time hori­zon; in that case this as­sump­tion is more pow­er­ful.) The more in­ter­est­ing as­sump­tion is that you have lots of time and com­pute, which does seem to have a lot of po­ten­tial. I feel pretty op­ti­mistic that a hu­man think­ing for a long time could reach “su­per­hu­man perfor­mance” by our cur­rent stan­dards; ca­pa­bil­ity am­plifi­ca­tion asks if we can do this in a par­tic­u­lar struc­tured way.

Value learn­ing sequence

Re­ward un­cer­tainty (Ro­hin Shah): Given that we need hu­man feed­back for the AI sys­tem to stay “on track” as the en­vi­ron­ment changes, we might de­sign a sys­tem that keeps an es­ti­mate of the re­ward, chooses ac­tions that op­ti­mize that re­ward, but also up­dates the re­ward over time based on feed­back. This has a few is­sues: it typ­i­cally as­sumes that the hu­man Alice knows the true re­ward func­tion, it makes a pos­si­bly-in­cor­rect as­sump­tion about the mean­ing of Alice’s feed­back, and the AI sys­tem still looks like a long-term goal-di­rected agent where the goal is the cur­rent re­ward es­ti­mate.

This post takes the above AI sys­tem and con­sid­ers what hap­pens if you have a dis­tri­bu­tion over re­ward func­tions in­stead of a point es­ti­mate, and dur­ing ac­tion se­lec­tion you take into ac­count fu­ture up­dates to the dis­tri­bu­tion. (This is the setup of Co­op­er­a­tive In­verse Re­in­force­ment Learn­ing.) While we still as­sume that Alice knows the true re­ward func­tion, and we still re­quire an as­sump­tion about the mean­ing of Alice’s feed­back, the re­sult­ing sys­tem looks less like a goal-di­rected agent.

In par­tic­u­lar, the sys­tem no longer has an in­cen­tive to dis­able the sys­tem that learns val­ues from feed­back: while pre­vi­ously it changed the AI sys­tem’s goal (a nega­tive effect from the goal’s per­spec­tive), now it pro­vides more in­for­ma­tion about the goal (a pos­i­tive effect). In ad­di­tion, the sys­tem has more of an in­cen­tive to let it­self be shut down. If a hu­man is about to shut it down, it should up­date strongly that what­ever it was do­ing was very bad, caus­ing a dras­tic up­date on re­ward func­tions. It may still pre­vent us from shut­ting it down, but it will at least stop do­ing the bad thing. Even­tu­ally, af­ter gath­er­ing enough in­for­ma­tion, it would con­verge on the true re­ward and do the right thing. Of course, this is as­sum­ing that the space of re­wards is well-speci­fied, which will prob­a­bly not be true in prac­tice.
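The shutdown argument can be illustrated with a small Bayesian update over a finite set of candidate rewards. This is only a sketch under assumptions that are not in the post: the two reward hypotheses, the numbers, and the feedback model `p_shutdown_given_rush` are all invented, and the action choice here is greedy under the current belief rather than planning over future updates as the full setup would.

```python
def update_posterior(prior, likelihood):
    """Bayes rule over a finite set of candidate reward functions."""
    unnormalized = {r: prior[r] * likelihood[r] for r in prior}
    total = sum(unnormalized.values())
    return {r: w / total for r, w in unnormalized.items()}

def best_action(belief, rewards):
    """Pick the action with the highest expected reward under `belief`."""
    actions = next(iter(rewards.values())).keys()
    return max(actions, key=lambda a: sum(belief[r] * rewards[r][a] for r in belief))

# Hypothetical reward hypotheses: reward for each action under each one.
rewards = {
    "wants_speed": {"rush": 1.0, "careful": 0.2},
    "wants_safety": {"rush": -1.0, "careful": 0.5},
}
prior = {"wants_speed": 0.8, "wants_safety": 0.2}

# Assumed feedback model: how likely the human is to reach for the off
# switch while the system is rushing, under each reward hypothesis.
p_shutdown_given_rush = {"wants_speed": 0.05, "wants_safety": 0.9}

before = best_action(prior, rewards)
posterior = update_posterior(prior, p_shutdown_given_rush)
after = best_action(posterior, rewards)
```

With these made-up numbers the system starts out rushing, sees the shutdown attempt, shifts most of its belief to "wants_safety", and switches to the careful action. The result depends entirely on the assumed feedback model, which is exactly the assumption about the meaning of Alice’s feedback that the post flags.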

Following human norms (Rohin Shah): One approach to preventing catastrophe is to constrain the AI system to never take catastrophic actions, and not focus as much on what to do (which will be solved by progress in AI more generally). In this setting, we hope that our AI systems accelerate our rate of progress, but we remain in control and use AI systems as tools that allow us to make better decisions and better technologies. Impact measures / side effect penalties aim to define what not to do. What if we instead learn what not to do? This could look like inferring and following human norms, along the lines of ad hoc teamwork.

This is different from narrow value learning for a few reasons. First, narrow value learning also learns what to do. Second, it seems likely that norm inference only gives good results in the context of groups of agents, while narrow value learning could be applied in single-agent settings.

The main advantage of learning norms is that this is something that humans do quite well, so it may be significantly easier than learning “values”. In addition, this approach is very similar to our ways of preventing humans from doing catastrophic things: there is a shared, external system of norms that everyone is expected to follow. However, norm following is a weaker standard than ambitious value learning (AN #31), and there are more problems as a result. Most notably, powerful AI systems will lead to rapidly evolving technologies that cause big changes in the environment that might require new norms; norm-following AI systems may not be able to create or adapt to these new norms.

Agent foundations

CDT Dutch Book (Abram Dem­ski)

CDT=EDT=UDT (Abram Dem­ski)

Learn­ing hu­man intent

AI Align­ment Pod­cast: Co­op­er­a­tive In­verse Re­in­force­ment Learn­ing (Lu­cas Perry and Dy­lan Had­field-Menell): Sum­ma­rized in the high­lights!

On the Utility of Model Learn­ing in HRI (Ro­han Choud­hury, Gokul Swamy et al): Sum­ma­rized in the high­lights!

What AI Safety Re­searchers Have Writ­ten About the Na­ture of Hu­man Values (av­turchin): This post cat­e­go­rizes the­o­ries of hu­man val­ues along three axes. First, how com­plex is the de­scrip­tion of the val­ues? Se­cond, to what ex­tent are “val­ues” defined as a func­tion of be­hav­ior (as op­posed to be­ing a func­tion of eg. the brain’s al­gorithm)? Fi­nally, how broadly ap­pli­ca­ble is the the­ory: could it ap­ply to ar­bi­trary minds, or only to hu­mans? The post then sum­ma­rizes differ­ent po­si­tions on hu­man val­ues that differ­ent re­searchers have taken.

Ro­hin’s opinion: I found the cat­e­go­riza­tion use­ful for un­der­stand­ing the differ­ences be­tween views on hu­man val­ues, which can be quite varied and hard to com­pare.

Risk-Aware Ac­tive In­verse Re­in­force­ment Learn­ing (Daniel S. Brown, Yuchen Cui et al): This pa­per pre­sents an al­gorithm that ac­tively so­lic­its demon­stra­tions on states where it could po­ten­tially be­have badly due to its un­cer­tainty about the re­ward func­tion. They use Bayesian IRL as their IRL al­gorithm, so that they get a dis­tri­bu­tion over re­ward func­tions. They use the most likely re­ward to train a policy, and then find a state from which that policy has high risk (be­cause of the un­cer­tainty over re­ward func­tions). They show in ex­per­i­ments that this performs bet­ter than other ac­tive IRL al­gorithms.
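The selection step might look roughly like the sketch below. It is not the paper’s algorithm: it assumes a one-step setting where each state has its own action rewards, takes a handful of reward samples as given rather than running Bayesian IRL, uses a simple quantile of the forgone reward as the risk measure, and enumerates every state, which only works for tiny state spaces. All names and numbers are hypothetical.

```python
def greedy_policy(reward):
    """One-step greedy policy: in each state, take the highest-reward action."""
    return {state: max(acts, key=acts.get) for state, acts in reward.items()}

def policy_loss(reward, policy, state):
    """Reward forgone in `state` by following `policy` if `reward` were the true one."""
    acts = reward[state]
    return max(acts.values()) - acts[policy[state]]

def riskiest_state(samples, map_reward, quantile=0.9):
    """Return the state (and its risk) where the policy trained on the most
    likely reward has the largest loss at the given quantile over samples."""
    policy = greedy_policy(map_reward)

    def risk(state):
        losses = sorted(policy_loss(r, policy, state) for r in samples)
        index = min(len(losses) - 1, int(quantile * len(losses)))
        return losses[index]

    state = max(map_reward, key=risk)
    return state, risk(state)

# Hypothetical posterior samples over rewards; the first is taken to be the most likely.
samples = [
    {"hall": {"fast": 1.0, "slow": 0.5}, "stairs": {"fast": 1.0, "slow": 0.8}},
    {"hall": {"fast": 1.0, "slow": 0.6}, "stairs": {"fast": -2.0, "slow": 0.8}},
    {"hall": {"fast": 0.9, "slow": 0.5}, "stairs": {"fast": 1.0, "slow": 0.7}},
]
query_state, query_risk = riskiest_state(samples, samples[0])
```

Here the policy trained on the most likely reward goes fast everywhere, but one sample says going fast on the stairs is very bad, so "stairs" is the state where a demonstration would be requested.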

Rohin’s opinion: I don’t fully understand this paper: how exactly are they searching over states, when there are exponentially many of them? Are they sampling them somehow? It’s definitely possible that this is in the paper and I missed it; I did skim it fairly quickly.

Other progress in AI

Re­in­force­ment learning

Soft Ac­tor-Critic: Deep Re­in­force­ment Learn­ing for Robotics (Tuo­mas Haarnoja et al)

Deep learning

A Com­pre­hen­sive Sur­vey on Graph Neu­ral Net­works (Zong­han Wu et al)

Graph Neu­ral Net­works: A Re­view of Meth­ods and Ap­pli­ca­tions (Jie Zhou, Ganqu Cui, Zhengyan Zhang et al)


Olsson to Join the Open Philanthropy Project (summarized by Dan H): Catherine Olsson, a researcher at Google Brain who was previously at OpenAI, will be joining the Open Philanthropy Project to focus on grant making for reducing x-risk from advanced AI. Given her first-hand research experience, she has knowledge of the dynamics of research groups and a nuanced understanding of various safety subproblems. Congratulations to both her and OpenPhil.

An­nounce­ment: AI al­ign­ment prize round 4 win­ners (cousin_it): The last iter­a­tion of the AI al­ign­ment prize has con­cluded, with awards of $7500 each to Pe­nal­iz­ing Im­pact via At­tain­able Utility Preser­va­tion (AN #39) and Embed­ded Agency (AN #31, AN #32), and $2500 each to Ad­dress­ing three prob­lems with coun­ter­fac­tual cor­rigi­bil­ity (AN #30) and Three AI Safety Re­lated Ideas/​Two Ne­glected Prob­lems in Hu­man-AI Safety (AN #38).