Human-Aligned AI Summer School: A Summary

(Disclaimer: this summary is incomplete and does not accurately represent all the content presented at the summer school, but only what I remember and seem to have understood from the lectures. Don't hesitate to mention important ideas I missed or any apparent confusion.)

Last week, I attended the first edition of the Human-Aligned AI Summer School in Prague. After three days, my memories are already starting to fade, and I am unsure about what I will retain in the long term.

Here, I try to remember the content of about 15 hours of talks. This summary serves the following purposes:

  • For the general audience that did not attend the school, I try to give an overview of the general trends we discussed.

  • For those who attended the school, I distill what I understood, to refresh our memories.

Value Learning (Daniel Filan)

Value learning aims at inferring human values from human behavior. Paul Christiano distinguishes between ambitious value learning and narrow value learning:

  • Ambitious value learning: learn human preferences over long-term outcomes.

  • Narrow value learning: learn human instrumental values and subgoals.

Inverse Reinforcement Learning

Inverse Reinforcement Learning (IRL) studies which reward best explains a behavior. Two methods of IRL were discussed (the state of the art builds on top of these two, for instance using neural networks):

  • Bayesian IRL: uses Bayesian updates. Does not work well in practice because it requires solving many Markov Decision Processes, which is computationally expensive.

  • Maximum Entropy IRL: the optimal distribution (of maximum entropy) is an exponential of a linear function. One of the reasons it performs better in practice is that the relevant integrals are easier to approximate efficiently (see the sketch just after this list).
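
To make the "exponential of a linear function" form concrete, here is a minimal sketch of Maximum Entropy IRL in Python. It relies on my own simplification, not something from the talk: the candidate trajectories are assumed to be enumerable, so the partition function is computed exactly instead of via soft value iteration, and the feature vectors and learning rate are purely illustrative.

```python
import numpy as np

# Simplified MaxEnt IRL: the trajectory distribution is assumed to be
# P(tau) proportional to exp(theta . features(tau)), and theta is fitted by
# matching the model's expected feature counts to those of the demonstrations.
# Assumption (for brevity): candidate trajectories can be enumerated, so the
# partition function is exact rather than computed by soft value iteration.

def maxent_irl(candidate_features, demo_features, lr=0.1, n_iters=500):
    """candidate_features: (n_trajectories, n_features) for all candidate trajectories.
    demo_features: (n_demos, n_features) for the demonstrated trajectories."""
    theta = np.zeros(candidate_features.shape[1])
    empirical = demo_features.mean(axis=0)           # feature counts of the demonstrations
    for _ in range(n_iters):
        logits = candidate_features @ theta          # linear reward of each trajectory
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                         # P(tau) ~ exp(theta . f(tau))
        expected = probs @ candidate_features        # model's expected feature counts
        theta += lr * (empirical - expected)         # gradient of the log-likelihood
    return theta

# Toy usage: three candidate trajectories described by two features; the
# demonstrations favour the second feature, so its weight should come out higher.
candidates = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
demos = np.array([[0.0, 1.0], [0.0, 1.0], [0.5, 0.5]])
print(maxent_irl(candidates, demos))
```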

Why not to do value learning:

  • It is (still) inefficient

  • It depends heavily on human rationality models

  • The reward might not be in the prior reward space

  • Solving other problems, such as naturalized agency, might be more urgent

  • The actions in the behavior are not well-defined in practice (e.g. what counts as an action in a football game?)

Beyond Inverse Reinforcement Learning

The main problem with traditional IRL is that it does not take into account the deliberate interactions between a human and an AI (e.g. the human could deliberately slow down their behavior to help the learning process).

Cooperative IRL addresses this issue by introducing a two-player game between the human and the AI, where both are rewarded according to the human's reward function. This incentivizes the human to teach the AI their preferences (if the human simply chose their own best action at each step, the AI would learn the wrong distribution). Using a similar dynamic, the off-switch game encourages the AI to allow itself to be switched off.
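
To illustrate the off-switch intuition numerically (my own toy computation, not part of the lecture): if the AI is uncertain about the utility U of its proposed action and believes a rational human will only let the action proceed when U > 0, then deferring to the human is worth E[max(U, 0)], which is never less than acting unilaterally, worth max(E[U], 0).

```python
import numpy as np

# Toy off-switch game: the AI is uncertain about the utility U of its proposed
# action; a rational human lets the action proceed only when U > 0.
# Value of acting directly:    max(E[U], 0)   (act iff expected utility is positive)
# Value of deferring to human: E[max(U, 0)]   (the human filters out the bad outcomes)
rng = np.random.default_rng(0)
samples = rng.normal(loc=-0.2, scale=1.0, size=100_000)  # assumed belief over U

value_act_directly = max(samples.mean(), 0.0)
value_defer = np.maximum(samples, 0.0).mean()

print(f"act directly: {value_act_directly:.3f}, defer to human: {value_defer:.3f}")
# Deferring is (weakly) better whenever the AI is uncertain and the human is
# rational, which is the incentive to leave the off switch enabled.
```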

Another difficulty when implementing IRL is that the reward function is hard to specify completely, and will often not capture everything the designer wants. Inverse reward design makes the AI quantify its uncertainty about states. If the AI is risk-averse, it will avoid uncertain states, for instance situations where it believes humans have not completely defined the reward function because they did not know much about those situations.
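
Here is a rough sketch of the risk-averse planning step. It is a simplification of inverse reward design: I assume samples from a posterior over the true reward weights are already available (IRD obtains such a posterior by conditioning on the proxy reward having been a sensible choice in the training environment), and I score candidate plans by their worst-case reward over those samples. All names and numbers are illustrative.

```python
import numpy as np

# Risk-averse planning under uncertainty about the true reward, in the spirit
# of inverse reward design. Assumption: `posterior_weights` are samples from a
# posterior over the true reward weights.

def risk_averse_choice(candidate_features, posterior_weights):
    """Pick the plan maximizing its worst-case reward over the posterior samples.
    candidate_features: (n_plans, n_features); posterior_weights: (n_samples, n_features)."""
    rewards = candidate_features @ posterior_weights.T   # (n_plans, n_samples)
    worst_case = rewards.min(axis=1)                     # worst-case reward per plan
    return int(np.argmax(worst_case))

# Toy example: plan 0 relies heavily on a feature the designer never specified
# (its weight is uncertain and possibly very negative); plan 1 stays in
# well-understood states, so the risk-averse agent prefers it.
plans = np.array([[1.0, 2.0],    # visits the "unknown" feature a lot
                  [1.0, 0.0]])   # avoids it
posterior = np.array([[1.0, 0.5], [1.0, -3.0], [1.0, 0.0]])  # uncertain second weight
print(risk_averse_choice(plans, posterior))  # -> 1, the cautious plan
```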

Agent Foundations (Abram Demski)

Abram's first talk was about his post "Probability is Real, and Value is Complex". At the end of the talk, several people (including me) were confused about the "magic correlation" between probabilities and expected utility, and asked Abram about the meaning of his talk.

From what I understood, the point was to show a counter-intuitive consequence of choosing the Jeffrey-Bolker axioms in decision theory over the Savage axioms. Because Bayesian updating can be formalized using the Jeffrey-Bolker axioms, this counter-intuitive result challenges potential agent designs that would rely on Bayesian updates.

The second talk was more general, and addressed several problems faced by embedded agents (e.g. naturalized induction).

Bounded Rationality (Daniel Filan / Daniel Braun)

To make sure an AI would be able to understand humans, we need to make sure it understands their bounded rationality, i.e. how sparse information and bounded computational power limit rationality.

Information-Theoretic Bounded Rationality (Daniel Braun)

The first talk on the topic introduced a decision complexity C(A|B) that expresses the "cost" of going from the reference B to the target A (proportional to the Shannon information of A given B). Intuitively, it represents the cost of the search process involved in going from a prior B to a posterior A. After some mathematical manipulations, a concept of "information cost" is introduced, and the final framework highlights a trade-off between an "information utility" and this "information cost" (for more details see here, pp. 14-18).
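
As far as I understand it, this trade-off can be written as a free-energy-style objective (the notation below is my reconstruction, not copied from the slides): the bounded agent chooses a distribution p over actions that maximizes expected utility minus the information cost of moving away from a prior p_0,

$$ p^*(a) \;=\; \arg\max_{p} \; \mathbb{E}_{p}[U(a)] \;-\; \tfrac{1}{\beta}\, D_{\mathrm{KL}}\!\left(p \,\|\, p_0\right), \qquad p^*(a) \;\propto\; p_0(a)\, e^{\beta U(a)}, $$

where the inverse temperature β plays the role of the computational budget: β → ∞ recovers a perfectly rational maximizer, while β → 0 leaves the prior (default) behavior unchanged.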

Human irrationality in planning (Daniel Filan)

Humans seem to exhibit a strong preference for planning hierarchically, and are "irrational" in that sense, or at least not "Boltzmann-rational" (Cundy & Filan, 2018).

Hierarchical RL is a framework used in planning that introduces "options" into Markov Decision Processes in such a way that Bellman equations still hold.
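
Concretely, the sense in which Bellman equations "still hold" is that options turn the problem into a semi-Markov decision process with its own Bellman optimality equation (this is the standard options formulation; the exact form was not spelled out in the talk): for an option o initiated in state s_t and terminating after k steps,

$$ Q(s_t, o) \;=\; \mathbb{E}\!\left[ r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{k-1} r_{t+k} \;+\; \gamma^{k} \max_{o' \in \mathcal{O}(s_{t+k})} Q(s_{t+k}, o') \right]. $$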

State-of-the-art methods in Hierarchical RL include meta-learning the hierarchy or using a two-module neural network.

Side effects (Victoria Krakovna)

Techniques for minimizing negative side effects include minimizing unnecessary disruptions when achieving a goal (e.g. turning the Earth into paperclips) and designing low-impact agents (which avoid large side effects in general).

To correctly measure impact, several questions must be answered:

  • How is change defined?

  • What was actually caused by the agent?

  • What was really necessary to achieve the objective?

  • What are the implicit consequences of the objective (e.g. a longer life expectancy after "curing cancer")?

A "side-effect measure" should penalize unnecessary actions (necessity), understand what was caused by the agent vs. caused by the environment (causation), and penalize irreversible actions (asymmetry).

Hence, an agent may be penalized for an outcome that differs from an "inaction baseline" (what would have happened had the agent done nothing) or for any irreversible action.

However, these penalties introduce bad incentives: the agent may avoid an irreversible action but still let it happen anyway (for instance, preventing a vase from being broken to gain a reward, then breaking the vase anyway to return to the "inaction baseline"). Relative reachability addresses this behavior by penalizing the agent for making states less reachable than they would be by default (for instance, breaking a vase makes all states with an unbroken vase unreachable), and it leads to safe behavior in the Sokoban-like and conveyor belt gridworlds.
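
Below is a minimal sketch of such a penalty. It is my simplification of the idea: deterministic, enumerable transitions and an undiscounted 0/1 notion of reachability, whereas the actual measure uses discounted reachability and compares against an inaction rollout; the state names are purely illustrative.

```python
from collections import deque

# Simplified relative reachability: penalize the agent for every state that is
# reachable from the baseline (inaction) state but no longer reachable from the
# state produced by the agent's actions.

def reachable(start, successors):
    """All states reachable from `start`, given a successors(state) -> list function."""
    seen, queue = {start}, deque([start])
    while queue:
        s = queue.popleft()
        for nxt in successors(s):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def relative_reachability_penalty(agent_state, baseline_state, successors):
    """Fraction of baseline-reachable states that the agent's actions made unreachable."""
    from_baseline = reachable(baseline_state, successors)
    from_agent = reachable(agent_state, successors)
    lost = from_baseline - from_agent
    return len(lost) / max(len(from_baseline), 1)

# Toy vase world: from "vase_intact" the agent can keep the vase intact or break
# it, but breaking is irreversible, so breaking it loses reachable states.
transitions = {"vase_intact": ["vase_intact", "vase_broken"], "vase_broken": ["vase_broken"]}
succ = lambda s: transitions[s]
print(relative_reachability_penalty("vase_broken", "vase_intact", succ))  # 0.5
```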

Open questions about this approach are:

  • How exactly should we compute the "inaction baseline" or the "default state"?

  • How well could it work with AGI?


I thank Daniel Filan and Jaime Molina for their feedback, and apologize for the talks I did not summarize.