[AN #57] Why we should focus on robustness in AI safety, and the analogous problems in programming


Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can sign up, or look through this spread­sheet of all sum­maries that have ever been in the newslet­ter. I’m always happy to hear feed­back; you can send it to me by com­ment­ing on this post.


De­sign­ing ro­bust & re­li­able AI sys­tems and how to suc­ceed in AI (Rob Wiblin and Push­meet Kohli): (As is typ­i­cal for large con­tent, I’m only sum­ma­riz­ing the most salient points, and ig­nor­ing en­tire sec­tions of the pod­cast that didn’t seem as rele­vant.)

In this podcast, Rob delves into the details of Pushmeet’s work on making AI systems robust. Pushmeet doesn’t view AI safety and AI capabilities as particularly distinct: part of building a good AI system is ensuring that the system is safe, robust, reliable, and generalizes well. Otherwise, it won’t do what we want, so why would we even bother using it? He aims to improve robustness by actively searching for behaviors that violate the specification, or by formally verifying particular properties of the neural net. That said, he also thinks that one of the major challenges here is in figuring out the specification of what to verify in the first place.

He sees the prob­lems in AI as be­ing similar to the ones that arise in pro­gram­ming and com­puter se­cu­rity. In pro­gram­ming, it is of­ten the case that the pro­gram that one writes down does not ac­cu­rately match the in­tended speci­fi­ca­tion, lead­ing to bugs. Often we sim­ply ac­cept that these bugs hap­pen, but for se­cu­rity crit­i­cal sys­tems such as traf­fic lights we can use tech­niques like test­ing, fuzzing, sym­bolic ex­e­cu­tion, and for­mal ver­ifi­ca­tion that al­low us to find these failures in pro­grams. We now need to de­velop these tech­niques for ma­chine learn­ing sys­tems.
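To make the programming side of the analogy concrete, here is a minimal sketch of fuzzing a program against an executable specification. The function `clamp_buggy`, the checker `meets_spec`, and the input ranges are all hypothetical illustrations, not anything discussed in the podcast.

```python
import random

def clamp_buggy(x, lo, hi):
    """Intended behavior: limit x to the closed interval [lo, hi]."""
    if x < lo:
        return lo
    if x > hi:
        return lo  # bug: the specification calls for hi here
    return x

def meets_spec(x, lo, hi, out):
    """Executable form of the (hypothetical) specification for clamp."""
    if x < lo:
        return out == lo
    if x > hi:
        return out == hi
    return out == x

def fuzz(fn, trials=1000, seed=0):
    """Throw random inputs at fn; return the first spec violation, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        lo = rng.randint(-100, 100)
        hi = rng.randint(lo, lo + 100)
        x = rng.randint(-300, 300)
        out = fn(x, lo, hi)
        if not meets_spec(x, lo, hi, out):
            return (x, lo, hi, out)
    return None

counterexample = fuzz(clamp_buggy)
print("spec violation found:", counterexample)
```

Note that the buggy output still lies inside [lo, hi], so a weaker specification that only checked the range would miss the bug. That is a small-scale version of the point that choosing what to verify is itself hard.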

The anal­ogy can go much fur­ther. Static anal­y­sis in­volves un­der­stand­ing prop­er­ties of a pro­gram sep­a­rately from any in­puts, while dy­namic anal­y­sis in­volves un­der­stand­ing a pro­gram with a spe­cific in­put. Similarly, we can have “static” in­ter­pretabil­ity, which un­der­stands the model as a whole (as in Fea­ture vi­su­al­iza­tion), or “dy­namic” in­ter­pretabil­ity, which ex­plains the model’s out­put for a par­tic­u­lar in­put. Another ex­am­ple is that the tech­nique of ab­stract in­ter­pre­ta­tion of pro­grams is analo­gous to a par­tic­u­lar method for ver­ify­ing prop­er­ties of neu­ral nets.
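As a rough illustration of what an abstract-interpretation-style check on a neural net can look like, the sketch below propagates an interval (a box of possible inputs) through a tiny ReLU network to bound its output. This is interval bound propagation, which is one simple technique in that family; whether it is the particular method Pushmeet had in mind is an assumption, and the network weights and input region are made up.

```python
def affine_bounds(lower, upper, weights, bias):
    """Propagate the box [lower, upper] through y = W x + b."""
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            # A non-negative weight maps lower to lower and upper to upper;
            # a negative weight swaps the two.
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi

def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to each bound directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]

def network_bounds(lower, upper, layers):
    """layers is a list of (weights, bias); ReLU follows every layer but the last."""
    for i, (weights, bias) in enumerate(layers):
        lower, upper = affine_bounds(lower, upper, weights, bias)
        if i < len(layers) - 1:
            lower, upper = relu_bounds(lower, upper)
    return lower, upper

# Hypothetical 2-2-1 network and input region [0, 1] x [0, 1].
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[1.0, -2.0]], [0.5]),
]
out_lo, out_hi = network_bounds([0.0, 0.0], [1.0, 1.0], layers)
print("output is guaranteed to lie in", (out_lo[0], out_hi[0]))
```

The bounds are sound but can be loose: if the lower bound is positive, the property "output > 0" holds for every input in the box, but a negative lower bound would not by itself prove that a violation exists.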

This anal­ogy sug­gests that we have faced the prob­lems of AI safety be­fore, and have made sub­stan­tial progress on them; the challenge is now in do­ing it again but with ma­chine learn­ing sys­tems. That said, there are some prob­lems that are unique to AGI-type sys­tems; it’s just not the speci­fi­ca­tion prob­lem. For ex­am­ple, it is ex­tremely un­clear how we should com­mu­ni­cate with such a sys­tem, which may have its own con­cepts and mod­els that are very differ­ent from those of hu­mans. We could try to use nat­u­ral lan­guage, but if we do we need to ground the nat­u­ral lan­guage in the way that hu­mans do, and it’s not clear how we could do that, though per­haps we could test if the learned con­cepts gen­er­al­ize to new set­tings. We could also try to look at the weights of our ma­chine learn­ing model and an­a­lyze whether it has learned the con­cept—but only if we already have a for­mal speci­fi­ca­tion of the con­cept, which seems hard to get.

Ro­hin’s opinion: I re­ally like the anal­ogy be­tween pro­gram­ming and AI; a lot of my thoughts have been shaped by think­ing about this anal­ogy my­self. I agree that the anal­ogy im­plies that we are try­ing to solve prob­lems that we’ve at­tacked be­fore in a differ­ent con­text, but I do think there are sig­nifi­cant differ­ences now. In par­tic­u­lar, with long-term AI safety we are con­sid­er­ing a set­ting in which mis­takes can be ex­tremely costly, and we can’t provide a for­mal speci­fi­ca­tion of what we want. Con­trast this to traf­fic lights, where mis­takes can be ex­tremely costly but I’m guess­ing we can provide a for­mal speci­fi­ca­tion of the safety con­straints that need to be obeyed. To be fair, Push­meet ac­knowl­edges this and high­lights speci­fi­ca­tion learn­ing as a key area of re­search, but to me it feels like a qual­i­ta­tive differ­ence from pre­vi­ous prob­lems we’ve faced, whereas I think Push­meet would dis­agree with that (but I’m not sure why).

Read more: Towards Ro­bust and Ver­ified AI: Speci­fi­ca­tion Test­ing, Ro­bust Train­ing, and For­mal Ver­ifi­ca­tion (AN #52)

Tech­ni­cal AI alignment

Learn­ing hu­man intent

Per­cep­tual Values from Ob­ser­va­tion (Ash­ley D. Ed­wards et al) (sum­ma­rized by Cody): This pa­per pro­poses a tech­nique for learn­ing from raw ex­pert-tra­jec­tory ob­ser­va­tions by as­sum­ing that the last state in the tra­jec­tory is the state where the goal was achieved, and that other states have value in pro­por­tion to how close they are to a ter­mi­nal state in demon­stra­tion tra­jec­to­ries. They use this as a ground­ing to train mod­els pre­dict­ing value and ac­tion-value, and then use these es­ti­mated val­ues to de­ter­mine ac­tions.
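One way to read "value in proportion to how close they are to a terminal state" is as a discounted distance-to-goal label. The sketch below assumes a discount factor `gamma` and the labeling rule `gamma ** steps_remaining`; both are illustrative assumptions, and the paper's actual targets may differ.

```python
def value_targets(trajectory, gamma=0.9):
    """Attach a value target to each state of one demonstration.

    Assumes the final state is where the goal was achieved (target 1.0)
    and that earlier states decay geometrically with steps remaining.
    """
    last = len(trajectory) - 1
    return [(state, gamma ** (last - t)) for t, state in enumerate(trajectory)]

# Hypothetical demonstration; real inputs would be raw observations such as frames.
demo = ["start", "hallway", "door", "goal"]
targets = value_targets(demo, gamma=0.5)
for state, value in targets:
    print(state, value)
```

A value model trained on labels like these would then be used to rank candidate actions, which is where the goal-directedness assumption matters: a video that wanders or ends somewhere arbitrary produces misleading labels.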

Cody’s opinion: This idea definitely gets points for be­ing a clear and easy-to-im­ple­ment heuris­tic, though I worry it may have trou­ble with videos that don’t match its goal-di­rected as­sump­tion.

Del­ega­tive Re­in­force­ment Learn­ing (Vanessa Kosoy): Con­sider en­vi­ron­ments that have “traps”: states that per­ma­nently cur­tail the long-term value that an agent can achieve. A world with­out hu­mans could be one such trap. Traps could also hap­pen af­ter any ir­re­versible ac­tion, if the new state is not as use­ful for achiev­ing high re­wards as the old state.

In such an en­vi­ron­ment, an RL al­gorithm could sim­ply take no ac­tions, in which case it in­curs re­gret that is lin­ear in the num­ber of timesteps so far. (Re­gret is the differ­ence be­tween the ex­pected re­ward un­der the op­ti­mal policy and the policy ac­tu­ally ex­e­cuted, so if the av­er­age re­ward per timestep of the op­ti­mal policy is 2 and do­ing noth­ing is always re­ward 0, then the re­gret will be ~2T where T is the num­ber of timesteps, so re­gret is lin­ear in the num­ber of timesteps.) Can we find an RL al­gorithm that will guaran­tee re­gret sub­lin­ear in the num­ber of timesteps, re­gard­less of the en­vi­ron­ment?
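The arithmetic in the parenthetical can be checked directly. The sketch below uses the numbers from the example (optimal average reward 2, idle reward 0) and, for contrast, a made-up learner that wastes roughly sqrt(T) steps before acting optimally; that learner is purely a hypothetical illustration of sublinear regret.

```python
import math

def regret(optimal_per_step, rewards):
    """Total regret: reward the optimal policy would expect minus reward obtained."""
    return optimal_per_step * len(rewards) - sum(rewards)

def idle_rewards(T):
    """An agent that never acts and always receives reward 0."""
    return [0.0] * T

def hypothetical_learner_rewards(T):
    """Assumed learner: earns 0 for about sqrt(T) exploratory steps, then 2 per step."""
    explore = math.isqrt(T)
    return [0.0] * explore + [2.0] * (T - explore)

for T in (100, 10_000):
    linear = regret(2.0, idle_rewards(T))
    sublinear = regret(2.0, hypothetical_learner_rewards(T))
    # Regret per timestep stays at 2 for the idle agent but shrinks for the learner.
    print(T, linear / T, sublinear / T)
```

Sublinear regret is exactly the statement that the second ratio tends to zero as T grows, meaning the agent's average reward approaches that of the optimal policy.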

Un­sur­pris­ingly, this is im­pos­si­ble, since dur­ing ex­plo­ra­tion the RL agent could fall into a trap, which leads to lin­ear re­gret. How­ever, let’s sup­pose that we could del­e­gate to an ad­vi­sor who knows the en­vi­ron­ment: what must be true about the ad­vi­sor for us to do bet­ter? Clearly, the ad­vi­sor must be able to always avoid traps (oth­er­wise the same prob­lem oc­curs). How­ever, this is not enough: get­ting sub­lin­ear re­gret also re­quires us to ex­plore enough to even­tu­ally find the op­ti­mal policy. So, the ad­vi­sor must have at least some small prob­a­bil­ity of be­ing op­ti­mal, which the agent can then learn from. This pa­per proves that with these as­sump­tions there does ex­ist an al­gorithm that is guaran­teed to get sub­lin­ear re­gret.

Ro­hin’s opinion: It’s in­ter­est­ing to see what kinds of as­sump­tions are nec­es­sary in or­der to get AI sys­tems that can avoid catas­troph­i­cally bad out­comes, and the no­tion of “traps” seems like a good way to for­mal­ize this. I worry about there be­ing a Carte­sian bound­ary be­tween the agent and the en­vi­ron­ment, though per­haps even here as long as the ad­vi­sor is aware of prob­lems caused by such a bound­ary, they can be mod­eled as traps and thus avoided.

Of course, if we want the ad­vi­sor to be a hu­man, both of the as­sump­tions are un­re­al­is­tic, but I be­lieve Vanessa’s plan is to make the as­sump­tions more re­al­is­tic in or­der to see what as­sump­tions are ac­tu­ally nec­es­sary.

One thing I won­der about is whether the fo­cus on traps is nec­es­sary. With the pres­ence of traps in the the­o­ret­i­cal model, one of the main challenges is in pre­vent­ing the agent from fal­ling into a trap due to ig­no­rance. How­ever, it seems ex­tremely un­likely that an AI sys­tem man­ages to take some ir­re­versible catas­trophic ac­tion by ac­ci­dent—I’m much more wor­ried about the case where the AI sys­tem is ad­ver­sar­i­ally op­ti­miz­ing against us and in­ten­tion­ally takes an ir­re­versible catas­trophic ac­tion.

Re­ward learn­ing theory

By de­fault, avoid am­bigu­ous dis­tant situ­a­tions (Stu­art Arm­strong)

Han­dling groups of agents

PRECOG: PREdiction Conditioned On Goals in Visual Multi-Agent Settings (Nicholas Rhinehart et al) (summarized by Cody): This paper models a multi-agent self-driving car scenario by developing a model of future states conditional on both its own action and the actions of multiple humans, and picking the latent-space action that balances between the desiderata of reaching its goal and preferring trajectories seen in the expert multi-agent trajectories it’s shown (where, e.g., two human agents rarely crash into one another).

Mis­cel­la­neous (Align­ment)

Re­in­force­ment learn­ing with im­per­cep­ti­ble re­wards (Vanessa Kosoy): Typ­i­cally in re­in­force­ment learn­ing, the re­ward func­tion is defined over ob­ser­va­tions and ac­tions, rather than di­rectly on states, which en­sures that the re­ward can always be calcu­lated. How­ever, in re­al­ity we care about un­der­ly­ing as­pects of the state that may not eas­ily be com­puted from ob­ser­va­tions. We can’t guaran­tee sub­lin­ear re­gret, since if you are un­sure about the re­ward in some un­ob­serv­able part of the state that your ac­tions nonethe­less af­fect, then you can never learn the re­ward and ap­proach op­ti­mal­ity.
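A toy version of the problem, under heavy assumptions: the environment below has a hidden switch that the agent's action toggles but that never shows up in observations, and two candidate reward functions that disagree about which switch position is good. The dynamics, names, and rewards are all hypothetical and are not taken from the post.

```python
def step(hidden, action):
    """Hypothetical dynamics: 'press' toggles a hidden switch, 'wait' does nothing."""
    return (not hidden) if action == "press" else hidden

def observation(hidden):
    """The switch is never visible, so every state looks the same to the agent."""
    return "blank"

candidate_rewards = {
    "on_is_good": lambda hidden: 1.0 if hidden else 0.0,
    "off_is_good": lambda hidden: 0.0 if hidden else 1.0,
}

def rollout(actions, hidden=False):
    """Return the observation sequence and the true hidden-state sequence."""
    observations, states = [], []
    for action in actions:
        hidden = step(hidden, action)
        observations.append(observation(hidden))
        states.append(hidden)
    return observations, states

obs_press, states_press = rollout(["press", "wait", "wait"])
obs_wait, states_wait = rollout(["wait", "wait", "wait"])

returns = {
    name: (sum(map(reward, states_press)), sum(map(reward, states_wait)))
    for name, reward in candidate_rewards.items()
}
print(obs_press == obs_wait, returns)
```

The two plans produce identical observations, yet which one is better flips depending on the reward hypothesis. No amount of experience separates the hypotheses, so the agent cannot converge on the better plan, which is the obstacle to sublinear regret described above.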

To fix this, we can work with re­wards that are re­stricted to in­stru­men­tal states only. I don’t un­der­stand ex­actly how these work, since I don’t know the math used in the for­mal­iza­tion, but I be­lieve the idea is for the set of in­stru­men­tal states to be defined such that for any two in­stru­men­tal states, there ex­ists some “ex­per­i­ment” that the agent can run in or­der to dis­t­in­guish be­tween the states in some finite time. The main point of this post is that we can es­tab­lish a re­gret bound for MDPs (not POMDPs yet), as­sum­ing that there are no traps.

AI strat­egy and policy

Beijing AI Prin­ci­ples: Th­ese prin­ci­ples are a col­lab­o­ra­tion be­tween Chi­nese academia and in­dus­try, and hit upon many of the prob­lems sur­round­ing AI dis­cussed to­day, in­clud­ing fair­ness, ac­countabil­ity, trans­parency, di­ver­sity, job au­toma­tion, re­spon­si­bil­ity, ethics, etc. Notably for long-ter­mists, it speci­fi­cally men­tions con­trol risks, AGI, su­per­in­tel­li­gence, and AI races, and calls for in­ter­na­tional col­lab­o­ra­tion in AI gov­er­nance.

Read more: Beijing pub­lishes AI eth­i­cal stan­dards, calls for int’l cooperation

Other progress in AI

Deep learning

Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask (Hattie Zhou, Janice Lan, Rosanne Liu et al) (summarized by Cody): This paper runs a series of experimental ablation studies to better understand the limits of the Lottery Ticket Hypothesis, and investigate variants of the initial pruning and masking procedure under which its effects are more and less pronounced. It is first and foremost a list of interesting results, without any central theory tying them together. These results include the observation that keeping pruned weights the same sign as their “lottery ticket” initialization seems more important than keeping their exact initial magnitudes, that taking a mixed strategy of zeroing pruned weights or freezing them at initialization can get better results, and that applying a learned 0-1 mask to a re-initialized network can get surprisingly high accuracy even without re-training.
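For readers who haven't seen the mechanics, here is a minimal sketch of a 0-1 mask and two ways of resetting the surviving weights. It assumes magnitude-based masking on a flat list of weights and reads the "sign" result as applying to the weights the mask keeps. The function names, the shared `scale`, and the example numbers are hypothetical simplifications, not the paper's exact procedures.

```python
def magnitude_mask(trained, keep_fraction):
    """0-1 mask that keeps the largest-magnitude trained weights (ties may keep extra)."""
    k = max(1, round(len(trained) * keep_fraction))
    threshold = sorted((abs(w) for w in trained), reverse=True)[k - 1]
    return [1 if abs(w) >= threshold else 0 for w in trained]

def rewind(initial, mask):
    """Kept weights return to their initial values; masked-out weights become 0."""
    return [w * m for w, m in zip(initial, mask)]

def sign_constant(initial, mask, scale):
    """Variant: kept weights retain only their initial sign, with a shared magnitude."""
    return [m * scale * (1 if w >= 0 else -1) for w, m in zip(initial, mask)]

# Hypothetical weights before and after training.
initial = [0.3, -0.1, 0.05, -0.4]
trained = [0.9, -0.05, 0.02, -1.2]

mask = magnitude_mask(trained, keep_fraction=0.5)
print("mask:", mask)
print("rewound:", rewind(initial, mask))
print("sign-only:", sign_constant(initial, mask, scale=0.2))
```

In the experiments summarized above, a network reset in one of these ways is then retrained (or, for the learned-mask result, evaluated directly) and compared on accuracy. The sketch only shows how the starting weights differ between variants.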

Cody’s opinion: While it cer­tainly would have been ex­cit­ing to have a pa­per pre­sent­ing a unified (and em­piri­cally sup­ported) the­o­ret­i­cal un­der­stand­ing of the LTH, I re­spect the fact that this is such a purely em­piri­cal work, that tries to do one thing—de­sign­ing and run­ning clean, clear ex­per­i­ments—and does it well, with­out try­ing to con­struct ex­pla­na­tions just for the sake of hav­ing them. We still have a ways to go in un­der­stand­ing the op­ti­miza­tion dy­nam­ics un­der­ly­ing lot­tery tick­ets, but these seem like im­por­tant and valuable data points on the road to that un­der­stand­ing.

Read more: Cody’s longer summary


Challenges of Real-World Re­in­force­ment Learn­ing (Gabriel Du­lac-Arnold et al) (sum­ma­rized by Cody): This pa­per is a fairly clear and well-done liter­a­ture re­view fo­cus­ing on the difficul­ties that will need to be over­come in or­der to train and de­ploy re­in­force­ment learn­ing on real-world prob­lems. They de­scribe each of these challenges—which range from slow simu­la­tion speeds, to the need to fre­quently learn off-policy, to the im­por­tance of safety in real world sys­tems—and for each pro­pose or re­fer to an ex­ist­ing met­ric to cap­ture how well a given RL model ad­dresses the challenge. Fi­nally, they pro­pose a mod­ified ver­sion of a hu­manoid en­vi­ron­ment with some of these real-world-style challenges baked in, and en­courage other re­searchers to test sys­tems within this frame­work.

Cody’s opinion: This is a great in­tro­duc­tion and overview for peo­ple who want to bet­ter un­der­stand the gaps be­tween cur­rent RL and prac­ti­cally de­ploy­able RL. I do wish the au­thors had spent more time ex­plain­ing and clar­ify­ing the de­sign of their pro­posed testbed sys­tem, since the de­scrip­tions of it are all fairly high level.


Offer of collaboration and/or mentorship (Vanessa Kosoy): This is exactly what it sounds like. You can find out more about Vanessa’s research agenda from The Learning-Theoretic AI Alignment Research Agenda (AN #13), and I’ve summarized two of her recent posts in this newsletter.

Hu­man-al­igned AI Sum­mer School (Jan Kul­veit et al): The sec­ond Hu­man-al­igned AI Sum­mer School will be held in Prague from July 25-28, with a fo­cus on “op­ti­miza­tion and de­ci­sion-mak­ing”. Ap­pli­ca­tions are due June 15.

Open Phil AI Fel­low­ship — 2019 Class: The Open Phil AI Fel­lows for this year have been an­nounced! Con­grat­u­la­tions to all of the fel­lows :)

TAISU—Tech­ni­cal AI Safety Un­con­fer­ence (Linda Linse­fors)

Learn­ing-by-do­ing AI Safety work­shop (Linda Linse­fors)