[AN #109]: Teaching neural nets to generalize the way humans would


Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Better priors as a safety problem and Learning the prior (Paul Christiano) (summarized by Rohin): Any machine learning algorithm (including neural nets) has some inductive bias, which can be thought of as its “prior” over what the data it will receive will look like. In the case of neural nets (and any other general ML algorithm to date), this prior is significantly worse than human priors, since it does not encode e.g. causal reasoning or logic. Even if we restrict ourselves to priors that do not depend on previously seen data, we would still want to update on facts like “I think, therefore I am”. With a better prior, our ML models would be able to learn more sample-efficiently. While this is so far a capabilities problem, there are two main ways in which it affects alignment.

First, as argued in Inaccessible information (AN #104), the regular neural net prior will learn models which can predict accessible information. However, our goals depend on inaccessible information, and so we would have to do some “extra work” in order to extract the inaccessible information from the learned models in order to build agents that do what we want. This leads to a competitiveness hit, relative to agents whose goals depend only on accessible information, and so during training we might expect to consistently get agents whose goals depend on accessible information instead of the goals we actually want.

Second, since the regular neural net prior is so weak, there is an incentive to learn a better prior, and then have that better prior perform the task. This is effectively an incentive for the neural net to learn a mesa optimizer (AN #58), which need not be aligned with us, and so would generalize differently than we would, potentially catastrophically.

Let’s formalize this a bit more. We have some evidence about the world, given by a dataset D = {(x_1, y_1), (x_2, y_2), …} (we assume that it’s a prediction task; note that most self-supervised tasks can be written in this form). We will later need to make predictions on the dataset D’ = {x_1’, x_2’, …}, which may be from a “different distribution” than D (e.g. D might be about the past, while D’ is about the future). We would like to use D to learn some object Z that serves as a “prior”, such that we can then use Z to make good predictions on D’.

The standard approach, which we might call the “neural net prior”, is to train a model to predict y from x using the dataset D, and then apply that model directly to D’, hoping that it transfers correctly. We can inject some human knowledge by finetuning the model using human predictions on D’, that is, by training the model on {(x_1’, H(x_1’)), (x_2’, H(x_2’)), …}. However, this does not allow H to update their prior based on the dataset D. (We assume that H cannot simply read through all of D, since D is massive.)

What we’d really like is some way to get the predictions H would make if they could update on dataset D. For H, we’ll imagine that a prior Z is given by some text describing e.g. rules of logic, how to extrapolate trends, some background facts about the world, empirical estimates of key quantities, etc. I’m now going to talk about priors over the prior Z, so to avoid confusion I’ll now call an individual Z a “background model”.

The key idea here is to structure the reasoning in a particular way: H has a prior over background models Z, and then given Z, H’s predictions for any given x_i are independent of all of the other (x, y) pairs. In other words, once you’ve fixed your background model of the world, your prediction of y_i doesn’t depend on the value of y_j for some other x_j. Or to explain it a third way, this is like having a set of hypotheses {Z}, and then updating on each element of D one by one using Bayes’ Rule. In that case, the log posterior of a particular background model Z is given by log Prior(Z) + sum_i log P(y_i | x_i, Z) (neglecting a normalization constant).

The nice thing about this is that the individual terms Prior(Z) and P(y_i | x_i, Z) are all things that humans can evaluate, since they don’t require the human to look at the entire dataset D. In particular, we can learn Prior(Z) by presenting humans with a background model and having them evaluate how likely it is that the background model is accurate. Similarly, P(y_i | x_i, Z) simply requires us to have humans predict y_i under the assumption that the background facts in Z are accurate. So, we can learn models for both of these using neural nets. We can then find the best background model Z-best by optimizing the equation above, representing what H would think was the most likely background model after updating on all of D. Finally, we can learn a model for P(y_i’ | x_i’, Z-best) by training on human predictions of y_i’ given access to Z-best.
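As a concrete illustration of the selection step, here is a minimal sketch, with toy, made-up stand-ins for the two learned models (`prior` and `likelihood` are hypothetical, not from the post), of choosing Z-best by maximizing the log posterior above:

```python
import math

# Hypothetical stand-ins for the two learned models described above:
# prior(z) approximates H's prior that background model z is accurate,
# and likelihood(y, x, z) approximates H's P(y | x, z).
def prior(z):
    return {"trend continues": 0.7, "trend reverses": 0.3}[z]

def likelihood(y, x, z):
    # Toy world: under "trend continues", y = x + 1 with probability 0.9;
    # under "trend reverses", y = x - 1 with probability 0.9.
    predicted = x + 1 if z == "trend continues" else x - 1
    return 0.9 if y == predicted else 0.1

def log_posterior(z, dataset):
    """log Prior(Z) + sum_i log P(y_i | x_i, Z), up to normalization."""
    return math.log(prior(z)) + sum(math.log(likelihood(y, x, z)) for x, y in dataset)

def best_background_model(models, dataset):
    return max(models, key=lambda z: log_posterior(z, dataset))

D = [(1, 2), (2, 3), (3, 4)]  # evidence consistent with "trend continues"
z_best = best_background_model(["trend continues", "trend reverses"], D)
```

The point of the factorization is visible here: `log_posterior` only ever queries the two models on one (x, y) pair at a time, so neither the human nor the learned surrogate ever needs to ingest all of D at once.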

This of course only gets us to human performance, which requires a relatively small Z. If we want to have large background models allowing for superhuman performance, we can use iterated amplification and debate to learn Prior(Z) and P(y | x, Z). There is some subtlety about how to represent Z that I won’t go into here.

Rohin’s opinion: It seems to me like solving this problem has two main benefits. First, the model our AI system learns from data (i.e. Z-best) is interpretable, and in particular we should be able to extract the previously inaccessible information that is relevant to our goals (which helps us build AI systems that actually pursue those goals). Second, AI systems built in this way are incentivized to generalize in the same way that humans do: in the scheme above, we learn from one distribution D, and then predict on a new distribution D’, but every model learned with a neural net is only used on the same distribution it was trained on.

Of course, while the AI system is incentivized to generalize the way humans do, that does not mean it will generalize as humans do; it is still possible that the AI system internally “wants” to gain power, and only instrumentally answers questions the way humans would answer them. So inner alignment is still a potential issue. It seems possible to me that whatever techniques we use for dealing with inner alignment will also deal with the problems of unsafe priors as a side effect, in which case we may not end up needing to implement human-like priors. (As the post notes, it may be much more difficult to use this approach than the standard “neural net prior” approach described above, so it would be nice to avoid it.)



Alignment proposals and complexity classes (Evan Hubinger) (summarized by Rohin): The original debate (AN #5) paper showed that any problem in PSPACE can be solved by optimal play in a debate game judged by a (problem-specific) algorithm in P. Intuitively, this is an illustration of how the mechanism of debate can take a weak ability (the ability to solve arbitrary problems in P) and amplify it into a stronger ability (the ability to solve arbitrary problems in PSPACE). One would hope that, similarly, debate would allow us to amplify a human’s problem-solving ability into a much stronger problem-solving ability.

This post applies this technique to several other alignment proposals. In particular, for each proposal, we assume that the “human” can be an arbitrary polynomial-time algorithm and that the AI models are optimal w.r.t. their loss functions, and we ask which problems we can solve using these capabilities. The post finds that, as lower bounds, the various forms of amplification can access PSPACE, while market making (AN #108) can access EXP. If there are untamperable pointers (so that the polynomial-time algorithm can look at objects of arbitrary size, as long as it only looks at a polynomial-sized subset of them), then amplification and market making can access R (the set of decidable problems).

Rohin’s opinion: In practice our models are not going to reach the optimal loss, and humans won’t solve arbitrary polynomial-time problems, so these theorems won’t directly apply to reality. Nonetheless, this does seem like a worthwhile check to do; it feels similar to ensuring that a deep RL algorithm has a proof of convergence under idealized assumptions, even if those assumptions won’t actually hold in reality. I have much more faith in a deep RL algorithm that started from one with a proof of convergence and was then modified based on empirical considerations.

How should AI debate be judged? (Abram Demski) (summarized by Rohin): Debate (AN #5) requires a human judge to decide which of two AI debaters should win the debate. How should the judge make this decision? The discussion on this page delves into this question in some depth.


What counts as defection? (Alex Turner) (summarized by Rohin): We often talk about cooperating and defecting in general-sum games. This post proposes that we say that a player P has defected against a coalition C (that includes P) currently playing a strategy S when P deviates from the strategy S in a way that increases their own personal utility, but decreases the (weighted) average utility of the coalition. The post shows that this definition has several nice intuitive properties: it implies that defection cannot exist in common-payoff games, uniformly weighted constant-sum games, or arbitrary games with a Nash equilibrium strategy. A Pareto improvement can also never be defection. It then goes on to show that the opportunity for defection can exist in the Prisoner’s Dilemma, Stag Hunt, and Chicken (whether it exists depends on the specific payoff matrices).
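For the two-player, uniformly weighted case, the definition can be sketched directly in code (a minimal illustration, not the post’s own formalism; the payoff values below are the conventional Prisoner’s Dilemma numbers, chosen for illustration):

```python
# Payoffs are dicts mapping (action_1, action_2) -> (payoff_1, payoff_2).
def is_defection(payoffs, strategy, player, deviation):
    """A player defects iff deviating raises their own payoff
    while lowering the (uniformly weighted) average payoff."""
    new_strategy = list(strategy)
    new_strategy[player] = deviation
    old = payoffs[tuple(strategy)]
    new = payoffs[tuple(new_strategy)]
    personal_gain = new[player] > old[player]
    average_drop = sum(new) / len(new) < sum(old) / len(old)
    return personal_gain and average_drop

# Prisoner's Dilemma with conventional payoffs.
pd = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

# Deviating from mutual cooperation to D is a defection here:
# personal payoff rises 3 -> 5, while the average falls 3 -> 2.5.
result = is_defection(pd, ("C", "C"), player=0, deviation="D")
```

Deviating from mutual defection to C, by contrast, is not a defection under this definition, since the deviator’s own payoff falls.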


Environments as a bottleneck in AGI development (Richard Ngo) (summarized by Rohin): Models built using deep learning are a function of the learning algorithm, the architecture, and the task / environment / dataset. While a lot of effort is spent on analyzing learning algorithms and architectures, not much is spent on the environment. This post asks how important it is to design a good environment in order to build AGI.

It considers two possibilities: the “easy paths hypothesis”, that many environments would incentivize AGI, and the “hard paths hypothesis”, that such environments are rare. (Note that “hard paths” can be true even if an AGI would be optimal for most environments: if AGI would be optimal, but there is no path in the loss landscape to AGI that is steeper than other paths in the loss landscape, then we probably wouldn’t find AGI in that environment.)

The main argument for “hard paths” is to look at the history of AI research, where we often trained agents on tasks that were “hallmarks of intelligence” (like chess) and then found that the resulting systems were narrowly good at the particular task, but were not generally intelligent. You might think that it can’t be too hard, since our environment led to the creation of general intelligence (us), but this is subject to anthropic bias: only worlds with general intelligence would ask whether environments incentivize general intelligence, so they will always observe that their environment is an example that incentivizes general intelligence. It can serve as a proof of existence, but not as an indicator that it is particularly likely.

Rohin’s opinion: I think this is an important question for AI timelines, and the plausibility of “hard paths” is one of the central reasons that my timelines are longer than those of others who work on deep learning-based AGI. However, GPT-3 (AN #102) demonstrates quite a lot of generality, so recently I’ve started putting more weight on “actually, designing the environment won’t be too hard”, which has correspondingly shortened my timelines.


Talk: Key Issues In Near-Term AI Safety Research (Aryeh Englander) (summarized by Rohin): This talk points out synergies between long-term AI safety and the existing fields of assured autonomy, safety engineering, and testing, evaluation, verification and validation (TEV&V), primarily by showing how they fit into and expand DeepMind’s framework of specification, robustness and assurance (AN #26).



Using Selective Attention in Reinforcement Learning Agents (Yujin Tang et al) (summarized by Sudhanshu): Recently winning a best paper award at GECCO 2020, this work marks a leap forward in the performance capabilities learned by small agents via evolutionary methods. Specifically, it shows that by jointly learning which small fraction of the input to attend to, agents with only thousands of free parameters can be trained by an evolutionary strategy to achieve state-of-the-art performance in vision-based control tasks.

The key pieces include self-attention over input patches, non-differentiable top-K patch selection that effects ‘inattentional blindness’, and training via CMA-ES. By design, the agent is interpretable, as the top-K patches that are selected can be examined. Empirically, the agent has 1000x fewer weights than a competing neural architecture, and the method shows robustness to changes in task-irrelevant inputs, as the agent learns to focus only on task-relevant patches.
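The selection mechanism can be sketched as follows. This is a toy reconstruction, not the authors’ code: random projections stand in for the evolved attention weights, and the shapes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_k_patches(patches, k, d_att=4):
    """Score patches with self-attention, then hard-select the K most-attended.

    patches: (n, d) array of flattened patch features.
    Returns the indices of the K patches the controller is allowed to see;
    everything else is discarded ('inattentional blindness').
    """
    n, d = patches.shape
    W_q = rng.normal(size=(d, d_att))  # in the paper these weights are
    W_k = rng.normal(size=(d, d_att))  # evolved via CMA-ES, not learned by SGD
    q, key = patches @ W_q, patches @ W_k
    scores = q @ key.T / np.sqrt(d_att)                      # attention logits
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)            # row-wise softmax
    importance = weights.sum(axis=0)   # total attention each patch receives
    return np.argsort(importance)[-k:]  # non-differentiable top-K selection

patches = rng.normal(size=(16, 9))  # e.g. 16 flattened 3x3 patches
selected = top_k_patches(patches, k=5)
```

Because only the K selected indices are passed downstream, the agent’s “attention” is directly inspectable, which is where the interpretability claim comes from; the argsort-based selection is also why the pipeline is non-differentiable and needs an evolutionary strategy.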

Read more: Paper: Neuroevolution of Self-Interpretable Agents

Sudhanshu’s opinion: The parallelism afforded by evolutionary methods and genetic algorithms might be valuable in an environment where weak compute is plentiful, so it’s exciting to see evidence of such methods besting GPU-hungry deep neural networks. However, I wonder how this would do on sparse reward tasks, where the fitness function is almost always uninformative. Finally, while it generalises to settings where there are task-irrelevant distractions, its deliberately sharp self-attention likely leaves it vulnerable to even simple adversarial attacks.

Improving Sample Efficiency in Model-Free Reinforcement Learning from Images (Denis Yarats et al) (summarized by Flo): Sample efficiency in RL can be improved by using off-policy methods that can reuse the same sample multiple times, and by using self-supervised auxiliary losses that help with representation learning, especially when rewards are sparse. This work combines both approaches by proposing to learn a latent state representation using an autoencoder while jointly training an agent on that latent representation using SAC (AN #42). Previous work in the on-policy case shows a positive effect from propagating Actor-Critic gradients through the encoder to improve the usefulness of the encoding for policy learning. However, this destabilizes training in the off-policy case, as changing the encoding to facilitate the actor also changes the Q-function estimate, which in turn changes the actor’s goal and can introduce nonstationarity. This problem is circumvented by only propagating the Q-network’s gradients through the encoder while blocking the actor’s gradients.
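The gradient-routing trick can be made concrete with a schematic scalar example (a hand-derived sketch, not the paper’s code; each network is reduced to a single weight so the gradient paths are explicit):

```python
def sac_ae_step(params, x, lr=0.1):
    """One schematic update: the critic loss trains the encoder, the actor loss does not.

    Each 'network' is a single scalar weight; losses are 0.5 * output^2,
    so gradients can be written out by hand.
    """
    w_enc, w_q, w_pi = params["enc"], params["q"], params["pi"]
    z = w_enc * x            # latent state produced by the encoder
    q = w_q * z              # critic's value estimate
    # Critic loss 0.5 * q^2: its gradient IS propagated through the encoder.
    g_q = q * z              # dL_Q / dw_q
    g_enc = q * w_q * x      # dL_Q / dw_enc (flows through the encoder)
    # Actor loss 0.5 * (w_pi * z)^2 with z treated as a constant
    # (a "stop-gradient" / detach): no encoder term, which avoids the
    # nonstationarity described above.
    g_pi = (w_pi * z) * z    # dL_pi / dw_pi only; no dL_pi / dw_enc
    return {"enc": w_enc - lr * g_enc, "q": w_q - lr * g_q, "pi": w_pi - lr * g_pi}

params = {"enc": 1.0, "q": 1.0, "pi": 1.0}
params = sac_ae_step(params, x=0.5)
```

In a real implementation the same routing is typically achieved by detaching the latent before feeding it to the actor, so that autodiff simply has no path from the actor loss back to the encoder.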

The method strongly outperforms SAC trained on pixels. It also matches the previous state of the art set by model-based approaches on an image-based continuous control task, and outperforms them for noisy observations (as these make dynamics models hard to learn). The authors also find that the learnt encodings generalize between tasks to some extent, and that reconstructing the true environment state is easier using their latent representation than using a representation obtained by training SAC on pixels directly.

Flo’s opinion: Methods like this that can benefit from seeing a lot of action-independent environment observations might be quite important for applying RL to the real world, as this type of data is a lot cheaper to generate. For example, we can easily generate a ton of observations from a factory by equipping workers with cameras, but state-action-next-state triples from a robot interacting with the factory are very costly to obtain.


I’m always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.