Benign model-free RL

In my last post, I de­scribed three re­search ar­eas in AI con­trol that I see as cen­tral: re­ward learn­ing, ro­bust­ness, and de­liber­a­tion.

In this post I ar­gue that these three pieces may be suffi­cient to get a be­nign and com­pet­i­tive ver­sion of model-free re­in­force­ment learn­ing. I think this is an im­por­tant in­ter­me­di­ate goal of solv­ing AI con­trol.

This post doesn’t dis­cuss be­nign model-based RL at all, which I think is an­other key ob­sta­cle for pro­saic AI con­trol.

(This post over­laps ex­ten­sively with my post on ALBA, but I hope this one will be much clearer. Tech­ni­cally, ALBA is an im­ple­men­ta­tion of the gen­eral strat­egy out­lined in this post. I think the gen­eral strat­egy is much more im­por­tant than that par­tic­u­lar im­ple­men­ta­tion.)


Re­ward learn­ing and robustness

Given a be­nign agent H, re­ward learn­ing al­lows us to con­struct a re­ward func­tion r that can be used to train a weaker be­nign agent A. If our train­ing pro­cess is ro­bust, the re­sult­ing agent A will re­main be­nign off of the train­ing dis­tri­bu­tion (though it may be in­com­pe­tent off of the train­ing dis­tri­bu­tion).

Schematically, we can think of reward learning + robustness as a widget which takes a slow, benign process H and produces a fast, benign process A:

A’s ca­pa­bil­ities should be roughly the “in­ter­sec­tion” of H’s ca­pa­bil­ities and our RL al­gorithms’ com­pe­tence. That is, A should be able to perform a task when­ever both H can perform that task and our RL al­gorithms can learn to perform that task.

In these pic­tures, the ver­ti­cal axis cor­re­sponds in­tu­itively to “ca­pa­bil­ity,” with higher agents be­ing more ca­pa­ble. But in re­al­ity I’m think­ing of the pos­si­ble ca­pa­bil­ities as form­ing a com­plete lat­tice. That is, a generic pair of lev­els of ca­pa­bil­ities is in­com­pa­rable, with nei­ther strictly dom­i­nat­ing the other.


Capability amplification

If we iteratively apply reward learning and robustness, we will obtain a sequence of weaker and weaker agents. To get anywhere, we need some mechanism that lets us produce a stronger agent.

The capability amplification problem is to start with a weak agent A and a human expert H, and to produce a significantly more capable agent Hᴬ. The more capable agent can take a lot longer to think; all we care about is that it eventually arrives at better decisions than A. The key challenge is ensuring that Hᴬ remains benign, i.e. that the system doesn’t acquire new preferences as it becomes more capable.

An ex­am­ple ap­proach is to provide A as an as­sis­tant to H. We can give H an hour to de­liber­ate, and let it con­sult A thou­sands of times dur­ing that hour. Hᴬ’s out­put is then what­ever H out­puts at the end of that pro­cess. Be­cause H is con­sult­ing A a large num­ber of times, we can hope that the re­sult­ing sys­tem will be much smarter than A. Of course, the re­sult­ing sys­tem will be thou­sands of times more com­pu­ta­tion­ally ex­pen­sive than A, but that’s fine.
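To make the "consult A many times" idea concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than part of the proposal: `weak_agent` stands in for A and can only add two numbers, and `amplified` stands in for Hᴬ, with the overseer's role reduced to walking through a list and delegating each step to the agent within a fixed consultation budget.

```python
def weak_agent(a, b):
    """Hypothetical fast agent A: it can only add two numbers."""
    return a + b


def amplified(numbers, agent, budget=1000):
    """Toy H^A: sum a list by consulting `agent` once per element.

    Returns the answer together with the number of consultations used.
    """
    calls = 0
    total = 0
    for x in numbers:
        if calls >= budget:
            raise RuntimeError("consultation budget exhausted")
        total = agent(total, x)  # one call to the fast agent
        calls += 1
    return total, calls


answer, calls = amplified(range(1, 101), weak_agent)
print(answer, calls)
```

The combined system solves a task (summing a hundred numbers) that the weak agent cannot solve in one call, at a cost proportional to the number of consultations.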

In gen­eral, meta-ex­e­cu­tion is my cur­rent preferred ap­proach to ca­pa­bil­ity am­plifi­ca­tion.

Schemat­i­cally, we can think of am­plifi­ca­tion as a wid­get which takes a fast, be­nign pro­cess A and pro­duces a slow, be­nign pro­cess Hᴬ:

Put­ting it together

With these two wid­gets in hand, we can iter­a­tively pro­duce a se­quence of in­creas­ingly com­pe­tent agents:

That is, we start with our be­nign ex­pert H. We then learn a re­ward func­tion and train an agent A, which is less ca­pa­ble than H but can run much faster. By run­ning many in­stances of A, we ob­tain a more pow­er­ful agent Hᴬ, which is ap­prox­i­mately as ex­pen­sive as H.

We can then re­peat the pro­cess, us­ing Hᴬ to train an agent A⁺ which runs as fast as A but is more ca­pa­ble. By run­ning A⁺ for a long time we ob­tain a still more ca­pa­ble agent Hᴬ⁺, and the cy­cle re­peats.
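The alternation described above can be sketched as a short loop. This is only a schematic, not an implementation: `distill` is my shorthand for "reward learning + robust training" (slow to fast), `amplify` stands for capability amplification (fast to slow), and both are hypothetical callables supplied by the caller.

```python
from typing import Callable, List, TypeVar

Agent = TypeVar("Agent")


def iterate_scheme(
    overseer: Agent,
    distill: Callable[[Agent], Agent],
    amplify: Callable[[Agent], Agent],
    rounds: int,
) -> List[Agent]:
    """Alternate the two widgets and return the sequence of fast agents."""
    fast_agents = []
    for _ in range(rounds):
        fast = distill(overseer)   # H -> A: weaker than the overseer, but fast
        fast_agents.append(fast)
        overseer = amplify(fast)   # A -> H^A: stronger than A, but slow
    return fast_agents


# Toy usage: an "agent" is just an integer capability score.
# Distillation loses one point; amplification doubles the score.
print(iterate_scheme(3, lambda c: c - 1, lambda c: c * 2, rounds=3))
```

In the toy usage each fast agent is weaker than the overseer that trained it, yet the sequence of fast agents still improves because amplification gains more than distillation loses.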

Col­laps­ing the recursion

I’ve de­scribed an ex­plicit se­quence of in­creas­ingly ca­pa­ble agents. This is the most con­ve­nient frame­work for anal­y­sis, but ac­tu­ally im­ple­ment­ing a se­quence of dis­tinct agents might in­tro­duce sig­nifi­cant over­head. It also feels at odds with cur­rent prac­tice, such that I would be in­tu­itively sur­prised to ac­tu­ally see it work out.

In­stead, we can col­lapse the en­tire se­quence to a sin­gle agent:

In this ver­sion there is a sin­gle agent A which is si­mul­ta­neously be­ing trained and be­ing used to define a re­ward func­tion.

Alter­na­tively, we can view this as a se­quen­tial scheme with a strong ini­tial­iza­tion: there is a sep­a­rate agent at each time t, who over­sees the agent at time t+1, but each agent is ini­tial­ized us­ing the pre­vi­ous one’s state.

This ver­sion of the scheme is more likely to be effi­cient, and it feels much closer to a prac­ti­cal frame­work for RL. (I origi­nally sug­gested a similar scheme here.)

How­ever, in ad­di­tion to com­pli­cat­ing the anal­y­sis, it also in­tro­duces ad­di­tional challenges and risks. For ex­am­ple, if Hᴬ ac­tu­ally con­sults A, then there are unattrac­tive equil­ibria in which A ma­nipu­lates the re­ward func­tion, and the ma­nipu­lated re­ward func­tion re­wards ma­nipu­la­tion. Avert­ing this prob­lem ei­ther re­quires H to some­times avoid de­pend­ing on A, or else re­quires us to some­times run against an old ver­sion of A (a trick some­times used to sta­bi­lize self-play). Both of these tech­niques im­plic­itly rein­tro­duce the iter­a­tive struc­ture of the origi­nal scheme, though they may do so with lower com­pu­ta­tional over­head.

We will have an even more se­ri­ous prob­lem if our ap­proach to re­ward learn­ing re­lied on throt­tling the learn­ing al­gorithm. When we work with an ex­plicit se­quence of agents, we can en­sure that their ca­pa­bil­ities im­prove grad­u­ally. It’s not straight­for­ward to do some­thing analo­gous in the sin­gle agent case.

Over­all I think this ver­sion of the scheme is more likely to be prac­ti­cal. But it in­tro­duces sev­eral ad­di­tional com­pli­ca­tions, and I think it’s rea­son­able to start by con­sid­er­ing the ex­plicit se­quen­tial form un­til we have a solid grasp of it.
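As a rough illustration of the "run against an old version of A" trick mentioned above, the sketch below trains a single agent while the overseer only ever consults a lagged snapshot of it. `ToyAgent`, its `skill` attribute, the update rule, and the `overseer_reward` callback are all hypothetical stand-ins; the only point is the snapshot-and-refresh structure.

```python
import copy


class ToyAgent:
    """Stand-in for a learned policy; `skill` is a made-up scalar."""

    def __init__(self, skill=0.0):
        self.skill = skill

    def update(self, reward):
        self.skill += 0.1 * reward


def train_collapsed(agent, overseer_reward, steps, refresh_every):
    """Train `agent`; the overseer is assisted by an older frozen copy."""
    snapshot = copy.deepcopy(agent)      # old version of A used by the overseer
    refreshes = 0
    for step in range(1, steps + 1):
        # The reward comes from the overseer plus the *snapshot*, never the
        # live agent, so the live agent cannot directly steer its own reward.
        reward = overseer_reward(snapshot, agent)
        agent.update(reward)
        if step % refresh_every == 0:
            snapshot = copy.deepcopy(agent)   # periodically promote the agent
            refreshes += 1
    return agent, refreshes


agent, refreshes = train_collapsed(
    ToyAgent(), lambda helper, trainee: 1.0, steps=10, refresh_every=4
)
print(agent.skill, refreshes)
```

Each refresh plays the role of moving to the next agent in the explicit sequence, which is one way to see how this technique reintroduces the iterative structure.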


I’ll make two crit­i­cal claims about this con­struc­tion. Nei­ther claim has yet been for­mal­ized, and it’s not clear whether it will be pos­si­ble to for­mal­ize them com­pletely.

Claim #1: All of these agents are be­nign.

This is plau­si­ble by in­duc­tion:

  • The origi­nal ex­pert H is be­nign by defi­ni­tion.

  • If we start with a be­nign over­seer H, and have work­ing solu­tions to re­ward learn­ing + ro­bust­ness, then the trained agent A is be­nign.

  • If we start with a benign agent A, and have a working solution to capability amplification, then the amplified agent Hᴬ will be benign.

There are im­por­tant sub­tleties in this ar­gu­ment; for ex­am­ple, an agent may be be­nign with high prob­a­bil­ity, and the er­ror prob­a­bil­ity may in­crease ex­po­nen­tially as we pro­ceed through the in­duc­tion. Deal­ing with these sub­tleties will re­quire care­ful defi­ni­tions, and in some cases ad­just­ments to the al­gorithm. For ex­am­ple, in the case of in­creas­ing failure prob­a­bil­ities, we need to strengthen the state­ment of am­plifi­ca­tion to avoid the prob­lem.

Claim #2: The fi­nal agent has state-of-the-art perfor­mance.

This is plau­si­ble if our build­ing blocks satisfy sev­eral de­sir­able prop­er­ties.

First, capability amplification should be able to cross every non-maximal level of capability. That is, for every such level of capability, it is possible to start with an agent A who is below that level, and end up with an agent Hᴬ which is above that level:

For ev­ery pos­si­ble place we could put the dot­ted line — ev­ery pos­si­ble ca­pa­bil­ity level — there must be some agent A for whom the or­ange ar­row crosses that dot­ted line. Other­wise we would never be able to get to the other side of that dot­ted line, i.e. we would never be able to sur­pass that level of ca­pa­bil­ity.

Se­cond, ca­pa­bil­ity am­plifi­ca­tion should be mono­tonic (if A is at least as ca­pa­ble as B then Hᴬ should be at least as ca­pa­ble as Hᴮ).

Third, re­ward learn­ing should yield an agent whose ca­pa­bil­ities are at least the in­fi­mum of our RL al­gorithm’s ca­pa­bil­ities and the over­seer’s ca­pa­bil­ities, even if we train ro­bustly.

Now given a se­quence of in­creas­ingly pow­er­ful fast agents we can take the supre­mum of their ca­pa­bil­ities. Those agents will all be weaker than our RL al­gorithms and so the supre­mum is not the max­i­mal ca­pa­bil­ity, so we can con­sider a start­ing point from which ca­pa­bil­ity am­plifi­ca­tion would cross that supre­mum. By hy­poth­e­sis the se­quence must even­tu­ally cross this start­ing point, and at that point am­plifi­ca­tion will push it above the supre­mum (and re­ward learn­ing will keep it above the supre­mum). Mak­ing this ar­gu­ment care­fully shows that the supre­mum is the state of the art for RL al­gorithms and that we at­tain the supre­mum af­ter some finite num­ber of steps. (Though all of this is based on a leaky ab­strac­tion of “ca­pa­bil­ities.”)
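The convergence argument can be illustrated with a toy model, with the same caveat about leaky abstractions. Here a capability level is a set of task ids, so the infimum is set intersection. The task numbering, the `RL_CAN_LEARN` set, and the rule that amplification unlocks "the next task" are all assumptions invented for this sketch.

```python
RL_CAN_LEARN = frozenset(range(10))  # hypothetical: tasks our RL algorithm can learn


def amplify(caps):
    """Toy amplification: also solve the task just beyond each solvable one.

    This is monotonic: a larger input set always gives a larger output set.
    """
    return caps | frozenset(t + 1 for t in caps)


def distill(overseer_caps):
    """Toy reward learning: infimum of overseer and RL capabilities."""
    return overseer_caps & RL_CAN_LEARN


def rounds_to_converge(expert_caps, max_rounds=100):
    """Iterate amplify-then-distill until the fast agent stops improving."""
    agent = distill(expert_caps)
    for rounds in range(max_rounds):
        improved = distill(amplify(agent))
        if improved == agent:
            return agent, rounds
        agent = improved
    raise RuntimeError("did not converge within max_rounds")


final, rounds = rounds_to_converge(frozenset({0, 1, 2}))
print(sorted(final), rounds)
```

In this toy, the fast agent reaches everything the RL algorithm can learn after finitely many rounds and then stays there, because distillation caps it at `RL_CAN_LEARN`.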


I think this pro­posal will be most helpful if it im­poses min­i­mal ad­di­tional over­head. My main goal is to de­velop al­gorithms with sub­lin­ear over­head, i.e. for which the frac­tion of over­head con­verges to 0 as the un­der­ly­ing al­gorithms be­come stronger.

The cost of this scheme de­pends on the quan­ti­ta­tive prop­er­ties of our ba­sic build­ing blocks:

Fac­tor #1: How much do re­ward learn­ing and ro­bust­ness slow down train­ing?

Dur­ing RL, we need to eval­u­ate the agent A many times. If we want to use a learned re­ward func­tion we may need to eval­u­ate A more times. And if we want to train a policy which re­mains be­nign off of the train­ing dis­tri­bu­tion, we may need to eval­u­ate A more times (e.g. since we may need to do ad­ver­sar­ial train­ing). Ideally that over­head will shrink as our al­gorithms be­come more pow­er­ful.

I think this is plau­si­ble but far from cer­tain (for now it is un­cer­tain whether re­ward learn­ing and ro­bust­ness are even plau­si­ble). Some re­as­sur­ing fac­tors:

  • Reward learning / adversarial training can actually improve the performance of our system, so the computational time spent on them might be well spent even from a capabilities perspective.

  • The difficulty of the “ad­di­tional learn­ing prob­lem” we are try­ing to solve in each case (e.g. the con­cept of “defer to hu­man con­trol”) may not scale up lin­early with the com­plex­ity of the un­der­ly­ing do­main.

Factor #2: How many times do we have to invoke the overseer during training?

In ad­di­tion to call­ing the agent A, we will need to call the over­seer H in or­der to get in­for­ma­tion about the re­ward func­tion. Be­cause the over­seer is much more ex­pen­sive than the agent, we would like to min­i­mize the num­ber of times we call the over­seer. This can be quan­tified by the ra­tio be­tween the num­ber of calls to H and the num­ber of calls to A. For ex­am­ple, we may need to call H once for ev­ery hun­dred calls to A.

Factor #3: How expensive is capability amplification?

Ca­pa­bil­ity am­plifi­ca­tion is pos­si­ble only be­cause we al­low the agent Hᴬ to think for much longer than A. But “much longer” could rep­re­sent a range of val­ues: is Hᴬ a hun­dred times more ex­pen­sive to eval­u­ate than A? A thou­sand? A mil­lion?

Roughly speak­ing, fac­tors #2 and #3 should be mul­ti­plied to­gether to get the over­head from re­ward learn­ing: fac­tor #2 tells us how many times we have to call the over­seer, while fac­tor #3 tells us how ex­pen­sive the over­seer is.

The total overhead is thus (Factor #1) + (Factor #2) × (Factor #3). As an example, I’d be happy with values like 10% + 0.01% × 1000 = 20%.
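The arithmetic in that example can be checked with a one-line function. The parameter names are my own labels for the three factors, expressed as fractions rather than percentages.

```python
def total_overhead(training_slowdown, overseer_call_ratio, overseer_cost):
    """Overhead = Factor #1 + Factor #2 * Factor #3.

    training_slowdown:   extra cost from reward learning and robustness (Factor #1)
    overseer_call_ratio: calls to the overseer per call to the agent (Factor #2)
    overseer_cost:       cost of one overseer call relative to one agent call (Factor #3)
    """
    return training_slowdown + overseer_call_ratio * overseer_cost


# The example from the text: 10% + 0.01% * 1000, which comes to about 20%.
print(f"{total_overhead(0.10, 0.0001, 1000):.0%}")
```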

Factor #4: Do we need to train many separate agents?

If we need to use a sequence of N increasingly capable agents, then we would naively increase our training time by a factor of N. This would dominate the overhead, and in order for the scheme to be workable I think we would need to avoid it. I see a few plausible approaches:

  • We could use the col­lapsed ver­sion with a sin­gle agent.

  • We could use some other ini­tial­iza­tion or pa­ram­e­ter-shar­ing scheme to effec­tively reuse the com­pu­ta­tional work done in train­ing ear­lier agents.

  • The ear­lier agents could re­quire sig­nifi­cantly less train­ing time than the fi­nal agent, e.g. be­cause they are less ca­pa­ble. For ex­am­ple, if each agent takes only 20% as long to train as the fol­low­ing one, then the to­tal over­head is only 25%.

Th­ese mechanisms can work to­gether; for ex­am­ple, each agent may re­quire some amount of non-reusable com­pu­ta­tion, but that amount may be re­duced by a clever ini­tial­iza­tion scheme.
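The 25% figure in the last bullet is a geometric series: if each agent costs a fraction r of the next one, the earlier agents together cost r + r² + r³ + …, which tends to r / (1 − r) of the final agent's training time. A quick check, with hypothetical function and parameter names:

```python
def sequence_overhead(ratio, n_earlier):
    """Cost of the n_earlier earlier agents, relative to the final agent.

    `ratio` is how long each agent takes to train compared with the next one.
    """
    return sum(ratio ** k for k in range(1, n_earlier + 1))


def limiting_overhead(ratio):
    """Limit of the series above as the number of earlier agents grows."""
    return ratio / (1.0 - ratio)


for n in (1, 2, 5, 50):
    print(n, round(sequence_overhead(0.2, n), 4))
print("limit:", limiting_overhead(0.2))
```

With a ratio of 20% the overhead stays below 25% however long the sequence is; with a ratio near 1 it grows roughly linearly in N, which is the naive case.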


I’ve out­lined an ap­proach to AI con­trol for model-free RL. I think there is a very good chance, per­haps as high as 50%, that this ba­sic strat­egy can even­tu­ally be used to train be­nign state-of-the-art model-free RL agents. Note that this strat­egy also ap­plies to tech­niques like evolu­tion that have his­tor­i­cally been con­sid­ered re­ally bad news for con­trol.

That said, the scheme in this post is still ex­tremely in­com­plete. I have re­cently pri­ori­tized build­ing a prac­ti­cal im­ple­men­ta­tion of these ideas, rather than con­tin­u­ing to work out con­cep­tual is­sues. That does not mean that I think the con­cep­tual is­sues are worked out con­clu­sively, but it does mean that I think we’re at the point where we’d benefit from em­piri­cal in­for­ma­tion about what works in prac­tice (which is a long way from how I felt about AI con­trol 3 years ago!)

I think the largest tech­ni­cal un­cer­tainty with this scheme is whether we can achieve enough ro­bust­ness to avoid ma­lign be­hav­ior in gen­eral.

This scheme does not ap­ply to any com­po­nents of our sys­tem which aren’t learned end-to-end. The idea is to use this train­ing strat­egy for any in­ter­nal com­po­nents of our sys­tem which use model-free RL. In par­allel, we need to de­velop al­igned var­i­ants of each other al­gorith­mic tech­nique that plays a role in our AI sys­tems. In par­tic­u­lar, I think that model-based RL with ex­ten­sive plan­ning is a likely stick­ing point for this pro­gram, and so is a nat­u­ral topic for fur­ther con­cep­tual re­search.

This was origi­nally posted here on 19th March, 2017.
