Reliability amplification

In a re­cent post I talked about ca­pa­bil­ity am­plifi­ca­tion, a pu­ta­tive pro­ce­dure that turns a large num­ber of fast weak agents into a slower, stronger agent.

If we do this in a naive way, it will de­crease re­li­a­bil­ity. For ex­am­ple, if…

  • Our weak policy fails with prob­a­bil­ity 1%.

  • In or­der to im­ple­ment a strong policy we com­bine 10 de­ci­sions made by weak agents.

  • If any of these 10 de­ci­sions is bad, then so is the com­bi­na­tion.

…then the combination will be bad with roughly 10% probability (more precisely 1 − 0.99¹⁰ ≈ 9.6%, if the failures are independent).

Although the combination can be more powerful than any individual decision, in this case it is much less reliable. If we repeat capability amplification several times, our failure probability could quickly approach 1, even if it started out being exponentially small.
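The arithmetic above can be sketched in a few lines. This is only an illustration under assumptions that are mine rather than part of the argument: failures are independent, every round combines the same number of decisions, and the function names are hypothetical.

```python
def combined_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent decisions is bad."""
    return 1.0 - (1.0 - p) ** n


def failure_after_rounds(p: float, n: int, rounds: int) -> float:
    """Failure probability after repeatedly combining n decisions per round."""
    for _ in range(rounds):
        p = combined_failure(p, n)
    return p


if __name__ == "__main__":
    # Start from a 1% failure rate and combine 10 decisions per round.
    for rounds in range(4):
        print(rounds, failure_after_rounds(0.01, 10, rounds))
```

With these numbers the failure probability climbs from 1% to just under 10% after one round and is close to 1 after three rounds.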

Complementary to capability amplification is reliability amplification: given a policy H that usually works, can we implement a policy H⁺ that works significantly more reliably?

To be slightly less im­pre­cise (but still quite crude):

  • Given a distribution A over policies that is ε-close to a benign policy for some ε ≪ 1, can we implement a distribution A⁺ over policies which is δ-close to a benign policy of similar capability, for some δ ≪ ε?

When I say that A is ε-close to a benign policy, I mean that there is a benign distribution over policies which has total variation distance at most ε from A. We can view this as a simple mixture between a single aligned policy that is chosen with probability (1−ε) and an adversarial policy that is chosen with probability ε.
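To make the ε-closeness definition concrete, here is a minimal sketch of total variation distance for distributions over a finite set of policies. Representing a distribution as a dictionary from policy names to probabilities, and the names `aligned` and `adversarial`, are assumptions made purely for illustration.

```python
from typing import Dict


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two finite distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


eps = 0.01
# The mixture described above: aligned with probability 1 - eps, adversarial with eps.
A = {"aligned": 1.0 - eps, "adversarial": eps}
benign = {"aligned": 1.0}

# Equals eps up to floating-point rounding, so A is eps-close to benign.
distance = total_variation(A, benign)
```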

I will move freely be­tween talk­ing about “poli­cies” and “dis­tri­bu­tions over poli­cies;” we can turn a dis­tri­bu­tion into a sin­gle policy by av­er­ag­ing.

The no­tion of “im­ple­ment” here is ex­actly the same as for ca­pa­bil­ity am­plifi­ca­tion: we can use any num­ber of im­ple­men­ta­tions of A and what­ever other re­sources are at our dis­posal, and must com­pute A⁺’s re­sponses to a se­quence of ob­ser­va­tions, within some time limit such as a day.

Reli­a­bil­ity am­plifi­ca­tion via deliberation

I think there is one very nat­u­ral way to do re­li­a­bil­ity am­plifi­ca­tion — if some var­i­ant of this idea doesn’t work, I’m not sure what else to try.

As a warm-up, let’s con­sider the set­ting where the agents want to make a sin­gle de­ci­sion rather than a se­quence of de­ci­sions — we’ll re­turn to the se­quen­tial case soon.

We sam­ple three agents A, B, and C from the given dis­tri­bu­tion. We ask them to talk for a while and then each pro­pose an ac­tion. Then we ask them to do ap­proval vot­ing amongst the three op­tions, and ran­domly pick from the max­i­mally-ap­proved-of pro­pos­als.
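The voting step can be sketched as follows. This is a hypothetical illustration rather than the code from any actual implementation: the deliberation itself is not modeled, and the representation of approvals as sets of proposal indices is an assumption.

```python
import random
from typing import Sequence, Set


def approval_vote(
    proposals: Sequence[str],
    approvals: Sequence[Set[int]],
    rng: random.Random,
) -> str:
    """Pick uniformly at random among the maximally-approved proposals.

    approvals[i] is the set of proposal indices that agent i approves of.
    """
    counts = [
        sum(1 for ballot in approvals if j in ballot)
        for j in range(len(proposals))
    ]
    best = max(counts)
    winners = [j for j, count in enumerate(counts) if count == best]
    return proposals[rng.choice(winners)]


rng = random.Random(0)
# Two agents approve of proposal 0; the third approves only of its own.
chosen = approval_vote(["a", "b", "c"], [{0, 1}, {0}, {2}], rng)
```

Here proposal `a` has two approvals and the others have one each, so it is the unique winner and the random tie-break is never needed.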

If all three agents are aligned, then this procedure is clearly going to produce an aligned action. If two or more of the agents behave adversarially then we are clearly out of luck, but that happens only with probability O(ε²), so we can basically ignore it.

The interesting case is when there is one adversarial agent and two aligned agents. This case arises with probability about 3ε, so if we threw up our hands and picked randomly among the three proposals, we would have a failure probability of about ε, which is no better than a single agent. So the question is: can we do better than that?

The obvious hope is that aligned agents can do better-than-chance at recognizing each other, or at recognizing good actions. If they can get any non-negligible advantage, then we can get a non-negligible reduction in error probability (and by iterating the procedure we can potentially get large reductions).
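A crude model shows how a small advantage translates into a lower error rate. The assumptions here are mine and pessimistic: agents are adversarial independently with probability ε, the group always fails when two or more agents are adversarial, and when exactly one is adversarial the bad proposal wins with probability q (q = 1/3 is chance). Iterating also assumes the amplified policy can be resampled independently for the next round.

```python
def amplified_error(eps: float, q: float) -> float:
    """Failure probability of the three-agent vote under the crude model."""
    exactly_one_bad = 3 * eps * (1 - eps) ** 2
    two_or_more_bad = 3 * eps ** 2 * (1 - eps) + eps ** 3
    return exactly_one_bad * q + two_or_more_bad


def iterate(eps: float, q: float, rounds: int) -> float:
    """Apply the amplification step repeatedly."""
    for _ in range(rounds):
        eps = amplified_error(eps, q)
    return eps


if __name__ == "__main__":
    print(amplified_error(0.01, 1 / 3))  # picking at chance: no improvement
    print(amplified_error(0.01, 0.2))    # modest advantage: lower error
    print(iterate(0.01, 0.2, 20))        # repeated rounds compound the gain
```

In this model picking at chance gives ε + ε² − ε³, which is slightly worse than ε, while any q below 1/3 shrinks the error by roughly a factor of 3q per round once ε is small.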

The se­quen­tial case

So far we have talked about mak­ing a sin­gle de­ci­sion. The prob­lem is more challeng­ing when we need to make a se­quence of de­ci­sions.

We can try to sim­ply gen­er­al­ize the vot­ing ap­proach, run­ning a new vote for each ac­tion. To see the difficulty, sup­pose that the op­ti­mal policy looks as fol­lows:

  1. Gen­er­ate some hard-to-trans­fer in­ter­nal state X. Think of X as some­thing like an in­tu­itive vi­sion for how to ap­proach the situ­a­tion.

  2. Use X to gen­er­ate a first ac­tion.

  3. After mak­ing an ob­ser­va­tion, use X to gen­er­ate a sec­ond ac­tion. It is im­por­tant that the same state X be used in steps [2] and [3].

Now sup­pose that in the first step the three agents all fol­low this pro­to­col cor­rectly. They ar­rive at three differ­ent pro­posed ac­tions, which are backed up by three differ­ent vi­sions. Now we have to pick one of these ac­tions, effec­tively com­mit­ting our­selves to a sin­gle agent. This de­stroys our abil­ity to en­sem­ble effec­tively in fu­ture steps: if the cho­sen agent fails then we are out of luck, since the other two agents don’t have the vi­sion needed to fol­low-through on that ac­tion.

To get around this prob­lem we need to count on a stronger prop­erty: what­ever kind of pri­vate state X an agent could gen­er­ate, the agents need to be able to jointly de­velop a “con­sen­sus” pri­vate state which serves the same func­tion. Then if one agent fails, the other agents can also use the con­sen­sus pri­vate state in or­der to con­tinue act­ing in­tel­li­gently.

(It would also be OK if only pairs of agents could es­tab­lish such a con­sen­sus state, as long as the third agent was able to ar­bi­trate be­tween the two in the event of a fu­ture failure.)


As of com­mit 1cbaae, this ap­proval-vot­ing scheme is part of my ALBA im­ple­men­ta­tion on GitHub.

It is implemented in amplify_reliability, and is applied as part of amplify directly prior to applying HCH.

As a re­sult I’ve moved one is­sue from the “FIXME: will fail” column to the “TODO: needs im­prove­ment” column.

Three differ­ences with re­dun­dant computation

There is an easy recipe for fault-tol­er­ant com­pu­ta­tion: com­pute the re­sult 3 times, and take the ma­jor­ity vote. This in­creases the com­pu­ta­tional cost by a fac­tor of 3, but de­creases the er­ror rate from ε to O(ε²).
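The O(ε²) figure can be checked by enumerating which copies are corrupted. This sketch assumes independent corruption and the worst case in which corrupted copies all agree on the same wrong answer, so the vote fails exactly when a majority is corrupted.

```python
from itertools import product


def majority_error(eps: float, copies: int = 3) -> float:
    """Probability that a majority of independent copies are corrupted."""
    total = 0.0
    for pattern in product([False, True], repeat=copies):
        corrupted = sum(pattern)
        probability = eps ** corrupted * (1 - eps) ** (copies - corrupted)
        if 2 * corrupted > copies:
            total += probability
    return total


if __name__ == "__main__":
    eps = 0.01
    # For three copies the result is 3*eps**2 - 2*eps**3.
    print(majority_error(eps), 3 * eps ** 2 - 2 * eps ** 3)
```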

Our re­li­a­bil­ity am­plifi­ca­tion scheme is con­sid­er­ably more sub­tle, and if we were to for­mal­ize the re­li­a­bil­ity am­plifi­ca­tion prob­lem as stated we would find it im­pos­si­ble. This sec­tion touches on the most im­por­tant ad­di­tional sub­tleties.

Build­ing consensus

Re­dun­dant com­pu­ta­tion nor­mally re­lies on the ex­is­tence of a unique cor­rect an­swer on which all un­cor­rupted com­pu­ta­tions will agree. In our set­ting, there may be a range of “cor­rect” be­hav­iors. This in­tro­duces the ad­di­tional challenge of be­ing able to agree on a par­tic­u­lar cor­rect an­swer (and to build a con­sen­sus ver­sion of the hid­den state un­der­ly­ing that an­swer).

Without some as­sump­tion of this form, re­li­a­bil­ity am­plifi­ca­tion is im­pos­si­ble. For ex­am­ple, sup­pose that there are 99 pos­si­ble good an­swers and 1 bad an­swer. Our dis­tri­bu­tion A picks at ran­dom amongst the 100 pos­si­ble an­swers. If the copies of A with good an­swers are com­pletely un­able to agree with each other, then there may sim­ply be no way to tell the good an­swers from the bad an­swer. In this case, we can never re­duce the er­ror prob­a­bil­ity be­low 1%.

Balanced abilities

In order for our scheme to work we need to assume that the abilities of all of the agents are roughly balanced. I think this is probably necessary for any scheme to work. If you have an ε chance of a malicious superintelligence and a (1−ε) chance of a friendly human, you are probably out of luck: in a deliberative process involving two humans and a superintelligence, the superintelligence is likely to have disproportionate influence over the outcome even if it is wrong.

Our particular scheme fails outright in the context of powerful adversaries. In general there is roughly a 3ε chance that at least one of the agents is unaligned. We may be able to prevent the unaligned agent from causing the group to adopt a really bad answer. But even in the best case, the unaligned agent will probably be able to tilt the balance between the two options proposed by the aligned agents, allowing it to increase the probability of failure. If the unaligned agent is no smarter than the aligned agents, then the aligned agents can counteract this effect by preferentially picking the better action. But if the unaligned agent is much smarter than the aligned agents, then it can increase the probability of failures which are too subtle for the aligned agents to anticipate, and the aligned agents have no way to decrease the probability of these failures.

Less time to think

Traditionally redundant computation increases costs by a constant factor. But in the context of reliability amplification, we also make life harder for each agent: in addition to picking an action, they need to spend time arguing about actions and critiquing each other’s choices. This gives each agent less time to think about their action. It is not clear whether we can offset this effect.

Re­la­tion to ca­pa­bil­ity am­plifi­ca­tion and ALBA


Reli­a­bil­ity am­plifi­ca­tion seems com­pletely nec­es­sary if we want to do some­thing like ALBA us­ing imi­ta­tion or imi­ta­tion+RL, since oth­er­wise re­li­a­bil­ity will fall with each iter­a­tion of ca­pa­bil­ity am­plifi­ca­tion. Of course we could have a “ro­bust” ca­pa­bil­ity am­plifi­ca­tion pro­ce­dure which does not de­crease re­li­a­bil­ity. But de­sign­ing such a pro­ce­dure is strictly harder than re­li­a­bil­ity am­plifi­ca­tion. So I think it makes a lot of sense to split up the prob­lem into two pieces.

If we are working with approval-direction and pure RL, there is actually a form of reliability amplification already baked in: if the overseer fails with probability 1%, then this only changes the reward function by 0.01, and an RL agent should still avoid highly undesirable actions. That said, reliability amplification may still be necessary in a pure RL setup if we can’t solve the RL problem to arbitrary precision. In that case we may always have some non-negligible probability of making a bad decision, and after capability amplification this probability could become too large.

Balanc­ing ca­pa­bil­ity/​reliability

Reli­a­bil­ity am­plifi­ca­tion de­creases our agent’s ca­pa­bil­ity but in­creases its re­li­a­bil­ity. Ca­pa­bil­ity am­plifi­ca­tion in­creases ca­pa­bil­ity and de­creases re­li­a­bil­ity.

The hope is that we can some­how put these pieces to­gether in a way that ends up in­creas­ing both re­li­a­bil­ity and ca­pa­bil­ity.

If our re­li­a­bil­ity am­plifi­ca­tion step achieves a su­per­lin­ear re­duc­tion in er­ror prob­a­bil­ity from ε to o(ε), and our ca­pa­bil­ity am­plifi­ca­tion causes a lin­ear in­crease from ε to Θ(ε), then this seems al­most guaran­teed to work.

To see this, con­sider the ca­pa­bil­ity de­crease from re­li­a­bil­ity am­plifi­ca­tion. We know that for large enough N, N iter­a­tions of ca­pa­bil­ity am­plifi­ca­tion will more than offset this ca­pa­bil­ity de­crease. This N is a con­stant which is in­de­pen­dent of the ini­tial er­ror rate ε, and hence the to­tal effect of N iter­a­tions is to in­crease the er­ror rate to Θ(ε). For suffi­ciently small ε, this is more than offset by the ε → o(ε) re­li­a­bil­ity im­prove­ment from re­li­a­bil­ity am­plifi­ca­tion. So for suffi­ciently small ε we can in­crease both re­li­a­bil­ity and ca­pa­bil­ity.

A re­duc­tion from ε to O(ε²) is ba­si­cally the “best case” for re­li­a­bil­ity am­plifi­ca­tion, cor­re­spond­ing to the situ­a­tion where two al­igned agents can always reach cor­rect con­sen­sus. In gen­eral, al­igned agents will have some im­perfect abil­ity to reach con­sen­sus and to cor­rectly de­tect bad pro­pos­als from a mal­i­cious agent. In this set­ting, we are more likely to have an ε → O(ε) re­duc­tion. Hope­fully the con­stant can be very good.

There are also lower bounds on the achiev­able re­li­a­bil­ity ε de­rived from the re­li­a­bil­ity of the hu­man and of our learn­ing pro­ce­dures.

So in fact re­li­a­bil­ity am­plifi­ca­tion will in­crease re­li­a­bil­ity by some fac­tor R and de­crease ca­pa­bil­ity by some in­cre­ment Δ, while ca­pa­bil­ity am­plifi­ca­tion de­creases re­li­a­bil­ity by some fac­tor R′ and in­creases ca­pa­bil­ity by some in­cre­ment Δ′. Our hope is that there ex­ists some ca­pa­bil­ity am­plifi­ca­tion pro­ce­dure with Δ′/​log(R′) > Δ/​log(R), and which is effi­cient enough to be used as a re­ward func­tion for semi-su­per­vised RL.
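One way to read this condition is as asking for a ratio of capability steps to reliability steps at which both quantities improve. The sketch below searches for such a schedule; it assumes, beyond what is stated above, that the factors and increments are constant, that R and R′ exceed 1, that Δ and Δ′ are positive, and that steps simply compose. The function and parameter names are hypothetical.

```python
import math
from typing import Optional, Tuple


def find_schedule(
    R: float,
    delta: float,
    R_prime: float,
    delta_prime: float,
    max_steps: int = 100,
) -> Optional[Tuple[int, int]]:
    """Find (m, n): m reliability steps plus n capability steps that
    together increase both reliability and capability, or None."""
    # The condition from the text: delta'/log(R') > delta/log(R).
    if delta_prime / math.log(R_prime) <= delta / math.log(R):
        return None
    for m in range(1, max_steps + 1):
        for n in range(1, max_steps + 1):
            gains_capability = n * delta_prime > m * delta
            gains_reliability = m * math.log(R) > n * math.log(R_prime)
            if gains_capability and gains_reliability:
                return m, n
    return None


# Reliability step: 10x more reliable, costs 1 unit of capability.
# Capability step: 2x less reliable, gains 1 unit of capability.
schedule = find_schedule(R=10.0, delta=1.0, R_prime=2.0, delta_prime=1.0)
```

In this example one reliability step followed by two capability steps gains one unit of capability while still improving reliability by a factor of 10/4.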

I think that this con­di­tion is quite plau­si­ble but definitely not a sure thing; I’ll say more about this ques­tion in fu­ture posts.


A large com­pu­ta­tion is al­most guaran­teed to ex­pe­rience some er­rors. This poses no challenge for the the­ory of com­put­ing be­cause those er­rors can be cor­rected: by com­put­ing re­dun­dantly we can achieve ar­bi­trar­ily low er­ror rates, and so we can as­sume that even ar­bi­trar­ily large com­pu­ta­tions are es­sen­tially perfectly re­li­able.

A long de­liber­a­tive pro­cess is similarly guaran­teed to ex­pe­rience pe­ri­odic er­rors. Hope­fully, it is pos­si­ble to use a similar kind of re­dun­dancy in or­der to cor­rect these er­rors. This ques­tion is sub­stan­tially more sub­tle in this case: we can still use a ma­jor­ity vote, but here the space of op­tions is very large and so we need the ad­di­tional step of hav­ing the cor­rect com­pu­ta­tions ne­go­ti­ate a con­sen­sus.

If this kind of re­li­a­bil­ity am­plifi­ca­tion can work, then I think that ca­pa­bil­ity am­plifi­ca­tion is a plau­si­ble strat­egy for al­igned learn­ing. If re­li­a­bil­ity am­plifi­ca­tion doesn’t work well, then cas­cad­ing failures could well be a fatal prob­lem for at­tempts to define a pow­er­ful al­igned agent as a com­po­si­tion of weak al­igned agents.

This was origi­nally posted here on 20th Oc­to­ber, 2016.

The next post in this se­quence will be ‘Se­cu­rity Am­plifi­ca­tion’ by Paul Chris­ti­ano, on Satur­day 2nd Feb.