Security amplification

An ap­par­ently al­igned AI sys­tem may nev­er­the­less be­have badly with small prob­a­bil­ity or on rare “bad” in­puts. The re­li­a­bil­ity am­plifi­ca­tion prob­lem is to re­duce the failure prob­a­bil­ity of an al­igned AI. The analo­gous se­cu­rity am­plifi­ca­tion prob­lem is to re­duce the prevalence of bad in­puts on which the failure prob­a­bil­ity is un­ac­cept­ably high.

We could mea­sure the prevalence of bad in­puts by look­ing at the prob­a­bil­ity that a ran­dom in­put is bad, but I think it is more mean­ingful to look at the difficulty of find­ing a bad in­put. If it is ex­po­nen­tially difficult to find a bad in­put, then in prac­tice we won’t en­counter any.

If we could transform a policy in a way that multiplicatively increases the difficulty of finding a bad input, then by interleaving that process with a distillation step like imitation or RL we could potentially train policies which are as secure as the learning algorithms themselves, eliminating any vulnerabilities introduced by the starting policy.

For so­phis­ti­cated AI sys­tems, I cur­rently be­lieve that meta-ex­e­cu­tion is a plau­si­ble ap­proach to se­cu­rity am­plifi­ca­tion. (ETA: I still think that this ba­sic ap­proach to se­cu­rity am­plifi­ca­tion is plau­si­ble, but it’s now clear that meta-ex­e­cu­tion on its own can’t work.)


There are many in­puts on which any par­tic­u­lar im­ple­men­ta­tion of “hu­man judg­ment” will be­have sur­pris­ingly badly, whether be­cause of trick­ery, threats, bugs in the UI used to elicit the judg­ment, snow-crash-style weird­ness, or what­ever else. (The ex­pe­rience of com­puter se­cu­rity sug­gests that com­pli­cated sys­tems typ­i­cally have many vuln­er­a­bil­ities, both on the hu­man side and the ma­chine side.) If we ag­gres­sively op­ti­mize some­thing to earn high ap­proval from a hu­man, it seems likely that we will zoom in on the un­rea­son­able part of the space and get an un­in­tended re­sult.

What’s worse, this flaw seems to be in­her­ited by any agent trained to imi­tate hu­man be­hav­ior or op­ti­mize hu­man ap­proval. For ex­am­ple, in­puts which cause hu­mans to be­have badly would also cause a com­pe­tent hu­man-imi­ta­tor to be­have badly.

The point of se­cu­rity am­plifi­ca­tion is to re­move these hu­man-gen­er­ated vuln­er­a­bil­ities. We can start with a hu­man, use them to train a learn­ing sys­tem (that in­her­its the hu­man vuln­er­a­bil­ities), use se­cu­rity am­plifi­ca­tion to re­duce these vuln­er­a­bil­ities, use the re­sult to train a new learn­ing sys­tem (that in­her­its the re­duced set of vuln­er­a­bil­ities), ap­ply se­cu­rity am­plifi­ca­tion to re­duce those vuln­er­a­bil­ities fur­ther, and so on. The agents do not nec­es­sar­ily get more pow­er­ful over the course of this pro­cess — we are just win­now­ing away the idiosyn­cratic hu­man vuln­er­a­bil­ities.

This is important, if possible, because it (1) lets us train more secure systems, which is good in itself, and (2) allows us to use weak aligned agents as reward functions for an extensive search. I think that for now this is one of the most plausible paths to capturing the benefits of extensive search without compromising alignment.

Se­cu­rity am­plifi­ca­tion would not be di­rectly us­able as a sub­sti­tute for in­formed over­sight, or to pro­tect an over­seer from the agent it is train­ing, be­cause in­formed over­sight is needed for the dis­til­la­tion step which al­lows us to iter­ate se­cu­rity am­plifi­ca­tion with­out ex­po­nen­tially in­creas­ing costs.

Note that se­cu­rity am­plifi­ca­tion + dis­til­la­tion will only re­move the vuln­er­a­bil­ities that came from the hu­man. We will still be left with vuln­er­a­bil­ities in­tro­duced by our learn­ing pro­cess, and with any in­her­ent limits on our model’s abil­ity to rep­re­sent/​learn a se­cure policy. So we’ll have to deal with those prob­lems sep­a­rately.
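The accounting above can be sketched as a toy model. Everything in it is an assumption made for illustration: a policy is a table from vulnerability names to made-up attack costs, amplification doubles every cost, and distillation reintroduces the learner's own vulnerabilities at a fixed cost. It is not a claim about how real amplification or distillation behave.

```python
# Hypothetical vulnerability that the learning algorithm itself introduces.
LEARNER_VULNS = {"adversarial-example": 50}

def amplify(policy, factor=2):
    """Toy amplification: multiply the attack cost of every inherited vulnerability."""
    return {name: cost * factor for name, cost in policy.items()}

def distill(policy):
    """Toy distillation: the student inherits the teacher's vulnerabilities,
    and the learner's own vulnerabilities reappear at their fixed cost."""
    student = dict(policy)
    student.update(LEARNER_VULNS)  # overwrites any amplified learner entry
    return student

def security(policy):
    """Security is the cost of the cheapest available attack."""
    return min(policy.values())

# Hypothetical starting costs for two human-derived vulnerabilities.
human = {"magic-phrase": 10, "compelling-argument": 20}

policy = distill(human)
initial_security = security(policy)
for _ in range(5):
    policy = distill(amplify(policy))
final_security = security(policy)
```

In this toy model the human-derived entries grow geometrically while the learner's entry stays put, so after a few rounds the cheapest attack is the one introduced by the learning process.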

Towards a definition

The se­cu­rity am­plifi­ca­tion prob­lem is to take as given an im­ple­men­ta­tion of a policy A, and to use it (along with what­ever other tools are available) to im­ple­ment a sig­nifi­cantly more se­cure policy A⁺.

Some clar­ifi­ca­tions:

  • “im­ple­ment:” This has the same mean­ing as in ca­pa­bil­ity am­plifi­ca­tion or re­li­a­bil­ity am­plifi­ca­tion. We are given an im­ple­men­ta­tion of A that runs in a sec­ond, and we have to im­ple­ment A⁺ over the course of a day.

  • “se­cure”: We can mea­sure the se­cu­rity of a policy A as the difficulty of find­ing an in­put on which A be­haves badly. “Be­haves badly” is slip­pery and in re­al­ity we may want to use a do­main-spe­cific defi­ni­tion, but in­tu­itively it means some­thing like “fails to do even roughly what we want.”

  • “more se­cure:” Given that difficulty (and hence se­cu­rity) is not a scalar, “more se­cure” is am­bigu­ous in the same way that “more ca­pa­ble” is am­bigu­ous. In the case of ca­pa­bil­ity am­plifi­ca­tion, we need to show that we could am­plify ca­pa­bil­ity in ev­ery di­rec­tion. Here we just need to show that there is some no­tion of difficulty which is sig­nifi­cantly in­creased by ca­pa­bil­ity am­plifi­ca­tion.

  • “sig­nifi­cantly more se­cure”: We would like to reach very high de­grees of se­cu­rity af­ter a re­al­is­tic num­ber of steps. This re­quires an ex­po­nen­tial in­crease in difficulty, i.e. for each step to mul­ti­plica­tively in­crease the difficulty of an at­tack. This is a bit sub­tle given that difficulty isn’t a scalar, but in­tu­itively it should take “twice as long” to at­tack an am­plified sys­tem, rather than tak­ing a con­stant ad­di­tional amount of work.

  • Security amplification is probably only possible when the initial system is sufficiently secure: if random inputs cause the system to fail with significant probability, then we are likely to be out of luck. This is analogous to reliability amplification, which is only possible when the initial system is sufficiently reliable. Under the intended interpretation of “security,” humans are relatively secure; we can implement a policy H which is relatively hard to exploit (e.g. which humans aren’t capable of reliably exploiting). So humans suffice to get the ball rolling.
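To see why the “significantly more secure” clarification asks for a multiplicative rather than additive increase, here is a small calculation. The starting cost, target cost, and per-step gains are arbitrary numbers chosen for illustration, and difficulty is treated as a scalar even though, as noted above, it isn't one.

```python
def steps_to_reach(target, start, grow):
    """Count amplification steps until the attack difficulty reaches the target."""
    difficulty, steps = start, 0
    while difficulty < target:
        difficulty = grow(difficulty)
        steps += 1
    return steps

TARGET = 2 ** 20  # hypothetical "secure enough" attack cost
START = 1         # hypothetical attack cost against the initial policy

# Each step doubles the attacker's work.
multiplicative_steps = steps_to_reach(TARGET, START, lambda d: 2 * d)

# Each step adds a constant 1000 units of work for the attacker.
additive_steps = steps_to_reach(TARGET, START, lambda d: d + 1000)
```

With these numbers, doubling reaches the target in 20 steps while the additive scheme needs 1049, and the gap widens as the target grows.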

Ca­pa­bil­ity am­plifi­ca­tion vs. se­cu­rity amplification

If we in­ter­pret “ca­pa­bil­ity” broadly, then ca­pa­bil­ity am­plifi­ca­tion sub­sumes se­cu­rity am­plifi­ca­tion. More­over, I ex­pect the two prob­lems to be solved by the same mechanism (un­like re­li­a­bil­ity am­plifi­ca­tion, which prob­a­bly re­quires some­thing com­pletely differ­ent). So in some sense it is most nat­u­ral to think of ca­pa­bil­ity and se­cu­rity am­plifi­ca­tion as a sin­gle prob­lem.

But I think that se­cu­rity am­plifi­ca­tion has differ­ent im­pli­ca­tions, may re­quire a differ­ent style of anal­y­sis, and may be pos­si­ble or im­pos­si­ble in­de­pen­dently of other parts of ca­pa­bil­ity am­plifi­ca­tion. And in gen­eral I think it is good prac­tice to try to split up a strong claim into sev­eral weaker claims, even if af­ter es­tab­lish­ing each of the weaker claims you will just have to prove a com­pletely new the­o­rem that gen­er­al­izes all of them. The weaker claims give us a nice test­ing ground in which to find and re­solve some of the difficul­ties be­fore hav­ing to con­front the whole prob­lem.

Se­cu­rity am­plifi­ca­tion by meta-execution

I am hope­ful that se­cu­rity can be am­plified by some­thing like meta-ex­e­cu­tion.

In meta-ex­e­cu­tion, the meta-ex­ecu­tor never di­rectly looks at the whole sys­tem’s in­put. In­stead, it looks at small parts of the in­put in iso­la­tion, and de­cides for it­self how to build those up into a rep­re­sen­ta­tion of the in­put.

Similarly, the meta-ex­ecu­tor never di­rectly ex­pe­riences the en­tire rea­son­ing pro­cess; the ac­tual state of the rea­son­ing pro­cess is main­tained by a large num­ber of meta-ex­ecu­tors work­ing in par­allel, and each one is re­spon­si­ble for im­ple­ment­ing a small part of the rea­son­ing pro­cess.

With the ex­cep­tion of the meta-ex­ecu­tors who di­rectly ob­serve small pieces of the in­put, the ac­tual in­puts pro­vided to a meta-ex­ecu­tor are the product of two forces:

  1. The in­put to the over­all sys­tem.

  2. The optimization applied by the meta-executor as it processes that input. The goal of the meta-executor is to ensure that it will be able to correctly handle everything that it sees during the meta-execution, so it is trying to avoid creating states that would exploit vulnerabilities of the meta-executor.

The hope is that an ad­ver­sary who con­trols [1] and wants to cause the meta-ex­ecu­tor to be­have badly must do ad­di­tional work in or­der to over­come fac­tor [2].

Realistically, I think that the largest question about this procedure is whether meta-execution can work at all while preserving alignment. If meta-execution can work at all, then I think that there is a reasonable chance that it can also improve security.

That said, even if meta-execution works, it will probably be pretty hard to tell whether it actually improves security.

As with ca­pa­bil­ity am­plifi­ca­tion, I think that our goal should be to ei­ther con­vince our­selves that se­cu­rity am­plifi­ca­tion works, or else to iden­tify an ob­struc­tion: a type of vuln­er­a­bil­ity that doesn’t be­come sig­nifi­cantly harder to ex­ploit af­ter a step of se­cu­rity am­plifi­ca­tion.

The rest of this post will try to ex­plain why I think it is plau­si­ble that meta-ex­e­cu­tion could make vuln­er­a­bil­ities more difficult to ex­ploit.


It seems helpful to con­sider a bunch of par­tic­u­lar ex­am­ples of vuln­er­a­bil­ities and think about how they be­have un­der meta-ex­e­cu­tion.

I’ll discuss a few examples here: a “magic phrase” which causes the agent A to fail badly when encountered, an “unreasonably compelling argument” which A finds compelling even though it should not, and an “appeal to a broken intuition” which exploits a single step of A’s reasoning that cannot be broken down further or abstracted. I won’t be able to argue convincingly that meta-execution could address any of these hypothetical vulnerabilities. But hopefully I can explain why it seems plausible to me that it would.

(Warn­ing: the rest of the post is some­what in­for­mal and ram­bling.)

Ex­am­ple: magic phrase

Setup. Sup­pose that the agent A fails when­ever it sees the phrase “not play­ing soli­taire.”

If we had time to do enough ca­pa­bil­ity am­plifi­ca­tion, then the am­plified agent may even­tu­ally be able to dis­cover this vuln­er­a­bil­ity it­self and ex­plic­itly coun­ter­act it. This is not valid se­cu­rity am­plifi­ca­tion though; it does not help us am­plify the se­cu­rity of a low-ca­pa­bil­ity agent A in a rea­son­able amount of time. To am­plify se­cu­rity we need to be able to fix vuln­er­a­bil­ities with­out dis­cov­er­ing what they are (both be­cause we oc­cupy the po­si­tion of the in­te­rior, and be­cause we want to be se­cure against ad­ver­saries smarter than we are).

To make the prob­lem harder we could as­sume that A ac­tu­ally fails when­ever it merely thinks of the phrase “not play­ing soli­taire,” and that there is no ad­vance warn­ing or abil­ity to see what is com­ing. For ex­am­ple, if the let­ters were re­vealed one at a time to A, then noth­ing might hap­pen un­til “not play­ing so…” is re­vealed, at which point the agent might imag­ine sev­eral pos­si­ble com­ple­tions, chance across “not play­ing soli­taire,” and then im­me­di­ately fail.

Cause for optimism. If a string including “not playing solitaire” is given to the meta-executed system, then A won’t directly observe it. Instead A must process a message like “the string with list of characters [x].” A could parse this string into words, obtaining a message like “the sentence with list of words [x],” where each word is represented by a copy of A who has seen that word and can answer questions about it. Then A could parse this sentence to form a representation of its meaning, e.g. as “the assertion that [x] satisfies property [y],” with y pointing to “does not satisfy the property [z]”, z pointing to “is currently performing action [w]” and so on.

In this way, it seems quite plau­si­ble that a sen­tence could be “un­der­stood” with­out any in­di­vi­d­ual agent ac­tu­ally hav­ing to look at or even imag­ine the se­quence of let­ters it con­tains.
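A minimal sketch of this bookkeeping, under heavy simplification: the class, its methods, and the purely mechanical word splitting are all hypothetical stand-ins, and a real meta-executor would decide for itself how to decompose the input. The only point is to track what each individual copy of the agent directly observes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Ref:
    """Opaque pointer to a sub-message held by another copy of the agent."""
    ident: int

class MetaExecution:
    """Toy record of who looked at what (all names here are hypothetical)."""

    def __init__(self):
        self.store = {}  # ident -> payload behind each pointer
        self.views = []  # every raw string a single agent copy observed directly

    def put(self, payload):
        ref = Ref(len(self.store))
        self.store[ref.ident] = payload
        return ref

    def look(self, text):
        """One agent copy directly observes one small piece of text."""
        self.views.append(text)
        return text

    def parse(self, raw):
        # Assumption: splitting is mechanical, so each copy sees a single word.
        word_refs = [self.put(self.look(word)) for word in raw.split()]
        # The sentence-level copy handles only pointers, never the letters.
        return self.put(("sentence with list of words", word_refs))

meta = MetaExecution()
sentence = meta.parse("we are not playing solitaire today")
largest_view = max(meta.views, key=len)
phrase_seen = any("not playing solitaire" in view for view in meta.views)
```

Here the largest thing any single copy looks at is one word, and no entry in `views` contains the full phrase, even though the system as a whole holds a representation of the sentence.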

Of course we do have to look at a whole bunch of words in the pro­cess of pars­ing a sen­tence, and we will have to look at many more if we want to ac­tu­ally do any­thing with it. So why think the situ­a­tion is any bet­ter?

By hy­poth­e­sis the agent doesn’t know about the magic words and so can’t ac­tively avoid them; but the agent can treat its in­put with skep­ti­cism and try to pre­vent an ad­ver­sary from hav­ing un­due in­fluence over the ex­act sen­tences that it pro­duces. To that end, the agent is free to be­have stochas­ti­cally.

If successful, randomization might make it hard to force the agent to say any particular really unusual phrase with significant probability. (If the magic words aren’t an unusual phrase, then it seems fair for us to give up, based on the initial agent not clearing the required initial level of security. E.g. if the agent fails whenever it sees or thinks the phrase “one more,” we probably can’t really help it.)

Hope­fully meta-ex­e­cu­tion has shifted the at­tacker’s prob­lem from “say the magic phrase” to “en­g­ineer a situ­a­tion which will cause the meta-ex­ecu­tor to say the magic phrase with rel­a­tively high prob­a­bil­ity de­spite its efforts to be­have ran­domly.” It seems likely that this lat­ter prob­lem is quite a bit harder, sug­gest­ing that a magic phrase is not a plau­si­ble ob­struc­tion in the sense de­scribed above.

Other re­marks. Of course, over the course of a larger com­pu­ta­tion it be­comes more likely that a magic phrase will ap­pear at one point or an­other, even if the at­tacker isn’t able to en­g­ineer a de­ci­sive at­tack. But as long as the magic phrase is not ac­tu­ally likely to ap­pear, then this prob­a­bil­ity can hope­fully be driven back to­wards zero with re­li­a­bil­ity am­plifi­ca­tion. We only need to pre­vent at­tacks that suc­ceed with sig­nifi­cant prob­a­bil­ity over the agent’s stochas­tic­ity.

This pro­cess may sound un­rea­son­ably ex­pen­sive. For ex­am­ple, if pars­ing a thou­sand word mes­sage, would we need to break it down into in­di­vi­d­ual words and parse them one by one ex­tremely care­fully?

The point of iter­at­ing se­cu­rity am­plifi­ca­tion is that we don’t have to do this all in one go. When pars­ing a thou­sand word mes­sage we can break it down into sen­tences and feed whole sen­tences to the agent. When pro­cess­ing sen­tences we can break them down into words. In gen­eral, the smaller the task the more care­ful we can be.

Ex­am­ple: un­rea­son­ably com­pel­ling arguments

Setup. Sup­pose that we want to build a sys­tem for eval­u­at­ing ar­gu­ments about what to do. We think that our start­ing agent A is pretty good — if it sees an ar­gu­ment made by a hu­man, we ex­pect its be­liefs to re­li­ably get more ac­cu­rate rather than less ac­cu­rate.

But now we would like to eval­u­ate ar­gu­ments made by more pow­er­ful in­tel­li­gences than hu­mans. We are con­cerned that some of these ar­gu­ments may be un­rea­son­ably com­pel­ling — they would cause us to be con­vinced with­out be­ing nor­ma­tively in­for­ma­tive. As an ex­treme case, there may be un­rea­son­ably com­pel­ling ar­gu­ments that could con­vince us to adopt es­sen­tially ar­bi­trary po­si­tions.

We cannot be protected from an unreasonably compelling argument by changing the way we handle individual words. At some point we are going to have to actually evaluate the argument, and that is where we’ll run into trouble.

Cause for op­ti­mism. In or­der for meta-ex­e­cu­tion to han­dle this prob­lem, we would need to be able to meta-ex­e­cute the ac­tual eval­u­a­tion of the ar­gu­ment. For ex­am­ple, rather than in­spect­ing a claimed syl­l­o­gism and con­sult­ing our in­tu­ition to de­ter­mine whether it seems valid, we would need to de­cide ab­stractly how to pro­cess a ques­tion like “does con­clu­sion [x] fol­low from premises [a] and [b]?” where all of x, a, and b are mes­sages rep­re­sent­ing parts of the ar­gu­ment.

Of course we could evaluate a proposed syllogism by simply unpacking all of its parts and consulting our intuition to determine whether it seems valid. The first question is: can we do anything more abstract, that doesn’t require looking directly at the whole input? The second question is: if we evaluate an argument in a more abstract way, are we actually more secure?

With re­spect to the first ques­tion: In gen­eral I be­lieve that we can come up with at-least-slightly ab­stract pro­ce­dures for eval­u­at­ing ar­gu­ments, which we be­lieve are more ac­cu­rate than a di­rect ap­peal to our in­tu­itions. Although it would ob­vi­ously be nice to have some con­vinc­ing the­o­ret­i­cal ac­count of the situ­a­tion, it looks like a largely em­piri­cal ques­tion. For­tu­nately, it’s an em­piri­cal ques­tion that can be an­swered in the short term rather than re­quiring us to wait un­til pow­er­ful AI sys­tems are available.

With re­spect to the sec­ond ques­tion: I think the key prop­erty of “un­rea­son­ably con­vinc­ing” ar­gu­ments is the fol­low­ing. Sup­pose that you tell me that I will hear an ar­gu­ment from source S, that I will eval­u­ate it cor­rectly (know­ing that it came from source S), and that I will then come to be­lieve X. After hear­ing this, I will sim­ply ac­cept X. An eval­u­a­tion of an ar­gu­ment seems in­cor­rect if, given a full un­der­stand­ing of the eval­u­a­tion pro­cess, I wouldn’t think that I should have been per­suaded.

Now sup­pose that I find some ar­gu­ment con­vinc­ing. And sup­pose that af­ter lightly ab­stract­ing my eval­u­a­tion pro­cess it still seems con­vinc­ing — that is, I look at a se­quence of steps like “I con­cluded that [x] fol­lowed from [a] and [b].” and I feel like, in light of that se­quence of steps, I was cor­rect to be con­vinced. It seems to me that then one of two things could be go­ing wrong:

  • One of these individual steps was wrong. That is, I asked “Does [x] follow from [a] and [b]?” and got back the answer “It sure does,” but only because this step had unreasonably convincing aspects inside of it. It seems like this problem can be fixed by further security amplification operating on the reasoning within a single step. (Just like we previously discussed breaking a paragraph into sentences, and then making the handling of sentences more secure by breaking sentences down into words.)

  • I was in­cor­rectly eval­u­at­ing the ab­stract ar­gu­ment — I was mis­led about whether that se­quence of steps should have been con­vinc­ing.

I think the sec­ond cat­e­gory is most in­ter­est­ing, be­cause it sug­gests the pos­si­bil­ity of a kind of fixed point. An at­tacker could con­struct an ar­gu­ment which con­vinces me, and such that when I look at an ab­stracted ver­sion of my eval­u­a­tion pro­cess I think that I ought to have been con­vinced, and when I look at an ab­stracted ver­sion of that eval­u­a­tion pro­cess, I think that it also was con­vinc­ing, and so on down the line.

If there is re­ally such a fixed point, then that would be an ob­struc­tion to se­cu­rity am­plifi­ca­tion. After any num­ber of iter­a­tions of se­cu­rity am­plifi­ca­tion, the difficulty of find­ing an at­tack would still be up­per bounded by the difficulty of find­ing this fixed point.

I am actually not quite sure what we should make of such a fixed point. Certainly it is conceivable that there could be a plainly wrong argument which had this character. But I have a really hard time constructing a plausible example, and this fixed point does quite strongly suggest that an argument is “convincing for endorsed reasons.”

(As men­tioned ear­lier, it seems much more likely that we sim­ply won’t be able to eval­u­ate ar­gu­ments us­ing meta-ex­e­cu­tion, and will be re­duced to eval­u­at­ing them di­rectly.)

If an ar­gu­ment con­tinues to be con­vinc­ing in light of care­ful meta-ex­e­cu­tion, I am not to­tally sure whether we should call it “un­rea­son­ably” con­vinc­ing. For now it seems like we need a bet­ter un­der­stand­ing of the situ­a­tion, but it is cer­tainly not ob­vi­ous that there is a prob­lem.

Ex­am­ple: ap­peal to a bro­ken intuition

Setup. When evaluating an argument or making a decision, we sometimes seem to make a “bare” appeal to intuition: we pose a question to ourselves and discover our intuitive response, without any ability to break the process down further or scrutinize the source of the intuition.

Meta-ex­e­cu­tion is only fea­si­ble if suffi­ciently “big” in­tu­itions can re­li­ably be bro­ken down into smaller pieces. But nev­er­the­less this pro­cess may bot­tom out at some scale, e.g. per­haps there are ques­tions that can be posed to a hu­man in twenty sec­onds, where the hu­man has an in­tu­itive an­swer, but can’t provide any ad­di­tional de­tail.

It may be that for the most part these in­tu­itive re­sponses are cor­rect, but that some of them are ex­tremely un­re­li­able. For ex­am­ple, in the pro­cess of con­sid­er­ing a thought ex­per­i­ment about simu­la­tions and tor­ture a per­son may freak out and give an ir­ra­tional an­swer.

If we want to get the benefits from con­sult­ing hu­man in­tu­ition, then it seems like an ad­ver­sary could set up a situ­a­tion where the most im­por­tant in­tu­itive queries ex­ploit some vuln­er­a­bil­ity.

Cause for op­ti­mism. Even if the in­ter­nal struc­ture of an in­tu­itive re­ac­tion is opaque, there are a num­ber of ways that the sur­round­ing ar­gu­ment can be changed in or­der to im­prove se­cu­rity:

  • We can query the agent on ran­dom var­i­ants of an in­put, with many de­tails varied, in or­der to avoid prob­lems spe­cific to a par­tic­u­lar for­mu­la­tion. In some sense we are left only with cases where the core in­tu­ition is it­self a vuln­er­a­bil­ity.

  • We can in­spect the in­put (and its mean­ing) ab­stractly, to search for any rea­sons why this par­tic­u­lar ap­peal to in­tu­ition may be un­re­li­able.

  • We can treat the agent’s re­ac­tion as just one piece of ev­i­dence that bears on a case. If we learn there is a thought ex­per­i­ment such that con­sid­er­ing it causes the agent to con­clude “do­ing X is ex­traor­di­nar­ily im­por­tant” then that con­sti­tutes com­pel­ling but not over­whelming ev­i­dence.

  • We can con­sider the agent’s re­sponse to a range of similar in­puts, and make in­fer­ences from those re­sponses in the ag­gre­gate.
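The first and last of these measures can be sketched together: query the agent on several distinct rewordings and aggregate by majority vote. The agent, its single bad wording, and the list of rewordings below are all invented for illustration; how to generate meaning-preserving variants of a real query is left open here.

```python
import random
from collections import Counter

def robust_query(agent, variants, n_samples, rng):
    """Ask the agent about several distinct rewordings and aggregate the answers."""
    sampled = rng.sample(variants, n_samples)  # sampling without replacement
    tally = Counter(agent(variant) for variant in sampled)
    answer, votes = tally.most_common(1)[0]
    return answer, votes / n_samples

def toy_agent(question):
    # Hypothetical vulnerability: one particular wording triggers a bad reaction.
    return "panic" if "simulated torture" in question else "decline"

# Hypothetical rewordings of one underlying question.
variants = [
    "Should we pay to prevent simulated torture?",  # the adversary's wording
    "Should we pay to stop harm inside a simulation?",
    "Is paying to protect emulated minds worthwhile?",
    "Would you fund safeguards for virtual beings?",
    "Ought we spend money shielding software people?",
]

direct = toy_agent(variants[0])
answer, agreement = robust_query(toy_agent, variants, 3, random.Random(0))
```

Because only one wording is bad and the three sampled rewordings are distinct, the majority answer is the unexploited one whichever rewordings are drawn. This protects only against formulation-specific problems; if the core intuition is itself the vulnerability, every variant fails together.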

If all of these tests are in­suffi­cient to pro­tect us from the pur­ported vuln­er­a­bil­ity, then again I think it’s not re­ally clear on what grounds we are de­scribing it as a “vuln­er­a­bil­ity.” It’s not ob­vi­ous that we should rest com­fortably at this point, but at least it seems plau­si­ble that we should.

Em­piri­cal tests

The discussion in the last section was very vague and intuitive, but fortunately the actual claims at issue seem to be empirically accessible. It is very easy to implement meta-execution using humans as the meta-executor. As a result:

  • We can just test whether we can evaluate arguments or make decisions abstractly in a way that seems at least as good as, and preferably better than, evaluating them directly.

  • We can actually pick a simple idea, and see whether a human meta-executor can abstractly make decisions without ever encountering that idea (even on adversarial inputs).

Mostly I think that many of these is­sues will be­come quite ob­vi­ous as we get some prac­ti­cal ex­pe­rience with meta-ex­e­cu­tion (and hope­fully it will also be­come clear how to get a bet­ter the­o­ret­i­cal han­dle on it).

Last sum­mer I ac­tu­ally spent a while ex­per­i­ment­ing with meta-ex­e­cu­tion as part of a metapro­gram­ming pro­ject dwim­mer. Over­all the ex­pe­rience makes me sig­nifi­cantly more op­ti­mistic about the kinds of claims in the post, though I ended up am­biva­lent about whether it was a prac­ti­cal way to au­to­mate pro­gram­ming in the short term. (I still think it’s pretty plau­si­ble, and one of the more promis­ing AI pro­jects I’ve seen, but that it definitely won’t be easy.)


We can at­tempt to quan­tify the se­cu­rity of a policy by ask­ing “how hard is it to find an in­put on which this policy be­haves badly?” We can then seek se­cu­rity am­plifi­ca­tion pro­ce­dures which make it harder to at­tack a policy.

I pro­pose meta-ex­e­cu­tion as a se­cu­rity am­plifi­ca­tion pro­to­col. I think that the sin­gle biggest un­cer­tainty is whether meta-ex­e­cu­tion can work at all, which is cur­rently an open ques­tion.

Even if meta-ex­e­cu­tion does work, it seems pretty hard to figure out whether it ac­tu­ally am­plifies se­cu­rity. I sketched a few types of vuln­er­a­bil­ity and tried to ex­plain why I think that meta-ex­e­cu­tion might help ad­dress these vuln­er­a­bil­ities, but there is clearly a lot of think­ing left to do.

If se­cu­rity am­plifi­ca­tion could work, I think it sig­nifi­cantly ex­pands the space of fea­si­ble con­trol strate­gies, offers a par­tic­u­larly at­trac­tive ap­proach to run­ning a mas­sive search with­out com­pro­mis­ing al­ign­ment, and makes it much more plau­si­ble that we can achieve ac­cept­able ro­bust­ness to ad­ver­sar­ial be­hav­ior in gen­eral.

This was first pub­lished here on 26th Oc­to­ber, 2016.

The next post in sequence will be released on Friday 8th Feb, and will be ‘Meta-execution’ by Paul Christiano.
