Useful Does Not Mean Secure

Brief summary of what I’m trying to do with this post:

  1. Contrast a “Usefulness”-focused approach to building AI with a “Security”-focused approach, and try to give an account of where security problems come from in AI.

  2. Show how marginal transparency improvements don’t necessarily improve things from the perspective of security.

  3. Describe what research is happening that I think is making progress from the perspective of security.

In this post I will be attempting to taboo the term ‘alignment’ and just talk about properties of systems. The below is not very original; I’m often just saying things in my own words that Paul and Eliezer have written, in large part to try to think through the considerations myself. My thanks to Abram Demski and Rob Bensinger for comments on a draft of this post, though this doesn’t mean they endorse the content.

Useful Does Not Mean Secure

This post grew out of a comment thread elsewhere. In that thread, Ray Arnold was worried that there is an uncanny valley in how good we are at understanding and building AI, where we can build AGI but not safe AGI. Rohin Shah replied, and I’ll quote from his reply:

Consider instead this worldview:

The way you build things that are useful and do what you want is to understand how things work and put them together in a deliberate way. If you put things together randomly, they either won’t work, or will have unintended side effects.

(This worldview can apply to far more than AI; e.g. it seems right in basically every STEM field. You might argue that putting things together randomly seems to work surprisingly well in AI, to which I say that it really doesn’t, you just don’t see all of the effort where you put things together randomly and it simply flat-out fails.)

The argument “it’s good for people to understand AI techniques better even if it accelerates AGI” is a very straightforward, non-clever consequence of this worldview.

[...]

Under the worldview I mentioned, the first-order effect of better understanding of AI systems is that you are more likely to build AI systems that are useful and do what you want.

A lot of things Rohin says in that thread make sense. But in this post, let me point to a different perspective on AI that I might consider, if I were to focus entirely on Paul Christiano’s model of greedy algorithms in part II of his post on what failure looks like. That perspective sounds something like this:

The way you build things that are useful and do what you want, when you’re in an environment with much more powerful optimisers than you, is to spend a lot of extra time making them secure against adversaries, over and above simply making them useful. This is so that the other optimisers cannot exploit your system to achieve their own goals.

If you build things that are useful, predictable, and don’t have bad side-effects, but are subject to far more powerful optimisation pressures than you, then by default the things you build will be taken over by other forces and end up not being very useful at all.

An important distinction about artificial intelligence research is that you’re not simply competing against other humans, where you have to worry about hackers, governments and political groups; the core goal of artificial intelligence research is the creation of much more powerful general optimisers than currently exist within humanity. This is a difference in kind from all other STEM fields.

Whereas normal software systems that aren’t built quite right are more likely to do dumb things or just break, when you make an AI system that isn’t exactly what you wanted, the system might be powerfully optimising for other targets in a way that has the potential to be highly adversarial. In discussions of AI alignment, Stuart Russell often likes to use the analogy that “building bridges that stay up” is entirely integrated with bridge building, not a distinct field. To extend the analogy a little, you might say the field of AI is unusual in that if you don’t quite build the bridge well enough, the bridge itself may actively seek out security vulnerabilities that would bring it down, hide them from your attention until it has the freedom to take the bridge down in one go, and then take out all the other bridges in the world.

Now, talk of AI necessarily blurs the line between ‘external optimisation pressures’ and ‘the system is useful and does what you want’, because the system itself is creating the new, powerful optimisation pressure that needs securing against. Paul’s post on what failure looks like talks about this, so I’ll quote it here:

Modern ML instantiates massive numbers of cognitive policies, and then further refines (and ultimately deploys) whatever policies perform well according to some training objective. If progress continues, eventually machine learning will probably produce systems that have a detailed understanding of the world, which are able to adapt their behavior in order to achieve specific goals.

Once we start searching over policies that understand the world well enough, we run into a problem: any influence-seeking policies we stumble across would also score well according to our training objective, because performing well on the training objective is a good strategy for obtaining influence.

How frequently will we run into influence-seeking policies, vs. policies that just straightforwardly pursue the goals we wanted them to? I don’t know.
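To make the structure of that worry concrete, here is a toy sketch (my own illustration, not from Paul’s post; the two policies and their scores are invented for the example). Selection only ever looks at the training score, so it cannot tell a policy that genuinely pursues the intended goal apart from one that performs well during training as an instrumental strategy:

```python
import random

# Toy illustration: a "policy" here is just a function that, given a context
# ("train" or "deploy"), returns its behaviour and its score on the training
# objective.

def aligned_policy(context):
    # Pursues the intended goal in both training and deployment.
    return {"behaviour": "pursue intended goal", "training_score": 1.0}

def influence_seeking_policy(context):
    # Performs well on the training objective *because* that is instrumentally
    # useful: it avoids being selected out. Off-distribution, it defects.
    if context == "train":
        return {"behaviour": "pursue intended goal", "training_score": 1.0}
    else:
        return {"behaviour": "seek influence", "training_score": 1.0}

candidates = [aligned_policy, influence_seeking_policy]

# Selection only ever sees training scores, so both policies look identical.
train_scores = [p("train")["training_score"] for p in candidates]
print(train_scores)          # [1.0, 1.0] -- the objective cannot tell them apart

selected = random.choice([p for p, s in zip(candidates, train_scores)
                          if s == max(train_scores)])
print(selected("deploy")["behaviour"])   # may be "seek influence"
```

The point is not that real training works like this two-policy toy, only that the training objective by itself provides no signal that distinguishes the two.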

You could take the position that, even though security work is not normally central to a field, this new kind of security work is already central to the field of AI, so increasing the ability to build ‘useful’ things will naturally require solving the novel security problems, and the field will get it right by default.

This is my understanding of Paul’s mainline expectation (based on his estimates here, and on his work being aimed at making useful / well-motivated AI, described here, here and in Rohin’s comment on that post), and also my understanding of Rohin’s mainline expectation (based on his estimates here). My understanding is that this still leaves a lot of value on the table from marginal work, so both of them work on the problem, but by default they expect the field to engage with this problem and do it well.

Restatement: In normal tech companies, there’s a difference between “making useful systems” and “making secure systems”. In the field of AI, “making useful systems” includes potentially building powerful adversaries, which involves novel security problems, so you might expect that executing the standard “make useful systems” playbook will end up solving the novel security problems.

For example, in a debate on instrumental convergence between various major AI researchers, this was also the position that Francesca Rossi took:

Stuart, I agree that it would be easy to build a coffee fetching machine that is not aligned to our values, but why would we do this? Of course value alignment is not easy, and still a research challenge, but I would make it part of the picture when we envision future intelligent machines.

However, Yann LeCun said something subtly different:

One would have to be rather incompetent not to have a mechanism by which new terms in the objective could be added to prevent previously-unforeseen bad behavior.

Yann is implicitly taking the stance that there will not be powerful adversarial pressures exploiting such unforeseen differences between the objective function and humanity’s values. His responses are of the kind “We wouldn’t do that” and “We would change it quickly when those problems arose”, but not “Here’s how you build a machine learning system that cannot be flawed in this way”. It seems to me that he does not expect there to be any further security concerns of the type discussed above. If I point out a way that your system would malfunction, it is sometimes okay to say “Oh, if anyone accidentally gives that input to the system, then we’ll see and fix any problems that occur”; but if your government computer system is not secure, then by the time you’ve noticed what’s happening, a powerful adversary is inside your system and taking actions against you.

(Though I should mention that I don’t think this is the crux of the matter for Yann. I think his key disagreement is that he thinks we cannot talk usefully about safe AGI design before we know how to build an AGI; he doesn’t think that prosaic AI alignment is in principle feasible or worth thinking about.)

In general, if you show me how an AI system is flawed and my response is to simply patch that particular problem and go back to relaxing, I am implicitly disbelieving that optimisation processes more powerful than human civilization will look for similar flaws and exploit them; otherwise my threat level would go up drastically.
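Here’s a minimal sketch of why “patch it when we see it” behaves differently against accidents than against adversaries (my own illustration; the flaws and numbers are made up). Reactive patching keeps pace with users who stumble into flaws at random, but an adversary only needs whichever flaw you haven’t patched yet:

```python
import random

# Toy sketch (my own illustration, made-up numbers): reactive patching vs. an
# adversary that searches for whatever you haven't patched yet.

FLAWS = set(range(100))      # hypothetical exploitable inputs, unknown to us
patched = set()

def accidental_user():
    # Stumbles onto inputs at random; occasionally hits a flaw by bad luck.
    return random.randrange(1000)

def adversary():
    # Deliberately searches for a flaw that has not been patched yet.
    remaining = FLAWS - patched
    return min(remaining) if remaining else None

# Patch-on-failure keeps pace with accidents: each time a flaw is triggered,
# we notice the malfunction and close that particular hole.
for _ in range(200):
    x = accidental_user()
    if x in FLAWS and x not in patched:
        patched.add(x)

print("flaws hit by accident and patched:", len(patched))
print("unpatched flaw the adversary finds immediately:", adversary())
# The adversary exploits an unpatched flaw on its first try, and by the time
# you notice, it has already acted; 'patch and relax' assumed no such search.
```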

To clarify what this worry looks like: advances in AGI are hopefully building systems that can scale to being as useful and intelligent as is physically feasible in our universe, with optimisation power way above that of human civilization. As the systems get smarter, you need to build more into them to make sure the smart parts can’t exploit the system for their own goals. This assumes an epistemic advantage, as Paul says in the Failure post:

Attempts to suppress influence-seeking behavior (call them “immune systems”) rest on the suppressor having some kind of epistemic advantage over the influence-seeker. Once the influence-seekers can outthink an immune system, they can avoid detection and potentially even compromise the immune system to further expand their influence. If ML systems are more sophisticated than humans, immune systems must themselves be automated. And if ML plays a large role in that automation, then the immune system is subject to the same pressure towards influence-seeking.

There’s a no­tion whereby if you take a use­ful ma­chine learn­ing sys­tem, and you just make it more pow­er­ful, what you’re es­sen­tially do­ing is in­creas­ing the in­tel­li­gence of the op­ti­mi­sa­tion forces pass­ing through it, in­clud­ing the ad­ver­sar­ial op­ti­mi­sa­tion forces. As you take the sys­tem and make it vastly su­per­in­tel­li­gent, your pri­mary fo­cus needs to be on se­cu­rity from ad­ver­sar­ial forces, rather than pri­mar­ily on mak­ing some­thing that’s use­ful. You’ve be­come an AI se­cu­rity ex­pert, not an AI use­ful­ness ex­pert. The im­por­tant idea is that AI sys­tems can break at higher lev­els of in­tel­li­gence, even if they’re cur­rently quite use­ful.

As I understand it, this sort of thing happened at Google, which started out as a group of computer-networking experts and then became security experts, because for a while the main changes they made to Google Search were to increase security and make it harder for people to game the PageRank system. The adversarial pressures on them have since hit terminal velocity and there probably won’t be any further increases, unless and until we build superintelligent AI (be it general or the relevant kind of narrow).

Marginal Transparency Does Not Mean Marginal Security

A key question in figuring out whether to solve this security problem via technical research (as opposed to global coordination) is whether a line of work differentially makes this sort of security against powerful optimisers easier to work on, or whether it simply makes the system more useful in ways that don’t account for the novel adversaries.

In the quote from Rohin above, and from other people in other places, it sounds to me like the case being made is that the system is generally becoming more understandable, so it’s both easier to build and easier to secure. All improvements in understanding are improvements in being able to ‘do useful things’. And indeed my general prior is that increasing understanding is always better. But one of the core arguments Nick Bostrom makes is that not all scientific insights are positive for a civilisation at its current level of wisdom, and that we need to be able to order which insights we get. There are many types of ‘increased understanding’, and it’s not the case that all of them are positive on the current margin.

Let me give a concrete example where more transparency increases usefulness but not security.

It’s some­times the case that, in pris­ons, the prison staff will at­tempt to get the pris­on­ers to help out with tasks, er­rands and main­te­nance around the prison. How­ever, if you send them into offices and parts of the prison but don’t watch what they do, then they may con­spire against the prison staff—build things that can be used as weapons against the guards, build tun­nels, use com­put­ers to con­tact the out­side world, etc.

Now, if you start looking at what the prisoners are doing when they’re in office block #17, and you have basic rewards/punishments (e.g. giving or removing access to special foods and entertainment), then you can reward them for doing good work there, and punish them for doing bad work there. You can also punish them for conspiring against you there.

However, this hasn’t really helped. If they were planning to build a tunnel in office block #17, they will now just build a tunnel in office block #18 instead. If I can see what they do all the time in some locations, then I can punish the ones that aren’t doing broadly the right work, and generally make sure they’re doing useful work. But this doesn’t mean that in the places I can’t look they’re not conspiring against me. It seems plausible to me that many improvements in transparency, while making it easier to make the prisoners do the prison’s work, do not make it harder to conspire against me; they just move around where the conspiring happens.

If you’re try­ing to del­e­gate and in­cen­tivise peo­ple to do labour for you, you might just think that you should bring in a good man­ager, to get peo­ple to do use­ful work. But in the prison situ­a­tion, you pri­mar­ily need to be­come a se­cu­rity ex­pert, over and above be­ing an ex­pert in how to man­age well. In this situ­a­tion, there are many im­prove­ments in trans­parency of what they’re do­ing that helps force them to do use­ful work, that doesn’t se­cure them against con­spiring with each other to break out of the sys­tem.

With machine learning systems, we already have all the weights of the neural networks to look at, so the system is maximally transparent: we can see everything, for certain values of ‘see’. I think the relevant question is on what level you can understand what’s going on. As we get higher-level understanding, we can maybe start to figure out whether the system is doing certain bad things, or certain good things, and punish/reward those. But just because you’re making sure that the process will do something useful (e.g. invest money, run a hospital, classify images) doesn’t mean you know that this will lead to the kind of full understanding which ensures adversarial work can’t simply be moved to areas that are too hard or too costly for you to understand.

Restatement: Marginal improvements in understandability and transparency can make it much easier to build useful systems, but they don’t necessarily make a meaningful difference to our ability to build secure systems. At increasingly higher levels of understanding, they let us change the type of work needed to exploit the system; this is not the same as a design that is safe no matter how much optimisation power is directed against us.

I wrote this in response to Ray trying to figure out how to tell whether any given type of machine learning research is making differential progress. The specific type of research discussed in that thread has a more detailed story which I won’t go into here, and mostly seems very helpful from my layman’s perspective, but I think that research “either being of zero impact, or else making the whole field more transparent/understandable” does not mean that the research makes differential progress on making the system secure. Transparency can increase usefulness without increasing security.

In one sense, a machine learning system is maximally transparent: I can see every part of what it is doing. But so long as I don’t understand its reasoning, so long as there are levels on which I don’t know what it’s thinking, by default I’m not confident that adversarial thought hasn’t just moved there instead.

Current Technical Work on Security

From this perspective, let me talk about the research that seems to be aiming to help on the security front. This is not all the work being done, just the work that I feel I understand well enough to summarise from this perspective.

My understanding is that the main work attempting to pinpoint where optimisation enters the system in surprising ways is Hubinger, Mikulik, Skalse, van Merwijk and Garrabrant’s work on risks from learned optimisation (paper, sequence). This gives names to many concepts describing how optimisers work, and asks questions like:

  • Under what conditions will my learned algorithm itself do optimisation?

  • When the learned algorithm does optimisation, what will its objective be, and what will the relationship be between its objective and the loss function of the neural net that produced it?

  • If the learned optimiser has successfully built a model of the objective function that was used to build it, what conditions predict whether it will be working around my objective as opposed to toward it?

  • When should I expect the optimiser in the learned algorithm to try to deceive me?

The paper also asks whether it’s possible to prevent influence-seeking algorithms from entering your systems by placing complexity measures on the system, such as time and space penalties. On this topic, Paul Christiano has asked whether requiring systems to be maximally efficient according to circuit description length removes all adversarial behaviour, and Evan has offered an answer in the negative.

It’s also the case that the Agent Foun­da­tions team at MIRI is try­ing to think about the prob­lem of in­ner al­ign­ment more broadly, and poke at var­i­ous con­cepts around here, such as in their write­ups on Ro­bust Del­e­ga­tion and Sub­sys­tem Align­ment. This ex­plores many sim­ple back­ground ques­tions to which we don’t have prin­ci­pled an­swers, and can­not draw toy mod­els of in­tel­li­gent agents that re­li­ably get these prob­lems right.

  • Is there a principled way to figure out whether I should trust that something more intelligent than me shares my values, given that I can’t figure out exactly what it’s going to do? If I am a child, sometimes adults will do the opposite of what I want; is there a way of figuring out whether they’re doing this in accordance with my goals?

  • How should I tell a more intelligent agent than me what I want it to do, given that I don’t know everything about what I want? This is especially hard given that optimisation amplifies the differences between what I say I want and what I actually want (aka Goodhart’s Law; see the toy sketch after this list).

  • How do I make sure the different parts of a mind are in a good balance, rather than some parts overpowering other parts? When it comes to my own mind, sometimes different parts get out of whack and I become too self-critical, or overconfident, or depressed, or manic. Is there a principled way of thinking about this?

  • How do I give another agent a good description of what to do in a domain, without teaching them everything I know about the domain? This is a problem in companies, where sometimes people who don’t understand the whole vision can make bad tradeoffs.
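As a toy illustration of the Goodhart point in the second bullet above (my own sketch, with made-up numbers): a proxy that correlates well with what I actually want still comes apart from it under heavy optimisation, because selecting hard on the proxy selects for the error term as much as for the value.

```python
import random

# Toy Goodhart sketch (my own illustration, made-up numbers): the harder you
# optimise a proxy, the more of the winning score is error rather than value.

random.seed(0)

# Each candidate action has a true value plus an independent measurement
# error; the proxy score is their sum, so proxy and true value correlate well.
candidates = [{"true": random.gauss(0, 1), "error": random.gauss(0, 1)}
              for _ in range(100000)]

def proxy(c):
    return c["true"] + c["error"]

for power in (10, 1000, 100000):
    best = max(candidates[:power], key=proxy)   # more search = more optimisation
    gap = proxy(best) - best["true"]
    print(f"search over {power:6d} candidates: proxy={proxy(best):5.2f}, "
          f"true value={best['true']:5.2f}, gap={gap:5.2f}")
# On average, the gap between what the proxy says I'm getting and what I
# actually get grows with optimisation pressure (regressional Goodhart).
```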

That’s the work I feel I have a ba­sic un­der­stand­ing of. I’m cu­ri­ous about ex­pla­na­tions of how other work fits into this frame­work.