Power as Easily Exploitable Opportunities

(Talk given at an event on Sun­day 28th of June. TurnTrout is re­spon­si­ble for the talk, Ja­cob Lager­ros and David Lam­bert ed­ited the tran­script.

If you’re a cu­rated au­thor and in­ter­ested in giv­ing a 5-min talk, which will then be tran­scribed and ed­ited, sign up here.)

TurnTrout: Power and power-seek­ing play a big part in my think­ing about AI al­ign­ment, and I think it’s also an in­ter­est­ing topic more gen­er­ally.

Why is this a big deal? Why might we want to think about what power is? What is it, ex­actly, that peo­ple are think­ing of when they con­sider some­one as pow­er­ful?

Well, it seems like a lot of al­ign­ment failures are ex­am­ples of this power-seek­ing where the agent is try­ing to be­come more ca­pa­ble of achiev­ing its goals, whether that is get­ting out of its box, tak­ing over the world, or even just re­fus­ing cor­rec­tion or a shut­down.

If we tie these together, we have something I’ve called the catastrophic convergence conjecture, which is that if an agent’s goals aren’t aligned and it causes a catastrophe, it does so because of power-seeking.

But I think that when peo­ple are first con­sid­er­ing al­ign­ment, they think to them­selves, “What’s the big deal? You gave it a weird goal, and it does weird stuff. We’ll just fix that.”

So why ex­actly do un­al­igned goal max­i­miz­ers tend to cause catas­tro­phes? I think it’s be­cause of this power seek­ing. Let me ex­plain.

The way I think about power is as the abil­ity to achieve goals in gen­eral.

In the liter­a­ture, this is like the dis­po­si­tional power-to no­tion.

Whereas in the past, people thought “Well, in terms of causality, are the agent’s actions necessary and/or sufficient to cause a wider range of outcomes?”, here, I think it’s best thought of as your average ability to optimize a wide range of different goals. So if you formalize this as your average optimal value in, say, a Markov decision process (MDP), there are a lot of nice properties and you can prove that, at least in certain situations, it links up with instrumental convergence. Power seeking and instrumental convergence are very closely related.
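(Editor's illustration, not part of the talk.) A minimal sketch of "average optimal value" in a tiny deterministic MDP may help make this concrete. Everything here is an assumption made for the example: the five-state toy world, the discount of 0.9, rewards drawn IID uniform on [0, 1], and the function names. It is not the formal definition from the speaker's work, only one simple way to estimate the quantity by sampling.

```python
import random

def optimal_values(transitions, reward, gamma=0.9, iters=200):
    """Value iteration for a deterministic MDP with state-based reward.

    transitions[s] lists the states reachable from s in one step.
    """
    values = [0.0] * len(transitions)
    for _ in range(iters):
        values = [
            reward[s] + gamma * max(values[t] for t in transitions[s])
            for s in range(len(transitions))
        ]
    return values

def average_optimal_value(transitions, n_samples=500, gamma=0.9, seed=0):
    """Estimate each state's optimal value, averaged over sampled reward functions."""
    rng = random.Random(seed)
    n = len(transitions)
    totals = [0.0] * n
    for _ in range(n_samples):
        # Assumption: reward is IID uniform on [0, 1] over states.
        reward = [rng.random() for _ in range(n)]
        for s, v in enumerate(optimal_values(transitions, reward, gamma)):
            totals[s] += v
    return [t / n_samples for t in totals]

# Hypothetical toy world: state 0 (a "hub") can move to any of three
# absorbing states; state 4 (a "corridor") can only move to state 1.
TOY_WORLD = [[1, 2, 3], [1], [2], [3], [1]]

if __name__ == "__main__":
    power = average_optimal_value(TOY_WORLD)
    print(f"hub: {power[0]:.2f}, corridor: {power[4]:.2f}")
```

In this sketch the hub scores higher than the corridor because, for most sampled goals, having three options beats having one. That is the sense in which keeping options open is useful across many goals.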

But there’s a catch here. We’re talk­ing about av­er­age op­ti­mal value. This can be pretty weird. Let’s say you’re in the un­for­tu­nate situ­a­tion of hav­ing a dozen sol­diers about to shoot you. How pow­er­ful are you ac­cord­ing to av­er­age op­ti­mal value? Well, av­er­age op­ti­mal value is still prob­a­bly quite high.

There’s prob­a­bly an ad­ver­sar­ial in­put of strange mo­tor com­mands you could is­sue which would es­sen­tially in­ca­pac­i­tate all the sol­diers just be­cause they’re look­ing at you since their brains are not se­cure sys­tems. So each op­ti­mal policy would prob­a­bly start off with, “I do this weird se­ries of twitches, in­ca­pac­i­tate them, and then I just go about achiev­ing my goals.”

So we’d like to say, “well, your power is ac­tu­ally low­ered here in a sense”, or else we’d have to con­cede that it’s just wholly sub­jec­tive what peo­ple are think­ing of when they feel pow­er­ful.

My favorite solution is, instead of asking “how well could a bunch of different goals be achieved in principle?”, you should be asking, “how well could I, with the abilities I actually have, achieve many goals?”

If you imagine something like a learning algorithm A, you could say it’s a human-level learning algorithm. You give it a history of observations and a goal that it’s optimizing, and it produces a policy, or things that it should do to achieve this goal. You then say, “Well, what’s A’s average ability to optimize and to achieve goals given this history, in this situation?”

What I think this does is re­cover this com­mon sense no­tion of “you don’t have much power here be­cause these aren’t cog­ni­tively ac­cessible op­por­tu­ni­ties and poli­cies”. And so es­sen­tially, you are dis­em­pow­ered in this situ­a­tion.
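(Editor's illustration, not part of the talk.) One hedged way to picture power relative to an algorithm is to replace the optimal policy with a bounded planner and average what that planner actually achieves. In the sketch below, the planner, the `depth` limit standing in for bounded cognition, the nine-state graph, and the uniform reward distribution are all assumptions chosen for the example. The long chain from state 0 is a stand-in for the "weird series of twitches" that exists in principle but is not cognitively accessible.

```python
import random

def reachable_within(transitions, start, depth):
    """States the planner can find when it searches at most `depth` steps ahead."""
    seen = {start}
    frontier = {start}
    for _ in range(depth):
        frontier = {t for s in frontier for t in transitions[s]} - seen
        seen |= frontier
    return seen

def bounded_power(transitions, start, depth, n_samples=2000, seed=0):
    """Average reward of the best outcome a depth-limited planner can find.

    Assumption: each goal is a reward function drawn IID uniform on [0, 1],
    and the planner simply steers to the best state within its search depth.
    """
    rng = random.Random(seed)
    found = reachable_within(transitions, start, depth)
    total = 0.0
    for _ in range(n_samples):
        reward = [rng.random() for _ in range(len(transitions))]
        total += max(reward[s] for s in found)
    return total / n_samples

# Hypothetical world: from state 0 the agent can be captured (state 8) or
# follow a long, precise chain 1 -> 2 -> 3 -> 4 that opens up states 5, 6, 7.
SOLDIERS = [[1, 8], [2], [3], [4], [5, 6, 7], [5], [6], [7], [8]]

if __name__ == "__main__":
    print("shallow planner:", round(bounded_power(SOLDIERS, 0, depth=1), 3))
    print("deep planner:   ", round(bounded_power(SOLDIERS, 0, depth=5), 3))
```

The deep planner scores higher from the same state because it can find outcomes the shallow planner cannot. Measured against the shallow planner, the agent is disempowered even though the world itself is unchanged.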

I think understanding this also makes sense of what power means in a universe where everyone is only going to have one course of action: you view them as running algorithms and then ask, “Well, how well could this learning algorithm achieve different goals in the situation?” I think it might be important to evaluate AI designs by how they respect our power in particular, and so understanding what that means is probably important.

Also, if you want to do bet­ter than just hard goal max­i­miza­tion in al­ign­ing these AIs, then I think un­der­stand­ing ex­actly what the rot is at the heart of re­ward max­i­miza­tion is pretty im­por­tant as well. Thank you.


Daniel Filan: If I’m thinking about a learning algorithm like Q-learning or PPO or something, then it makes a lot of sense to think that it’s a function of a goal and a history. But in most situations, I tend to think of agents as the results of learning algorithms.

Take some Atari agent. It trained for a while and now it is like a de­ployed sys­tem. It is play­ing Atari and it is not man­i­festly a func­tion of a goal. Maybe it has a goal some­where in its neu­ral net­work and you could change some bits and it would have a differ­ent fol­low-up move, but that’s not ob­vi­ous.

So I’m won­der­ing, what do you think of this func­tional form of agents as func­tions of his­to­ries and goals?

TurnTrout: Good question. I think that when we’re making an objection like this, especially before we’ve solved more issues with embedded agency, we’re just going to have to say the following: “If we want to understand what this person is thinking of when they think of power, then even though it might not literally be true that you could cleanly decompose a person like this, it’s still a useful abstraction.”

I would agree that if we wanted to actually implement this and say, “Well, we’re looking at an agent, and we deduce what its learning algorithm is and what it would mean to have a modular goal input to the algorithm,” then you would really need to be worried about this. But my perspective, at least for right now in this early stage, is that it’s more of a conceptual tool. But I agree, you can’t split up a lot of agents like this.

Ben Pace: I’m cu­ri­ous if you have any more spe­cific ideas for mea­sur­ing which poli­cies are cur­rently at­tain­able by a par­tic­u­lar agent or al­gorithm — the no­tion of “at­tain­abil­ity” felt like it was do­ing a lot of work.

TurnTrout: I think the thing we’re as­sum­ing here is, imag­ine you have an al­gorithm that is about as in­tel­li­gent, with re­spect to the Legg-Hut­ter met­ric or some other more com­mon-sense no­tion, as a hu­man. Imag­ine you can give it a bunch of differ­ent re­ward func­tion in­puts. I think this is a good way of quan­tify­ing this agent’s power. But you’re ask­ing how we get this hu­man level al­gorithm?

Ben Pace: Yes. It just sounded like you said, “In this situ­a­tion, the hu­man agent, in prin­ci­ple, has an in­cred­ible amount of power be­cause there is a very spe­cific thing you can do.” But to ac­tu­ally mea­sure its im­pact, you have to talk about the space of ac­tual op­er­a­tions that it can find or some­thing.

And I thought, “I don’t have a good sense of how to define ex­actly what solu­tions are find­able by a hu­man, and which solu­tions are not find­able by a hu­man.” And similarly, you don’t know for var­i­ous AIs how to think about which ones are find­able. Be­cause at some point, some AI gets to do some mag­i­cal wire­head­ing thing, and there’s some bridge it crosses where you re­al­ize that you could prob­a­bly start tak­ing more con­trol in the world or some­thing. I don’t quite know how to mea­sure when those things be­come at­tain­able.

TurnTrout: There are a cou­ple ways you can pose con­straints through this frame­work, and one would be only giv­ing it a cer­tain amount of his­tory. You’re not giv­ing in­finite data.

Another one would be try­ing to get some bounded cog­ni­tion into the al­gorithm by just hav­ing it stop search­ing af­ter a cer­tain amount of time.

I don’t have clean an­swers for this yet, but I agree. Th­ese are good things to think about.

habryka: One thing that I’ve been most con­fused about for the for­mal­ism for power that you’ve been think­ing about, is that you do this av­er­ag­ing op­er­a­tion on your util­ity func­tion. But av­er­ag­ing over a space is not a free op­er­a­tion. You need some mea­sure on the space from which you sam­ple.

It feels to me like power only appears when you choose a very specific measure over the space of utility functions. For example, if I sub-sample from the space of utility functions that are extremely weird and really like not being able to do things, the resulting agent will only care about shutting itself off, and you won’t get any power-seeking behavior.

So am I mi­s­un­der­stand­ing things? Is this true?

TurnTrout: The ap­proach I’ve taken, like in my re­cent pa­per, for ex­am­ple, is to as­sume you’re in some sys­tem with finite states. You then take, for ex­am­ple, the MaxEnt dis­tri­bu­tion over re­ward func­tions or you as­sume that re­ward is, at least, IID over states. You then get a neu­tral­ity where I don’t think you need a ton of in­for­ma­tion about what the rea­son­able goals you should pur­sue are.

I think if you just take a MaxEnt dis­tri­bu­tion, you’ll re­cover the nor­mal no­tion of power. But if you’re talk­ing about util­ity func­tions, then be­cause there’s in­finitely many, it’s like, “Well, what’s the MaxEnt dis­tri­bu­tion over that?”

And so far, the theorems are about just finite MDPs. And if you’re only talking about finite MDPs and not some kind of universal prior, then you don’t need to worry about it being malign.
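(Editor's illustration, not part of the discussion.) The dependence on the choice of measure raised in this exchange can be sketched with a toy comparison. The setup is an assumption made for the example: the agent picks between one "off" state and three other absorbing states, and we count how often a sampled goal prefers something other than shutdown. The `shutdown_loving` distribution is a made-up stand-in for the "extremely weird" utility functions in the question, not anything taken from the paper.

```python
import random

def iid_uniform(rng, n_options):
    """Reward drawn IID uniform on [0, 1]; index 0 is the 'off' state."""
    return [rng.random() for _ in range(n_options + 1)]

def shutdown_loving(rng, n_options):
    """Hypothetical skewed measure: the 'off' state almost always looks great."""
    return [rng.uniform(0.9, 1.0)] + [rng.random() for _ in range(n_options)]

def fraction_avoiding_shutdown(sample_reward, n_options=3, n_samples=5000, seed=0):
    """Share of sampled goals whose best outcome is something other than 'off'."""
    rng = random.Random(seed)
    avoid = 0
    for _ in range(n_samples):
        reward = sample_reward(rng, n_options)
        if max(reward[1:]) > reward[0]:
            avoid += 1
    return avoid / n_samples

if __name__ == "__main__":
    print("IID uniform:    ", fraction_avoiding_shutdown(iid_uniform))
    print("shutdown-loving:", fraction_avoiding_shutdown(shutdown_loving))
```

Under the IID measure most sampled goals prefer to stay out of the "off" state, while under the skewed measure most prefer shutdown. The tendency to keep options open is therefore a property of the chosen distribution over goals as well as of the environment.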

Rob Miles: Some­thing I’m a lit­tle un­clear on is how this can ever change over time. I feel like that’s some­thing you want to say. Right now, you’re in the box. And then if you get out of the box, you have more power be­cause now there’s a path that you’re able to fol­low.

But if you are in the box and you can think of a good plan for get­ting out, isn’t there a sense that you already have that power? Be­cause you’re aware of a plan that gets you what you want via get­ting out of the box? How do you sep­a­rate power now from the po­ten­tial for power in the fu­ture?

TurnTrout: Good question. This is the big issue with thinking about power in terms of optimal value. If you have an agent that has consistent beliefs about the future, it’s not going to expect to gain more power.

If you’re try­ing to max­i­mize your power, you’re not go­ing to ex­pect, nec­es­sar­ily, to in­crease your power just due to con­ser­va­tion of ex­pected ev­i­dence. But if things hap­pen to you and you’re sur­prised by them, then you see your­self los­ing or gain­ing power, es­pe­cially if you’re not op­ti­mal.

So if it’s too hard for me to get out of the box or I think it’s too hard, but then some­one lets me out, only af­ter that would I see my­self as hav­ing a lot more power.