Partial Agency

Epistemic sta­tus: very rough in­tu­itions here.

I think there’s some­thing in­ter­est­ing go­ing on with Evan’s no­tion of my­opia.

Evan has been call­ing this thing “my­opia”. Scott has been call­ing it “stop-gra­di­ents”. In my own mind, I’ve been call­ing the phe­nomenon “di­rec­tion­al­ity”. Each of these words gives a differ­ent set of in­tu­itions about how the cluster could even­tu­ally be for­mal­ized.


Nash equilibria are, abstractly, modeling agents via an equation like $a^* = \operatorname{argmax}_a f(a, a^*)$. In words: $a^*$ is the agent’s mixed strategy. The payoff $f$ is a function of the mixed strategy in two ways: the first argument is the causal channel, where actions directly have effects; the second argument represents the “acausal” channel, IE, the fact that the other players know the agent’s mixed strategy and this influences their actions. The agent is maximizing across the first channel, but “ignoring” the second channel; that is why we have to solve for a fixed point to find Nash equilibria. This motivates the notion of “stop gradient”: if we think in terms of neural-network type learning, we’re sending the gradient through the first argument but not the second. (It’s a kind of mathematically weird thing to do!)
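As a rough sketch of the difference between sending the gradient through one argument versus both, consider a toy payoff f(a, a*) in which the first argument is the agent’s own probability of going straight in a chicken-like game and the second is the probability the other player uses. Everything here is an assumption made for illustration: the payoff numbers, the helper names (`payoff`, `partial_first`, `total`, `ascend`), and above all the modeling choice that the other player simply mirrors the agent’s mixed strategy.

```python
def payoff(a, a_star):
    """Payoff for going straight with probability `a` against an opponent
    who goes straight with probability `a_star`.
    Hypothetical numbers: +1 for winning, -1 for yielding, -10 for a crash."""
    return a * (1 - a_star) - (1 - a) * a_star - 10 * a * a_star


def partial_first(f, a, a_star, eps=1e-6):
    """Finite-difference gradient through the first (causal) argument only.
    Holding `a_star` fixed plays the role of the stop-gradient."""
    return (f(a + eps, a_star) - f(a - eps, a_star)) / (2 * eps)


def total(f, a, eps=1e-6):
    """Finite-difference gradient through both arguments at once."""
    return (f(a + eps, a + eps) - f(a - eps, a - eps)) / (2 * eps)


def ascend(grad, a=0.5, lr=0.02, steps=500):
    """Gradient ascent on a probability, clipped to [0, 1]."""
    for _ in range(steps):
        a = min(1.0, max(0.0, a + lr * grad(a)))
    return a


# Stop-gradient learner: improve `a` while treating the mirrored copy as fixed.
myopic = ascend(lambda a: partial_first(payoff, a, a))
# Full-gradient learner: account for the mirrored copy moving with `a`.
full = ascend(lambda a: total(payoff, a))

print(f"stop-gradient fixed point: {myopic:.3f}")
print(f"full-gradient optimum:     {full:.3f}")
print("full gradient earns more:", payoff(full, full) > payoff(myopic, myopic))
```

Under the mirroring assumption, the stop-gradient learner settles where going straight and swerving pay equally (the symmetric mixed equilibrium), while the full-gradient learner finds a strategy that earns more once the second channel is counted. With a different reaction model, for instance opponents who yield to a known aggressive strategy, the non-myopic optimum would shift toward going straight more often instead.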


Thinking in terms of iterated games, we can also justify the label “myopia”. Thinking in terms of “gradients” suggests that we’re doing some kind of training involving repeatedly playing the game. But we’re training an agent to play as if it’s a single-shot game: the gradient is rewarding behavior which gets more reward within the single round even if it compromises long-run reward. This is a weird thing to do: why implement a training regime to produce strategies like that, if we believe the Nash-equilibrium model, IE we think the other players will know our mixed strategy and react to it? We can, for example, win chicken by going straight more often than is myopically rational. Generally speaking, we expect to get better rewards in the rounds after training if we optimized for non-myopic strategies during training.


To jus­tify my term “di­rec­tion­al­ity” for these phe­nom­ena, we have to look at a differ­ent ex­am­ple: the idea that “when be­liefs and re­al­ity don’t match, we change our be­liefs”. IE: when op­ti­miz­ing for truth, we op­ti­mize “only in one di­rec­tion”. How is this pos­si­ble? We can write down a loss func­tion, such as Bayes’ loss, to define ac­cu­racy of be­lief. But how can we op­ti­mize it only “in one di­rec­tion”?

We can see that this is the same thing as myopia. When training predictors, we only consider the efficacy of hypotheses one instance at a time. Consider supervised learning: we have “questions” $x_1, x_2$, etc. and are trying to learn “answers” $y_1, y_2$, etc. If a neural network were somehow able to mess with the training data, it would not have much pressure to do so. If it could give an answer on instance $x_1$ which improved its ability to answer on $x_2$ by manipulating $y_2$, the gradient would not specially favor this. Suppose it is possible to take some small hit (in log-loss terms) on $y_1$ for a large gain on $y_2$. The large gain for $x_2$ would not reinforce the specific neural patterns responsible for making $y_2$ easy (only the patterns responsible for successfully taking advantage of the easiness). The small hit on $x_1$ means there’s an incentive not to manipulate $y_2$.

It is possible that the neural network learns to manipulate the data, if by chance the neural patterns which shift $y_2$ are the same as those which successfully exploit the manipulation at $x_2$. However, this is a fragile situation: if there are other neural sub-patterns which are equally capable of giving the easy answer on $x_2$, the reward gets spread around. (Think of these as parasites taking advantage of the manipulative strategy without doing the work necessary to sustain it.) Because of this, the manipulative sub-pattern may not “make rent”: the amount of positive gradient it gets may not make up for the hit it takes on $x_1$. And all the while, neural sub-patterns which do better on $x_1$ (by refusing to take the hit) will be growing stronger. Eventually they can take over. This is exactly like myopia: strategies which do better in a specific case are favored for that case, despite global loss. The neural network fails to successfully coordinate with itself to globally minimize loss.

To see why this is also like stop-gradients, think about the loss function as $\ell(w, w^*)$: the neural weights $w$ determine loss through a “legitimate” channel (the prediction quality on a single instance), plus an “illegitimate” channel (the cross-instance influence which allows manipulation of $y_2$ through the answer given for $x_1$). We’re optimizing through the first channel, but not the second.

The difference between supervised learning and reinforcement learning is just: reinforcement learning explicitly tracks helpfulness of strategies across time, rather than assuming a high score at time $t$ has to do only with behaviors at time $t$! As a result, RL can coordinate with itself across time, whereas supervised learning cannot.
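A toy version of this contrast, as a sketch rather than a claim about any particular algorithm: a learner picks a first action in a two-step episode, and we vary how much of the later reward is credited back to that action. The reward numbers, the action names, and the function `learn_values` with its `gamma` parameter are all hypothetical, chosen to echo the “small hit now, large gain later” situation.

```python
import random

# Hypothetical two-step episode: (reward at step 1, reward at step 2)
# for each choice of step-1 action.
REWARDS = {
    "invest": (-0.1, 1.0),  # small hit now, large gain later
    "grab": (0.0, 0.0),     # nothing now, nothing later
}


def learn_values(gamma, episodes=2000, lr=0.1, explore=0.1, seed=0):
    """Epsilon-greedy value learning for the step-1 action.

    gamma=0 credits the action with its immediate reward only (myopic);
    gamma=1 credits it with everything that follows in the episode.
    """
    rng = random.Random(seed)
    value = {action: 0.0 for action in REWARDS}
    for _ in range(episodes):
        if rng.random() < explore:
            action = rng.choice(list(REWARDS))
        else:
            action = max(value, key=value.get)
        now, later = REWARDS[action]
        target = now + gamma * later  # how much of the future gets credited
        value[action] += lr * (target - value[action])
    return value


myopic = learn_values(gamma=0.0)
farsighted = learn_values(gamma=1.0)
print("myopic prefers:    ", max(myopic, key=myopic.get))
print("farsighted prefers:", max(farsighted, key=farsighted.get))
```

The myopic learner settles on `grab` and the farsighted one on `invest`, even though both see exactly the same rewards. The only difference is whether the score at the later step is allowed to reinforce the earlier behavior.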

Keep in mind that this is a good thing: the al­gorithm may be “leav­ing money on the table” in terms of pre­dic­tion ac­cu­racy, but this is ex­actly what we want. We’re try­ing to make the map match the ter­ri­tory, not the other way around.

Important side-note: this argument obviously has some relation to the question of how we should think about inner optimizers and how likely we should expect them to be. However, I think it is not a direct argument against inner optimizers. (1) The emergence of an inner optimizer is exactly the sort of situation where the gradients end up all feeding through one coherent structure. Other potential neural structures cannot compete with the sub-agent, because it has started to intelligently optimize; few interlopers can take advantage of the benefits of the inner optimizer’s strategy, because they don’t know enough to do so. So, all gradients point to continuing the improvement of the inner optimizer rather than alternate more-myopic strategies. (2) Being an inner optimizer is not synonymous with non-myopic behavior. An inner optimizer could give myopic responses on the training set while internally having less-myopic values. Or, an inner optimizer could have myopic but very divergent values. Importantly, an inner optimizer need not take advantage of any data-manipulation of the training set of the kind I’ve described; it need not even have access to any such opportunities.

The Par­tial Agency Paradox

I’ve given a cou­ple of ex­am­ples. I want to quickly give some more to flesh out the clusters as I see them:

  • As I said, my­opia is “par­tial agency” whereas fore­sight is “full agency”. Think of how an agent with high time-prefer­ence (ie steep tem­po­ral dis­count­ing) can be money-pumped by an agent with low time-prefer­ence. But the limit of no-tem­po­ral-dis­count­ing-at-all is not always well-defined.

  • An updateful agent is “partial agency” whereas updatelessness is “full agency”: the updateful agent is failing to use some channels of influence to get what it wants, because it already knows those things and can’t imagine them going differently. Again, though, full agency seems to be an idealization we can’t quite reach: we don’t know how to think about updatelessness in the context of logical uncertainty, only more- or less-updateful strategies.

  • I gave the beliefs←territory example. We can also think about the values→territory case: when the world differs from our preferences, we change the world, not our preferences. This has to do with avoiding wireheading.

  • Similarly, we can think of ex­am­ples of cor­rigi­bil­ity—such as re­spect­ing an off but­ton, or avoid­ing ma­nipu­lat­ing the hu­mans—as par­tial agency.

  • Causal de­ci­sion the­ory is more “par­tial” and ev­i­den­tial de­ci­sion the­ory is less so: EDT wants to rec­og­nize more things as le­gi­t­i­mate chan­nels of in­fluence, while CDT claims they’re not. Keep in mind that the math of causal in­ter­ven­tion is closely re­lated to the math which tells us about whether an agent wants to ma­nipu­late a cer­tain vari­able—so there’s a close re­la­tion­ship be­tween CDT-vs-EDT and wire­head­ing/​cor­rigi­bil­ity.

I think peo­ple of­ten take a pro- or anti- par­tial agency po­si­tion: if you are try­ing to one-box in New­comblike prob­lems, try­ing to co­op­er­ate in pris­oner’s dilemma, try­ing to define log­i­cal up­date­less­ness, try­ing for su­per­ra­tional­ity in ar­bi­trary games, etc… you are gen­er­ally try­ing to re­move bar­ri­ers to full agency. On the other hand, if you’re try­ing to avert in­stru­men­tal in­cen­tives, make sure an agent al­lows you to change its val­ues, or doesn’t pre­vent you from press­ing an off but­ton, or doesn’t ma­nipu­late hu­man val­ues, etc… you’re gen­er­ally try­ing to add bar­ri­ers to full agency.

I’ve his­tor­i­cally been more in­ter­ested in drop­ping bar­ri­ers to full agency. I think this is par­tially be­cause I tend to as­sume that full agency is what to ex­pect in the long run, IE, “all agents want to be full agents”—evolu­tion­ar­ily, philo­soph­i­cally, etc. Full agency should re­sult from in­stru­men­tal con­ver­gence. At­tempts to en­g­ineer par­tial agency for spe­cific pur­poses feel like fight­ing against this im­mense pres­sure to­ward full agency; I tend to as­sume they’ll fail. As a re­sult, I tend to think about AI al­ign­ment re­search as (1) need­ing to un­der­stand full agency much bet­ter, (2) need­ing to mainly think in terms of al­ign­ing full agency, rather than avert­ing risks through par­tial agency.

How­ever, in con­trast to this his­tor­i­cal view of mine, I want to make a few ob­ser­va­tions:

  • Partial agency sometimes seems like exactly what we want, as in the case of map←territory optimization, rather than a crude hack which artificially limits things.

  • In­deed, par­tial agency of this kind seems fun­da­men­tal to full agency.

  • Par­tial agency seems ubiquitous in na­ture. Why should I treat full agency as the de­fault?

So, let’s set aside pro/​con po­si­tions for a while. What I’m in­ter­ested in at the mo­ment is the de­scrip­tive study of par­tial agency as a phe­nomenon. I think this is an or­ga­niz­ing phe­nomenon be­hind a lot of stuff I think about.

The partial agency paradox is: why do we see partial agency naturally arising in certain contexts? Why are agents (so often) myopic? Why have a notion of “truth” which is about map←territory fit but not the other way around? Partial agency is a weird thing. I understand what it means to optimize something. I understand how a selection process can arise in the world (evolution, markets, machine learning, etc), which drives things toward maximization of some function. Partial optimization is a comparatively weird thing. Even if we can set up a “partial selection process” which incentivises maximization through only some channels, wouldn’t it be blind to the side-channels, and so unable to enforce partiality in the long-term? Can’t someone always come along and do better via full agency, no matter how our incentives are set up?

Of course, I’ve already said enough to sug­gest a re­s­olu­tion to this puz­zle.

My ten­ta­tive re­s­olu­tion to the para­dox is: you don’t build “par­tial op­ti­miz­ers” by tak­ing a full op­ti­mizer and try­ing to add care­fully bal­anced in­cen­tives to cre­ate in­differ­ence about op­ti­miz­ing through a spe­cific chan­nel, or any­thing like that. (In­differ­ence at the level of the se­lec­tion pro­cess does not lead to in­differ­ence at the level of the agents evolved by that se­lec­tion pro­cess.) Rather, par­tial agency is what se­lec­tion pro­cesses in­cen­tivize by de­fault. If there’s a learn­ing-the­o­retic setup which in­cen­tivizes the de­vel­op­ment of “full agency” (what­ever that even means, re­ally!) I don’t know what it is yet.


Learn­ing is ba­si­cally epi­sodic. In or­der to learn, you (sort of) need to do the same thing over and over, and get feed­back. Re­in­force­ment learn­ing tends to as­sume er­godic en­vi­ron­ments so that, no mat­ter how badly the agent messes up, it even­tu­ally re-en­ters the same state so it can try again—this is a “soft” epi­sode bound­ary. Similarly, RL tends to re­quire tem­po­ral dis­count­ing—this also cre­ates a soft epi­sode bound­ary, be­cause things far enough in the fu­ture mat­ter so lit­tle that they can be thought of as “a differ­ent epi­sode”.
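To make the “soft boundary” from discounting a little more concrete, here is a small worked calculation. It assumes the standard geometric scheme in which a reward $k$ steps ahead is weighted by $\gamma^k$; the symbols $\gamma$ and $H$ are introduced here for illustration only. The share of total weight that falls on everything at least $H$ steps away is:

```latex
\frac{\sum_{k=H}^{\infty} \gamma^{k}}{\sum_{k=0}^{\infty} \gamma^{k}}
  = \frac{\gamma^{H} / (1 - \gamma)}{1 / (1 - \gamma)}
  = \gamma^{H},
\qquad 0 \le \gamma < 1 .
```

With $\gamma = 0.99$, for instance, everything a thousand or more steps ahead carries roughly $4.3 \times 10^{-5}$ of the total weight, so from the learner’s point of view it might as well belong to a different episode.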

So, like map←territory learning (that is, epistemic learning), we can kind of expect any type of learning to be myopic to some extent.

This fits the pic­ture where full agency is an ideal­iza­tion which doesn’t re­ally make sense on close ex­am­i­na­tion, and par­tial agency is the more real phe­nomenon. How­ever, this is ab­solutely not a con­jec­ture on my part that all learn­ing al­gorithms pro­duce par­tial agents of some kind rather than full agents. There may still be frame­works which al­low us to ap­proach full agency in the limit, such as tak­ing the limit of diminish­ing dis­count fac­tors, or con­sid­er­ing asymp­totic be­hav­ior of agents who are able to make pre­com­mit­ments. We may be able to achieve some as­pects of full agency, such as su­per­ra­tional­ity in games, with­out oth­ers.

Again, though, my in­ter­est here is more to un­der­stand what’s go­ing on. The point is that it’s ac­tu­ally re­ally easy to set up in­cen­tives for par­tial agency, and not so easy to set up in­cen­tives for full agency. So it makes sense that the world is full of par­tial agency.

Some ques­tions:

  • To what ex­tent is it re­ally true that set­tings such as su­per­vised learn­ing dis­in­cen­tivize strate­gic ma­nipu­la­tion of the data? Can my ar­gu­ment be for­mal­ized?

  • If think­ing about “op­ti­miz­ing a func­tion” is too coarse-grained (a su­per­vised learner doesn’t ex­actly min­i­mize pre­dic­tion er­ror, for ex­am­ple), what’s the best way to re­vise our con­cepts so that par­tial agency be­comes ob­vi­ous rather than coun­ter­in­tu­itive?

  • Are there bet­ter ways of char­ac­ter­iz­ing the par­tial­ity of par­tial agents? Does my­opia cover all cases (so that we can un­der­stand things in terms of time-prefer­ence), or do we need the more struc­tured stop-gra­di­ent for­mu­la­tion in gen­eral? Or per­haps a more causal-di­a­gram-ish no­tion, as my “di­rec­tion­al­ity” in­tu­ition sug­gests? Do the differ­ent ways of view­ing things have nice re­la­tion­ships to each other?

  • Should we view partial agents as multiagent systems? I’ve characterized it in terms of something resembling game-theoretic equilibrium. The ‘partial’ optimization of a function arises from the price of anarchy, or as it’s known around LessWrong, Moloch. Are partial agents really bags of full agents keeping each other down? This seems a little true, to me, but also doesn’t strike me as the most useful way of thinking about partial agents. For one thing, it takes full agents as a necessary concept to build up partial agents, which seems wrong to me.

  • What’s the re­la­tion­ship be­tween the se­lec­tion pro­cess (learn­ing pro­cess, mar­ket, …) and the type of par­tial agents in­cen­tivised by it? If we think in terms of my­opia: given a type of my­opia, can we de­sign a train­ing pro­ce­dure which tracks or doesn’t track the rele­vant strate­gic in­fluences? If we think in terms of stop-gra­di­ents: we can take “stop-gra­di­ent” liter­ally and stop there, but I sus­pect there is more to be said about de­sign­ing train­ing pro­ce­dures which dis­in­cen­tivize the strate­gic use of speci­fied paths of in­fluence. If we think in terms of di­rec­tion­al­ity: how do we get from the ab­stract “change the map to match the ter­ri­tory” to the con­crete de­tails of su­per­vised learn­ing?

  • What does par­tial agency say about in­ner op­ti­miz­ers, if any­thing?

  • What does partial agency say about corrigibility? My hope is that there’s a version of corrigibility which is a perfect fit in the same way that map←territory optimization seems like a perfect fit.

Ul­ti­mately, the con­cept of “par­tial agency” is prob­a­bly con­fused. The par­tial/​full clus­ter­ing is very crude. For ex­am­ple, it doesn’t make sense to think of a non-wire­head­ing agent as “par­tial” be­cause of its re­fusal to wire­head. And it might be odd to con­sider a my­opic agent as “par­tial”—it’s just a time-prefer­ence, noth­ing spe­cial. How­ever, I do think I’m point­ing at a phe­nomenon here, which I’d like to un­der­stand bet­ter.