Deconfuse Yourself about Agency

This post is a result of numerous discussions with other participants and organizers of the MIRI Summer Fellows Program 2019.

I recently (hopefully) dissolved some of my confusion about agency. In the first part of the post, I describe a concept that I believe to be central to most debates around agency. I then briefly list some questions and observations that remain interesting to me. The gist of the post should make sense without reading any of the math.

Anthropomorphization, but with architectures that aren’t humans

Consider the following examples of “architectures”:

Example (architectures)

  1. Architectures I would intuitively call “agenty”:

    1. Monte Carlo tree search algorithm, parametrized by the number of rollouts made each move and the utility function (or heuristic) used to evaluate positions.

    2. (semi-vague) “Classical AI agent” with several interconnected modules (utility function and world model, actions, planning algorithm, and observations used for learning and updating the world model).

    3. (vague) Human parametrized by their goals, knowledge, and skills (and, of course, many other details).

  2. Architectures I would intuitively call “non-agenty”:

    1. A hard-coded sequence of actions.

    2. Look-up table.

    3. Random generator (outputting o ∼ P on every input, for some probability distribution P).

  3. Multi-agent architectures[1]:

    1. Ant colony.

    2. Company (consisting of individual employees, operating within an economy).

    3. Comprehensive AI services.

Working definition: An architecture is some model A, parametrizable by θ ∈ Θ, that receives inputs, produces outputs, and possibly keeps an internal state. We denote specific instances of A as A_θ.
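To make the working definition concrete, here is a minimal Python sketch (my own illustration; all names and the toy action set are hypothetical, not from the post). An architecture is a recipe that turns parameters θ into a concrete input-output model A_θ:

```python
# An "architecture" A maps parameters theta to a concrete model A_theta
# (a function from observations to outputs). Two toy architectures:

# Non-agenty (Example 2.2): a look-up table, parametrized by the table itself.
def lookup_table(table):
    return lambda obs: table[obs]

# Agenty: greedy utility maximization over a fixed (hypothetical) action set,
# parametrized by a utility function over (observation, action) pairs.
ACTIONS = ["left", "right", "stay"]

def utility_maximizer(utility):
    return lambda obs: max(ACTIONS, key=lambda a: utility(obs, a))

# Specific instances A_theta of each architecture:
table_instance = lookup_table({"cliff": "stay", "plain": "left"})
agent_instance = utility_maximizer(
    lambda obs, a: 1.0 if (obs, a) == ("cliff", "stay") else 0.0
)

# Both instances receive inputs and produce outputs, as the definition requires.
print(table_instance("cliff"), agent_instance("cliff"))  # stay stay
```

Note that the two instances behave identically on this input even though one architecture is intuitively agenty and the other is not; this foreshadows the non-exclusivity point below.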

Generalizing anthropomorphization

Throughout the post, X will refer to some object, process, entity, etc., whose behavior we want to predict or understand. Examples include rocks, wind, animals, humans, AGIs, economies, families, or the universe.

A standard item in the human mental toolbox is anthropomorphization: modeling various things as humans (specifically, ourselves) with “funny” goals or abilities. We can make the same mental move for architectures other than humans:

Working definition (A_θ-morphization): Let A be an architecture. Then any[2] model of X as an instance A_θ of A is an A_θ-morphization of X.

Anthropomorphization makes good predictions for other humans and some animals (curiosity, fear, hunger). On the other hand, it doesn’t work so well for rocks, lightning, and AGIs—not that this prevents us from using it anyway. We can measure the usefulness of A_θ-morphization by the degree to which it makes good predictions:

Working definition (prediction error): Suppose X exists in a world W and E = (E_1, E_2, …) is a sequence of variables (events about X) that we want to predict. Suppose that e is how E actually unfolds and p is the prediction obtained by A_θ-morphizing X. The prediction error of A_θ (w.r.t. E and X in W) is the expected Brier score of p with respect to e.

Informally, we say that A_θ-morphizing X is accurate if the corresponding prediction error is low.[3]
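As a toy illustration of the prediction-error definition (my own example, assuming binary events and equal weighting over them), the Brier score is just the mean squared difference between forecast probabilities and realized outcomes:

```python
def brier_score(predictions, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(predictions, outcomes)) / len(outcomes)

# X = a rock rolling downhill; events E_t = "the rock is lower at time t+1".
outcomes = [1, 1, 1, 1]                      # e: how X actually unfolds
physics_morph = [0.99, 0.99, 0.99, 0.99]     # p from a "laws of physics" model
coinflip_morph = [0.5, 0.5, 0.5, 0.5]        # p from a random-generator model

print(brier_score(physics_morph, outcomes))   # ~0.0001: accurate morphization
print(brier_score(coinflip_morph, outcomes))  # 0.25: inaccurate morphization
```

Lower scores mean better predictions, so the "laws of physics" morphization of the rock counts as accurate while the random-generator one does not.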

When do we call things agents?

Main claim:

  1. I claim that the question “Is X an agent?” is without substance, and we should instead be asking “From the point of view of some external observer O, does X seem to exhibit agent-like behavior?”.

  2. Moreover, “agent-like behavior” also seems ill-defined, because what we associate with “agency” is subjective. I propose to explicitly operationalize the question as “Is A_θ-morphizing X accurate?”.

(A related question is how difficult it is for us to “run” A_θ. Indeed, we anthropomorphize so many things precisely because it is cheap for us to do so.)

Relatedly, I believe we already implicitly do this operationalization: Suppose you talk to your favorite human H about agency. H will likely subconsciously associate agency with certain architectures, maybe such as those in Example 1.1-3. Moreover, H will ascribe varying degrees of agency to different architectures—for me, 1.3 seems more agenty than 1.1. Similarly, there are some architectures that H will associate with “definitely not an agent”. I conjecture that some X exhibits agent-like behavior according to H if it can be accurately predicted via A_θ-morphization for some agenty-to-H architecture A. Similarly, H would say that X exhibits non-agenty behavior if we can accurately predict it using some non-agenty-to-H architecture.

Critically, exhibiting agent-like-to-O and non-agenty-to-O behavior is not mutually exclusive, and I think this causes most of the confusion around agency. Indeed, we humans seem very agenty but, at the same time, determinism implies that there exists some hard-coded behavior that we enact. A rock rolling downhill can be viewed as merely obeying the non-agenty laws of physics, but what if it “wants to” get as low as possible? And, as a result, we sometimes go “Humans are definitely agents, and rocks are definitely not… though, wait, are they?”.
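One way to see this non-exclusivity in miniature (my own toy example, with hypothetical names): the very same trajectory of a rolling object is predicted perfectly both by a hard-coded-sequence morphization and by a "wants to get as low as possible" morphization.

```python
# X: a ball on a staircase with heights 3, 2, 1, 0.
heights = [3, 2, 1, 0]

# Non-agenty morphization (Example 2.1): a hard-coded sequence of positions.
def hardcoded(t):
    return heights[t]

# Agenty morphization: "the ball wants to be as low as possible" -- at each
# step it picks the lowest option among staying put or rolling one step down.
def goal_directed(t):
    if t == 0:
        return heights[0]
    prev = goal_directed(t - 1)
    return min(prev, prev - 1)

predictions_a = [hardcoded(t) for t in range(4)]
predictions_b = [goal_directed(t) for t in range(4)]
print(predictions_a, predictions_b)  # identical: [3, 2, 1, 0] twice
```

Both morphizations have zero prediction error here, so whether the ball "is an agent" is a fact about which model the observer reaches for, not about the ball.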

If we ban the concept of agency, which interesting problems remain?

“Agency” often comes up when discussing various alignment-related topics, such as the following:


How do we detect whether X performs (or is capable of performing) optimization? How can we detect this from X’s architecture (or causal origin) rather than by looking at its behavior? (This seems central to the topic of mesa-optimization.)

Agent-like behavior vs. agent-like architecture

Consider the following conjecture: “Suppose some X exhibits agent-like behavior. Does it follow that X physically contains an agent-like architecture, such as the one from Example 1.2?”. This conjecture is false—as an example, Q-learning is a “fairly agenty” architecture that leads to intelligent behavior. However, the resulting RL “agent” has a fixed policy and thus functions as a large look-up table. A better question would thus be whether there exists an agent-like architecture causally upstream of X. This question also has a negative answer, as witnessed by the example of an ant colony—agent-like behavior without agent-like architecture, produced by the “non-agenty” optimization process of evolution. Nonetheless, a general version of the question remains: If some X exhibits agent-like behavior, does it follow that there exists some interesting physical structure[4] causally upstream of X?[5]
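The Q-learning point can be made concrete. Below is a tiny tabular Q-learning sketch (my own toy MDP, not from the post): the training loop looks "agenty" (explore, evaluate, update toward the best future value), yet the deployed artifact is literally a dict, i.e. a look-up table from states to actions.

```python
import random

# Toy deterministic chain MDP (hypothetical): states 0..3, actions -1/+1,
# reward 1 for reaching state 3, which ends the episode.
def step(s, a):
    s2 = max(0, min(3, s + a))
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3

random.seed(0)
Q = {(s, a): 0.0 for s in range(4) for a in (-1, 1)}
alpha, gamma, eps = 0.5, 0.9, 0.2

# "Agenty" architecture: epsilon-greedy exploration plus Q-value updates.
for _ in range(500):
    s, done = 0, False
    while not done:
        if random.random() < eps:
            a = random.choice([-1, 1])
        else:
            a = max((-1, 1), key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, -1)], Q[(s2, 1)]) - Q[(s, a)])
        s = s2

# The deployed "agent": a fixed look-up table from non-terminal states to actions.
policy = {s: max((-1, 1), key=lambda act: Q[(s, act)]) for s in range(3)}
print(policy)  # once training converges, every state should map to +1
```

The final `policy` dict behaves indistinguishably from Example 2.2's look-up table, even though an agenty-looking process produced it.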

Moral standing

Suppose there is some X, which I model as having some goals. When taking actions, should I give weight to those goals? (The answer to this question seems more related to consciousness than to A_θ-morphization. Note also that a particularly interesting version of the question can be obtained by replacing “I” with “AGI”...)

PC or NPC?

When making plans, should we model X as a part of the environment, or does it enter our game-theoretical considerations? Is X able to model us?

Creativity, unbounded goals, environment-generality

In some sense, AlphaZero is an extremely capable game-playing agent. On the other hand, if we gave it access to the internet[6], it wouldn’t do anything with it. The same cannot be said for humans and unaligned AGIs, who would not only be able to orient themselves in this new environment but would eagerly execute elaborate plans to increase their influence. How can we tell whether some X is more like the former or the latter?

To summarize, I believe that many arguments and confusions surrounding agency can disappear if we explicitly use A_θ-morphization. This should allow us to focus on the problems listed above. Most definitions I gave are either semi-formal or informal, but I believe they could be made fully formal in more specific cases.

Regarding feedback: Suggestions for a better name for “A_θ-morphization” are super-welcome! If you know of an application for which such a formalization would be useful, please do let me know. Pointing out places where you expect a useful formalization to be impossible is also welcome.

  1. You might also view these multi-agent systems as monolithic agents, but this view might often give you wrong intuitions. I am including this category as an example that—intuitively—doesn’t belong to either of the “agent” and “not-agent” categories. ↩︎

  2. By default, we do not assume that the A_θ-morphization of X is useful in any way, or even the most useful among all instances of A. This goes against the intuition according to which we would pick some A_θ that is close to optimal (among all instances of A) for predicting X. I am currently unsure how to formalize this intuition, apart from requiring that A_θ is optimal (which seems too strong a condition). ↩︎

  3. Distinguishing between “small enough” and “too big” prediction errors seems non-trivial, since some environments are naturally more difficult to predict than others. Formalizing this will likely require additional insights. ↩︎

  4. An example of such an “interesting physical structure” would be an implementation of an optimization architecture. ↩︎

  5. Even if true, this conjecture will likely require some additional assumptions. Moreover, I expect “randomly-generated look-up tables that happen to stumble upon AGI by chance” to serve as a particularly relevant counterexample. ↩︎

  6. Whatever that means in this case. ↩︎