Attainable Utility Landscape: How The World Is Changed

(This is one interpretation of the prompt, in which you haven't chosen to go to the moon. If you imagined yourself as more prepared, that's also fine.)

If you were plopped onto the moon, you'd die pretty fast. Maybe the "die as quickly as possible" AU is high, but not much else—not even the "live on the moon" AU! We haven't yet reshaped the AU landscape on the moon to be hospitable to a wide range of goals. Earth is special like that.

AU landscape as a unifying frame

Attainable utilities are calculated by winding your way through possibility-space, considering and discarding possibility after possibility to find the best plan you can. This frame is unifying.

Sometimes you advantage one AU at the cost of another, moving through the state space towards the best possibilities for one goal and away from the best possibilities for another goal. This is opportunity cost.

Sometimes you gain more control over the future: most of the best possibilities make use of a windfall of cash. Sometimes you act to preserve control over the future: most Tic-Tac-Toe goals involve not ending the game right away. This is power.

Other people usually objectively impact you by decreasing or increasing a bunch of your AUs (generally, by changing your power). This happens for an extremely wide range of goals because of the structure of the environment.

Sometimes, the best possibilities are made unavailable or worsened only for goals very much like yours. This is value impact.

Sometimes a bunch of the best possibilities go through the same part of the future: fast travel to random places on Earth usually involves the airport. This is instrumental convergence.

Exercise: Track what's happening to your various AUs during the following story: you win the lottery. Being an effective spender, you use most of your cash to buy a majority stake in a major logging company. Two months later, the company goes under.

Technical appendix: AU landscape and world state contain equal information

In the context of finite deterministic Markov decision processes, there's a wonderful handful of theorems which basically say that the AU landscape and the environmental dynamics encode each other. That is, they contain the same information, just with different emphasis. This supports thinking of the AU landscape as a "dual" of the world state.

Let $\langle \mathcal{S}, \mathcal{A}, T, \gamma \rangle$ be a rewardless deterministic MDP with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, deterministic transition function $T : \mathcal{S} \times \mathcal{A} \to \mathcal{S}$, and discount factor $\gamma \in (0,1)$. As our interest concerns optimal value functions, we consider only stationary, deterministic policies: $\Pi := \{ \pi : \mathcal{S} \to \mathcal{A} \}$.

The first key insight is to consider not policies, but the trajectories induced by policies from a given state; to not look at the state itself, but the paths through time available from the state. We concern ourselves with the possibilities available at each juncture of the MDP.

To this end, for $\gamma \in (0,1)$, consider the mapping $\pi \mapsto \mathbf{f}^{\pi}$ (where $\mathbf{f}^{\pi}(s) := \sum_{t=0}^{\infty} \gamma^t \mathbf{e}_{s_t}$, with $s_0 = s$, $s_{t+1} = T(s_t, \pi(s_t))$, and $\mathbf{e}_{s_t}$ the standard basis vector of state $s_t$); in other words, each policy maps to a function mapping each state to a discounted state visitation frequency vector $\mathbf{f}^{\pi}(s)$, which we call a possibility. The meaning of each frequency vector is: starting in state $s$ and following policy $\pi$, what sequence of states do we visit in the future? States visited later in the sequence are discounted according to $\gamma$: the sequence $s, s', s'', \ldots$ would induce visitation frequency $1$ on $s$, visitation frequency $\gamma$ on $s'$, and visitation frequency $\gamma^2$ on $s''$.

The possibility function $\mathcal{F}$ outputs the possibilities available at a given state $s$:

$$\mathcal{F}(s) := \{\, \mathbf{f}^{\pi}(s) \mid \pi \in \Pi \,\}.$$

Put differently, the possibilities available are all of the potential film-strips of how-the-future-goes you can induce from the current state.
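
To make this concrete, here's a minimal sketch in Python (the three-state world, its transitions, and the policy are made up for illustration) that computes a possibility vector as a truncated discounted visitation sum:

```python
import numpy as np

states = ["s1", "s2", "s3"]
idx = {s: i for i, s in enumerate(states)}

# Deterministic transitions: T[state][action] -> next state.
T = {
    "s1": {"stay": "s1", "go": "s2"},
    "s2": {"stay": "s2", "go": "s3"},
    "s3": {"stay": "s3"},
}

def possibility(policy, start, gamma, horizon=200):
    """Discounted state-visitation frequency vector induced by `policy` from `start`.

    Truncated at `horizon` steps; the geometric tail is negligible for the gamma used here.
    """
    f = np.zeros(len(states))
    s = start
    for t in range(horizon):
        f[idx[s]] += gamma ** t
        s = T[s][policy[s]]
    return f

pi = {"s1": "go", "s2": "go", "s3": "stay"}
print(possibility(pi, "s1", gamma=0.5))  # ~[1, 0.5, 0.5]: s3 absorbs 0.25 + 0.125 + ... = 0.5
```

The entries of any possibility vector sum to $\frac{1}{1-\gamma}$ (here, $2$), a fact we'll lean on below.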

Possibility isomorphism

We say two rewardless MDPs $M_1$ and $M_2$ are isomorphic up to possibilities if they induce the same possibilities. Possibility isomorphism captures the essential aspects of an MDP's structure, while being invariant to state representation, state labelling, action labelling, and the addition of superfluous actions (actions whose results are duplicated by other actions available at that state). Formally, $M_1 \cong M_2$ when there exists a bijection $\phi : \mathcal{S}_1 \to \mathcal{S}_2$ (letting $P_\phi$ be the corresponding $|\mathcal{S}_1|$-by-$|\mathcal{S}_1|$ permutation matrix) satisfying $\mathcal{F}_2(\phi(s)) = P_\phi\, \mathcal{F}_1(s)$ for all $s \in \mathcal{S}_1$, with $P_\phi$ applied to each possibility in the set.

This isomorphism is a natural contender[1] for the canonical (finite) MDP isomorphism:

Theorem: $M_1$ and $M_2$ are isomorphic up to possibilities iff their directed graphs are isomorphic (and they have the same discount rate).
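
Here's a brute-force sketch of the graph-side condition (my own toy encoding of the transition digraphs as successor-set dictionaries; it only scales to tiny state spaces):

```python
from itertools import permutations

def digraph_isomorphic(g1, g2):
    """Check whether two digraphs (dicts: state -> set of successor states) are isomorphic."""
    if len(g1) != len(g2):
        return False
    nodes1, nodes2 = list(g1), list(g2)
    for perm in permutations(nodes2):
        phi = dict(zip(nodes1, perm))  # candidate bijection phi: S1 -> S2
        if all({phi[t] for t in g1[s]} == g2[phi[s]] for s in nodes1):
            return True
    return False

# Two relabelings of the same 3-state world. Superfluous duplicate actions are
# invisible here, since only the successor *sets* matter.
g_a = {"x": {"x", "y"}, "y": {"z"}, "z": {"z"}}
g_b = {"1": {"1"}, "2": {"2", "3"}, "3": {"1"}}
print(digraph_isomorphic(g_a, g_b))  # True: x->2, y->3, z->1
```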

Representation equivalence

Suppose I give you the following possibility sets, each containing the possibilities for a different state:

Exercise: What can you figure out about the MDP structure? Hint: each entry in the column corresponds to the visitation frequency of a different state; the first entry is always $s_1$'s, the second $s_2$'s, and the third $s_3$'s.

You can figure out everything: the rewardless MDP $\langle \mathcal{S}, \mathcal{A}, T, \gamma \rangle$, up to possibility isomorphism. Solution here.

How? Well, the $\ell_1$ norm of a possibility vector is always $\frac{1}{1-\gamma}$, so you can deduce $\gamma$ easily. The state with only a single possibility must be isolated, so we can mark that down in our graph. Also, its visitation frequency sits in the third entry, so it's $s_3$.

The other two states correspond to the "1" entries in their possibilities, so we can mark that down. The rest follows straightforwardly.
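
Here's the first deduction step in miniature, with a hypothetical possibility vector:

```python
# Every possibility vector's entries sum to 1 + gamma + gamma^2 + ... = 1/(1 - gamma),
# so the discount factor can be read off from any single possibility.
f = [1.0, 0.5, 0.5]   # hypothetical possibility vector (entries for s1, s2, s3)
total = sum(f)        # = 1 / (1 - gamma)
gamma = 1 - 1 / total
print(gamma)          # 0.5
```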

Theorem: Suppose the rewardless MDP $M$ has possibility function $\mathcal{F}$. Given only $\mathcal{F}$,[2] $M$ can be reconstructed up to possibility isomorphism.

In MDPs, the "AU landscape" is the set of optimal value functions for all reward functions over states in that MDP. If you know the optimal value functions for just $|\mathcal{S}|$ reward functions, you can also reconstruct the rewardless MDP structure.[3]
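
Here's a minimal sketch of a few slices of the AU landscape, computed by value iteration with one indicator reward per state (the transitions are the same made-up three-state world as the earlier sketch, with a different $\gamma$):

```python
import numpy as np

states = ["s1", "s2", "s3"]
idx = {s: i for i, s in enumerate(states)}
# Deterministic transitions, as in the earlier sketch.
T = {"s1": {"stay": "s1", "go": "s2"},
     "s2": {"stay": "s2", "go": "s3"},
     "s3": {"stay": "s3"}}
gamma = 0.9

def optimal_values(reward, iters=500):
    """Optimal value V*(s) of every state under `reward` (dict: state -> float)."""
    V = np.zeros(len(states))
    for _ in range(iters):
        V = np.array([reward[s] + gamma * max(V[idx[T[s][a]]] for a in T[s])
                      for s in states])
    return {s: round(v, 2) for s, v in zip(states, V)}

# One slice of the AU landscape per reward function (here, indicator rewards).
for goal in states:
    reward = {s: float(s == goal) for s in states}
    print(goal, optimal_values(reward))
```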

From the environment (rewardless MDP), you can deduce the AU landscape (all optimal value functions) and all possibilities. From possibilities, you can deduce the environment and the AU landscape. From the AU landscape, you can deduce the environment (and thereby all possibilities).

All of these encode the same mathematical object.

Technical appendix: Opportunity cost

Opportunity cost is when an action you take makes you more able to achieve one goal but less able to achieve another. Even this simple world has opportunity cost:

Going to the green state means you can't get to the purple state as quickly.
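
Here's a back-of-the-envelope version of that tradeoff, assuming a simple line world (purple state, starting state, green state) rather than the exact world pictured: stepping toward green pushes the best purple-reaching futures one step later, so the purple AU gets discounted.

```python
gamma = 0.9

def purple_au(steps_to_purple):
    # Best case: walk straight to purple and stay there forever. Its purple-visitation
    # frequency (equivalently, the optimal value for a reward of 1 on purple) is
    # gamma^steps / (1 - gamma).
    return gamma ** steps_to_purple / (1 - gamma)

print(purple_au(1))  # purple AU before stepping toward green: ~9.0
print(purple_au(2))  # purple AU after one step toward green: ~8.1 (strictly lower)
```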

On a deep level, why is the world structured such that this happens? Could you imagine a world without opportunity cost of any kind? The answer, again in the rewardless MDP setting, is simple: "yes, but the world would be trivial: you wouldn't have any choices". Using a straightforward formalization of opportunity cost, we have:

Theorem: Opportunity cost exists in an environment iff there is a state with more than one possibility.

Philosophically, opportunity cost exists when you have meaningful choices. When you make a choice, you're necessarily moving away from some potential future but towards another; since you can't be in more than one place at the same time, opportunity cost follows. Equivalently, we assumed the agent isn't infinitely farsighted ($\gamma < 1$); if it were, it would be possible to be in "more than one place at the same time", in a sense (thanks to Rohin Shah for this interpretation).

While understanding opportunity cost may seem like a side-quest, each insight is another brick in the edifice of our understanding of the incentives of goal-directed agency.

Notes

  • Just as game theory is a great abstraction for modelling competitive and cooperative dynamics, AU landscape is great for thinking about consequences: it automatically excludes irrelevant details about the world state. We can think about the effects of events without needing a specific utility function or ontology to evaluate them. In multi-agent systems, we can straightforwardly predict the impact the agents have on each other and the world.

  • "Objective impact to a location" means that agents whose plans route through the location tend to be objectively impacted.

  • The landscape is not the territory: AU is calculated with respect to an agent's beliefs, not necessarily with respect to what really "could" or will happen.


  1. The possibility isomorphism is new to my work, as are all other results shared in this post. This apparent lack of basic theory regarding MDPs is strange; even stranger, this absence was actually pointed out in two published papers!

     I find the existing MDP isomorphisms/equivalences to be pretty lacking. The details don't fit in this margin, but perhaps in a paper at some point. If you want to coauthor this (mainly compiling results, finding a venue, and responding to reviews), let me know and I can share what I have so far (extending well beyond the theorems in my recent work on power). ↩︎

  2. In fact, you can reconstruct the environment using only a limited subset of possibilities: the non-dominated possibilities. ↩︎

  3. As a tensor, the transition function $T$ has size $|\mathcal{S}|^2 |\mathcal{A}|$, while the AU landscape representation only has size $|\mathcal{S}|^2$. However, if you're just representing $T$ as a transition function, it has size $|\mathcal{S}||\mathcal{A}|$. ↩︎