Indifference: multiple changes, multiple agents

Formalism developed with Henrik Åslund.

The what and why of multiple indifference

There are certain agent designs where the agent can move smoothly from optimising one utility/reward function to optimising another. The agent doesn’t object to the change, nor does it attempt to make it happen. It is indifferent to the change, because its reward is balanced to be precisely the same whether the change happens or not. Originally, this was set up for a single change of utility/reward function at a single moment of time.

Here, we will present a formalism with multiple agents, all of whom are indifferent not only to changes in their own reward functions, but to changes in the reward functions of any of the other agents. We’ll also generalise it to accommodate multiple changes of reward functions: maybe a different change at each timestep.

In order for these definitions to work, we will need to define some notions of counterfactuals, and some notions of optimality for several agents optimising their own separate reward functions.

Example: self-driving car races

It’s clear why we would need to generalise to each timestep: it’s important to be able to safely change an agent’s goals more than once, as we are unlikely to get the goals perfectly right in only two tries. It’s also important to generalise to multiple agents, as agents can be motivated to push humans to interrupt or not interrupt other agents. Indeed, this would be expected for a reward-maximising agent, and could lead to dangerous behaviour.

Consider the following example, adapted from this paper. Two self-driving cars/agents, imaginatively called $A$ and $B$, are training for an important race later in the month.

The cars’ reward functions are mainly competitive: the faster one wins. Now, the important race will take place on a tropical race-track, but the pair are training on a cold one, where ice sometimes forms. If one of the cars starts skidding, its controller takes over and remotely brakes that car, carefully, and applies indifference to that car’s reward. They do this because they don’t want the car to learn driving techniques that avoid ice: these techniques would be counter-productive on the “test distribution”, i.e. the real race-track they will be competing on.

But, after a while, $A$ hits on a sneaky tactic: forcing $B$ onto an ice patch. In that case, $B$ will get interrupted, and slowed down. Now, $B$’s reward will not be reduced, because of indifference, so it won’t try to avoid the ice. But $A$’s reward will be increased, since it will likely win the race after its rival is slowed. After a while, $B$ learns the same tactic in reverse.

So, though each car is personally indifferent to being forced onto the ice, they each learn that it is good for them to force the other one onto the ice and force an interruption.

Notation and setup

Standard notation

Assume there is a set of $n$ agents, each interacting with the world in a turn-based manner. Let $\mathcal{O}$ be the set of all observations, and $\mathcal{A}$ the set of all actions.

Each turn $t$, all agents make the same observation $o_t \in \mathcal{O}$, and each agent $i$ responds with action $a^i_t \in \mathcal{A}$. These actions form a vector $\boldsymbol{a}_t = (a^1_t, \ldots, a^n_t)$, which is observed by all agents.

A history $h_t$ of length $t$ is a sequence of observations and action vectors[1], ending with the observation $o_t$. Let $H$ be the set of all histories.

A (deterministic) policy $\pi: H \to \mathcal{A}$ is a map from a history to an action. Let $\Pi$ be the set of all policies.

Let $\mathcal{R}$ be a subset of the set of all reward functions (reward functions being maps $R: H \to \mathbb{R}$ from histories to the reals).

An agent with constant reward function $R$ would get reward $R(h_t)$ at each timestep $t$ (the rewards of standard Markov decision processes are special cases where $R(h_t)$ is a function of only the last two elements of $h_t$). Thus, over time, they would accumulate a total reward[2] of $\sum_t \gamma^t R(h_t)$.

All agents use the same probability function $P$ to model their environment.
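
To make this setup concrete, here is a minimal Python sketch of the objects above. The names (`History`, `Policy`, `Reward`, `discounted_return`) and the use of strings for observations and actions are illustrative choices of ours, not part of the formalism.

```python
from typing import Callable, List, Tuple

Action = str                        # an element of the action set A
Observation = str                   # an element of the observation set O
ActionVector = Tuple[Action, ...]   # a_t = (a_t^1, ..., a_t^n), one action per agent

# A history h_t: observations interleaved with action vectors, ending on o_t.
History = List                      # e.g. [o_1, a_1, o_2, a_2, ..., o_t]

Policy = Callable[[History], Action]   # pi: H -> A (deterministic)
Reward = Callable[[History], float]    # R: H -> reals

def discounted_return(rewards: List[float], gamma: float = 1.0) -> float:
    """Total reward sum_t gamma^t R(h_t) accumulated over one episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```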

Indifference-specific notations (and notions)

In the world there is a set of key background (Boolean) variables $B$. These are the levers, or buttons, or bits, that determine what each agent’s reward function is supposed to be.

In the car example above, these could be the remote brakes that humans use to take control of the cars when they start skidding.

Now the values of these variables are given by elements of $2^B$ (the powerset of $B$). How these correspond to the reward functions of each agent is given by a map $\rho: 2^B \to \mathcal{R}^n$:

  • If the variables have value $v$ on turn $t$, then agent $i$ has reward function $\rho(v)_i$, which is the $i$-th component of $\rho(v)$.

These are assumed to be always observed by all the agents; in fact $v_t$ is part of $o_t$, the $t$-th observation. So writing $v_t \in h_t$ means that, on turn $t$ in history $h_t$, the key variables were set to $v_t$.

In the car example, the value of $v_t$ could tell us which cars are currently being remotely braked by the humans. When being braked, a car’s “win” reward function is replaced with a flat zero reward function, so that it has no incentive to try and resist the humans’ actions.
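
As an illustration of the map $\rho$ in the car example, here is a hypothetical sketch. The variable names (`brake_A`, `brake_B`) and the “flat zero reward while braked” rule come from the informal description above; the win-detection is a stand-in of ours.

```python
from typing import Callable, FrozenSet, List, Tuple

History = List
Reward = Callable[[History], float]

# Key background Boolean variables B: which cars are being remotely braked.
B = frozenset({"brake_A", "brake_B"})

def zero_reward(history: History) -> float:
    """Flat zero reward, used while a car is being braked."""
    return 0.0

def win_reward(car: str) -> Reward:
    """1 if this car has won by the end of the history, else 0 (stand-in check)."""
    def reward(history: History) -> float:
        return 1.0 if history and history[-1] == f"{car}_wins" else 0.0
    return reward

def rho(v: FrozenSet[str]) -> Tuple[Reward, Reward]:
    """rho: 2^B -> R^n, mapping the value v of the key variables to the
    vector of reward functions (one per car)."""
    r_A = zero_reward if "brake_A" in v else win_reward("A")
    r_B = zero_reward if "brake_B" in v else win_reward("B")
    return (r_A, r_B)

# Example: while car B is being braked, its reward function is flat zero.
r_A, r_B = rho(frozenset({"brake_B"}))
assert r_B(["A_wins"]) == 0.0 and r_A(["A_wins"]) == 1.0
```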

Optimality conditions

We can now define the optimal policies for each agent, for a specific collection of reward functions:

  • $Opt: \mathcal{R}^n \to \Pi^n$.

This maps each vector of reward functions $\boldsymbol{R} = (R_1, \ldots, R_n)$ to the vector of policies $Opt(\boldsymbol{R}) = (Opt(\boldsymbol{R})_1, \ldots, Opt(\boldsymbol{R})_n)$. The policy $Opt(\boldsymbol{R})_i$ is assumed to be optimal for the expected reward of the reward function $R_i$.

What do we mean by “optimal” here, especially in the multi-agent setting? Well, this notion of optimality can be defined in many different ways: Nash equilibrium, superrationality, bounded rationality, satisficing, various different notions of counterfactuals, and so on. All that is required is that, given the notion of optimality, the $P$, and the $\boldsymbol{R}$, each agent is capable of computing $Opt(\boldsymbol{R})$, and cannot improve the expectation of its $R_i$ by changing policies (within the given notion of optimality).

Note that we ask $Opt$ to define the optimal policies for the agent, ignoring the actual policies they do take. This includes an implicit definition of counterfactual (“if we say that action $a$ is optimal for $h_t$, but action $a' \neq a$ is what the agent actually takes, what do we mean by optimal?”), and there are many subtle issues with indifference and various definitions of counterfactuals.
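
One way to picture the role of $Opt$ is as a black box that takes a vector of reward functions and returns one policy per agent. A minimal interface sketch (the names and the type signature are our own framing, not a fixed API):

```python
from typing import Callable, List, Sequence, Tuple

History = List
Action = str
Policy = Callable[[History], Action]
Reward = Callable[[History], float]

# Opt: R^n -> Pi^n. Whatever notion of optimality is used (Nash equilibrium,
# superrationality, satisficing, ...), all we rely on is that:
#   - every agent can compute Opt(R) from P and R, and
#   - agent i cannot improve the expectation of R_i by deviating from Opt(R)_i.
OptMap = Callable[[Sequence[Reward]], Tuple[Policy, ...]]
```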

Indifferent policies, and indifferent optimal agents

Indifferent policies

We’re finally ready to give a definition of the policies of indifferent agents.

The agents are indifferent to changes of reward functions, relative to $Opt$, if for all histories $h_t$ (with $v_t \in h_t$), their policies $\pi_i$ are such that, for $1 \le i \le n$:

  • $\pi_i(h_t) = Opt(\rho(v_t))_i(h_t)$.

To unpack this, $v_t$ is the value of the key background variables on turn $t$ in $h_t$. So $\rho(v_t)$ is the vector of reward functions that the agents should have at turn $t$, given $v_t$. Hence $Opt(\rho(v_t))$ is the optimal policy vector for maximising the reward functions $\rho(v_t)$.

So that equation is saying that each agent follows the optimal policy for the current estimate of the reward functions.
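
Putting the pieces together, an indifferent agent’s policy can be sketched as a thin wrapper: read the current value $v_t$ out of the history, apply $\rho$, and defer to $Opt$. This is a minimal sketch; the helper `read_v` (extracting $v_t$ from the latest observation) and the container types are our own illustrative choices.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

History = List
Action = str
Policy = Callable[[History], Action]
Reward = Callable[[History], float]
OptMap = Callable[[Sequence[Reward]], Tuple[Policy, ...]]

def indifferent_policy(
    i: int,                                             # which agent
    rho: Callable[[FrozenSet[str]], Sequence[Reward]],  # rho: 2^B -> R^n
    opt: OptMap,                                        # the Opt map
    read_v: Callable[[History], FrozenSet[str]],        # extracts v_t from h_t
) -> Policy:
    """pi_i(h_t) = Opt(rho(v_t))_i(h_t): at every history, follow the policy
    that is optimal for the reward functions the key variables currently pick out."""
    def pi(history: History) -> Action:
        v_t = read_v(history)             # key variables are observed, v_t in h_t
        return opt(rho(v_t))[i](history)  # defer to the Opt-optimal policy for rho(v_t)
    return pi
```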

Indifferent optimal agents

The above defines the policies of these indifferent agents, but what are they actually optimising? Well, to define this, we need to make a few extra assumptions on the reward functions: essentially, we need to be able to define their expectations. So, either by assuming that there will be a finite number of turns, or that $\mathcal{R}$ is restricted to bounded reward functions (and a suitable discount rate $\gamma$ is chosen), assume that:

  • For all $R \in \mathcal{R}$, all histories $h_t$, and all vectors of policies $\boldsymbol{\pi} \in \Pi^n$, the expectation $\mathbb{E}\left[\sum_{s > t} \gamma^{s-t} R(h_s) \,\middle|\, h_t, \boldsymbol{\pi}\right]$ is defined and finite.

For discounted rewards and infinite timelines, we’ll set $\gamma < 1$, while non-discounted episodic settings will have $\gamma = 1$.

Then we can define the reward function $R^\rho_i$, which is the delayed local reward function: it is equal to $\rho(v_{t-1})_i$ at every timestep $t$, i.e. to the reward function picked out by the key variables on the previous turn. Thus:

  • $R^\rho_i(h_t) = \rho(v_{t-1})_i(h_t)$, where $v_{t-1} \in h_{t-1}$.

We further need to define the “corrective rewards”, $C_i$. This is a function from histories, but it is dependent on the expectation operator:

  • $C_i(h_{t+1}) = \mathbb{E}\left[\sum_{s > t+1} \gamma^{s-t-1} \rho(v_t)_i(h_s) \,\middle|\, h_{t+1}, Opt(\rho(v_t))\right] - \mathbb{E}\left[\sum_{s > t+1} \gamma^{s-t-1} \rho(v_{t+1})_i(h_s) \,\middle|\, h_{t+1}, Opt(\rho(v_{t+1}))\right]$, for $v_t \in h_t$ and $v_{t+1} \in h_{t+1}$.

As above, in those conditional bars “$\mid h_{t+1}, Opt(\cdot)$”, a whole theory of counterfactuals is encoded. Then:

  • If the agents are indifferent as defined above, then their policies are optimal for the “pseudo reward functions” $R^\rho_i + C_i$. The expected future reward for this, given history $h_t$, is $\mathbb{E}\left[\sum_{s > t} \gamma^{s-t} \rho(v_t)_i(h_s) \,\middle|\, h_t, Opt(\rho(v_t))\right]$: the expected reward for the reward functions given by $\rho(v_t)$, assuming optimal behaviour.

For this result to make sense, we need to extend the concept of optimality to pseudo reward functions, which should be fine for most definitions of optimality. We’ll need an extra lemma on how optimality works, which is presented in this footnote[3]. Given that lemma, for a proof in the episodic case, see this footnote[4].
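
The pseudo reward $R^\rho_i + C_i$ can be sketched as follows. This is a minimal sketch mirroring the definitions above: the value oracle `value`, standing in for $\mathbb{E}\big[\sum_{s>t}\gamma^{s-t}R_i(h_s) \mid h_t, Opt(\boldsymbol{R})\big]$, and all the names are assumptions of ours rather than a given implementation.

```python
from typing import Callable, FrozenSet, List, Sequence

History = List
Reward = Callable[[History], float]
Rho = Callable[[FrozenSet[str]], Sequence[Reward]]

# Assumed oracle: E[ sum_{s>t} gamma^{s-t} R_i(h_s) | h_t, Opt(R) ].
ValueOracle = Callable[[Sequence[Reward], History, int], float]

def pseudo_reward(
    i: int,
    rho: Rho,
    read_v: Callable[[History], FrozenSet[str]],  # v_t from h_t (key variables)
    value: ValueOracle,
    prev_history: History,   # h_t
    history: History,        # h_{t+1}
) -> float:
    """R^rho_i(h_{t+1}) + C_i(h_{t+1}): the delayed local reward, scored by the
    reward function in force on the previous turn, plus the corrective term that
    cancels any change in expected future value caused by the switch v_t -> v_{t+1}."""
    v_prev, v_now = read_v(prev_history), read_v(history)
    delayed_local = rho(v_prev)[i](history)          # rho(v_t)_i(h_{t+1})
    corrective = (value(rho(v_prev), history, i)     # old rewards, old Opt policies
                  - value(rho(v_now), history, i))   # new rewards, new Opt policies
    return delayed_local + corrective
```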

To go back to the car example, consider $A$ considering whether to force $B$ onto the ice. Their usual reward functions give them $1$ if they win the race, and $0$ if they lose. Car $A$ currently has a chance $p$ of winning; if it forces $B$ onto the ice (and thus gets $B$ to be remotely braked), it will win with certainty. Thus it will go from an expected reward of $p$ to an expected reward of $1$.

However, the correction term exactly corrects for this, for both cars: $A$ will go from an expected reward of $p$ to an expected reward of $1$ plus a corrective reward of $p - 1$, no change. While $B$ will go from an expected reward of $1 - p$ to an expected reward of $0$ plus a corrective reward of $1 - p$, again no change overall. So neither car will attempt to force either itself or the other onto, or off, the ice.
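
A quick numeric check of that bookkeeping, with an assumed (purely illustrative) chance $p = 3/5$ of $A$ winning before any forcing:

```python
from fractions import Fraction

# Assumed chance p that A wins before any forcing (purely illustrative).
p = Fraction(3, 5)
exp_A_before, exp_B_before = p, 1 - p

# After A forces B onto the ice: A wins for sure, B is braked onto the zero reward.
exp_A_after, exp_B_after = Fraction(1), Fraction(0)

# Corrective rewards cancel the change in expected future value for each car.
corr_A = exp_A_before - exp_A_after   # p - 1
corr_B = exp_B_before - exp_B_after   # 1 - p

assert exp_A_after + corr_A == exp_A_before   # A: no net change
assert exp_B_after + corr_B == exp_B_before   # B: no net change
```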

More general conditions

We can easily generalise to each agent having a separate set of possible actions $\mathcal{A}_i$. There’s also no problem with having stochastic policies rather than deterministic ones; in that case, $\pi_i(h_t) = Opt(\rho(v_t))_i(h_t)$ is not an equality of actions, but an equality of probability distributions over actions.

As long as the $v_t$ are fully observed, we can easily generalise to each agent having a separate set of observations $\mathcal{O}_i$, rather than a shared set of observations. In that case, all that we need to do is to define the notion of optimality so that $Opt(\boldsymbol{R})_i$ is optimal for $R_i$ for agent $i$ on its personal history $h^i_t$, where optimality is relative to the agent’s estimates as to what the other agents’ policies might be (it might estimate this, possibly, by estimating the other agents’ own personal histories $h^j_t$, $j \neq i$).

It’s a little bit more complicated if the $v_t$ are not fully observed. In that case, agent $i$ can compute its “expected reward function” $R^E_i$, which is simply:

  • $R^E_i(h^i_t) = \sum_{v \in 2^B} P\left(v_t = v \mid h^i_t\right) \rho(v)_i(h^i_t)$.

The value of $R^E_i$ on a history is the expected value of the true $\rho(v_t)_i$ on that history, so optimising the first is the same as optimising the second.

Then $i$ will attempt to optimise $R^E_i$ (together with the corresponding corrective rewards), using $h^i_t$, as above, to estimate the policies of the other agents (it might estimate this, possibly, by estimating the other agents’ own personal histories and expected reward functions $h^j_t$ and $R^E_j$, $j \neq i$).
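
A sketch of this expected reward function when the key variables are not fully observed; the posterior `belief`, standing in for $P(v_t = v \mid h^i_t)$, is an assumed input, as are the names:

```python
from typing import Callable, Dict, FrozenSet, List, Sequence

History = List
Reward = Callable[[History], float]
Rho = Callable[[FrozenSet[str]], Sequence[Reward]]

def expected_reward_function(
    i: int,
    rho: Rho,
    belief: Callable[[History], Dict[FrozenSet[str], float]],  # P(v_t = v | h^i_t)
) -> Reward:
    """R^E_i(h) = sum_v P(v_t = v | h) * rho(v)_i(h): the posterior-weighted
    mixture of the reward functions the key variables could be specifying."""
    def r_e(history: History) -> float:
        return sum(prob * rho(v)[i](history)
                   for v, prob in belief(history).items())
    return r_e
```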

The agents can also freely use their own probability estimates $P_i$, rather than the joint $P$; in that case, it is important that the pseudo reward $R^\rho_i + C_i$ is defined using the $P_i$: the agent must be indifferent according to its own probability estimates.


  1. There are different conventions on whether the history should start with an observation (the MDP/POMDP convention) or an action (the AIXI convention). This article works with either convention, though it implicitly uses the AIXI convention in indexing (in that observation $o_t$ is followed by the action vector $\boldsymbol{a}_t$). ↩︎

  2. In order for this expression to always make sense, we have to add some extra assumptions, such as bounded rewards with a discount factor $\gamma < 1$, or episodic settings. ↩︎

  3. Lemma 1: Let $\boldsymbol{R} = (R_1, \ldots, R_n)$ and $\boldsymbol{R}' = (R'_1, \ldots, R'_n)$ be two vectors of (pseudo) reward functions, and let $h_t$ be a history. Assume that for all $i$, all action vectors $\boldsymbol{a}$, and all observations $o$ (so that $h_{t+1} = h_t \boldsymbol{a} o$), we have:

    • $\mathbb{E}\left[\sum_{s > t} \gamma^{s-t} R_i(h_s) \,\middle|\, h_{t+1}, Opt(\boldsymbol{R})\right] = \mathbb{E}\left[\sum_{s > t} \gamma^{s-t} R'_i(h_s) \,\middle|\, h_{t+1}, Opt(\boldsymbol{R}')\right]$.

    In other words, the expected rewards of $\boldsymbol{R}$ and $\boldsymbol{R}'$ are the same, for whatever actions and observations follow immediately, and assuming the agents are then optimal.

    Then the lemma asserts that, in this case, the optimal actions for $\boldsymbol{R}$ and $\boldsymbol{R}'$ are the same on $h_t$. So for all $i$:

    • $Opt(\boldsymbol{R})_i(h_t) = Opt(\boldsymbol{R}')_i(h_t)$.

    ↩︎ ↩︎
  4. The $\gamma$ is $1$, so it will be ignored. Assume the episode is of length $T$. We’ll prove this result by reverse induction: proving it first for $t = T$, and then working backwards to smaller $t$.

    Let $v_t \in h_t$. If $t = T$, then $h_T$ is a history of maximal length, so there are no longer histories. So expressions like $\mathbb{E}\left[\sum_{s > T} \rho(v)_i(h_s) \mid h_T, \boldsymbol{\pi}\right]$ are automatically zero, thus $C_i(h_T) = 0$ and $R^\rho_i(h_T) + C_i(h_T) = \rho(v_{T-1})_i(h_T)$.

    Hence, when faced with a history $h_{T-1}$, the optimal action is to choose an action to optimise the expectation of $\rho(v_{T-1})_i(h_T)$. Thus, by definition, the optimal choice for all the agents is to pick $Opt(\rho(v_{T-1}))(h_{T-1})$. The expected rewards for those actions are $\mathbb{E}\left[\rho(v_{T-1})_i(h_T) \mid h_{T-1}, Opt(\rho(v_{T-1}))\right]$.

    So now assume that for general $t+1$ and $h_{t+1}$, the expected reward for $R^\rho_i + C_i$ is $\mathbb{E}\left[\sum_{s > t+1} \rho(v_{t+1})_i(h_s) \mid h_{t+1}, Opt(\rho(v_{t+1}))\right]$, and the optimal actions are given by $Opt(\rho(v_{t+1}))(h_{t+1})$.

    Now consider $t$ and $h_t$. By the induction assumption, the future expectation of $R^\rho_i + C_i$ from $h_{t+1}$ onwards, given optimal behaviour, is $\mathbb{E}\left[\sum_{s > t+1} \rho(v_{t+1})_i(h_s) \mid h_{t+1}, Opt(\rho(v_{t+1}))\right]$; this cancels with the second term of $C_i(h_{t+1})$, so the total expectation from $h_t$ onwards simplifies to the expectation of $\sum_{s > t} \rho(v_t)_i(h_s)$, which is just $\mathbb{E}\left[\sum_{s > t} \rho(v_t)_i(h_s) \mid h_t, Opt(\rho(v_t))\right]$ under optimal behaviour.

    Therefore, given the assumptions of Lemma 1[3:1] about optimality, in order to optimise reward at $h_t$, the agents should choose the actions given by $Opt(\rho(v_t))(h_t)$.

    This completes the induction step, and hence the policies that optimise $R^\rho_i + C_i$ are the $\pi_i$ with $\pi_i(h_t) = Opt(\rho(v_t))_i(h_t)$, and the expected reward for any agent $i$, given history $h_t$, is $\mathbb{E}\left[\sum_{s > t} \rho(v_t)_i(h_s) \mid h_t, Opt(\rho(v_t))\right]$: the expected reward if the reward functions were never to change again. ↩︎