Formalism developed with Henrik Åslund.

The what and why of multiple indifference

There are certain agent designs where the agent can move smoothly from optimising one utility/reward function to optimising another. The agent doesn’t object to the change, nor does it attempt to make it happen. It is indifferent to the change, because its reward is balanced to be precisely the same whether the change happens or not. Originally, this was set up for a single change of utility/reward function at a single moment in time.

Here, we will present a formalism with multiple agents, all of whom are indifferent not only to changes in their own reward functions, but to changes in the reward functions of any of the other agents. We’ll also generalise it to accommodate multiple changes of reward functions—maybe a different change at each timestep.

In order for these definitions to work, we will need to define some notions of counterfactuals, and some notions of optimality for several agents optimising their own separate reward functions.

Example: self-driving car races

It’s clear why we would need to generalise to each timestep: it’s important to be able to safely change an agent’s goals more than once, as we are unlikely to get the goals perfectly right in only two tries. It’s also important to generalise to multiple agents, as agents can be motivated to push humans to interrupt or not interrupt other agents. Indeed, this would be expected for a reward-maximising agent, and could lead to dangerous behaviour.

Consider the following example, adapted from this paper. Two self-driving cars/agents, imaginatively called $C a r^{1}$ and $C a r^{2}$ , are training for an important race later in the month.

The cars’ reward functions are mainly competitive: the faster one wins. Now, the important race will take place on a tropical race-track, but the pair are training on a cold one, where ice sometimes forms. If one of the cars starts skidding, its controller takes over and remotely brakes that car, carefully, and applies indifference to that car’s reward. The controllers do this because they don’t want the car to learn driving techniques that avoid ice: these techniques would be counter-productive on the “test distribution”, i.e. the real race-track they will be competing on.

But, after a while, $C a r^{2}$ hits on a sneaky tactic: forcing $C a r^{1}$ onto an ice patch. In that case, $C a r^{1}$ will get interrupted and slowed down. Now, $C a r^{1}$ ’s reward will not be reduced, because of indifference, so it won’t try to avoid the ice. But $C a r^{2}$ ’s reward will be increased, since it will likely win the race after its rival is slowed. After a while, $C a r^{1}$ learns the same tactic in reverse.

So, though each car is personally indifferent to being forced onto the ice, they each learn that it is good for them to force the other one onto the ice and force an interruption.

Notation and setup

Standard notation

Assume there is a set $\{A g^{i}\}_{1 \leq i \leq n}$ of $n$ agents, each interacting with the world in a turn-based manner. Let $O$ be the set of all observations, and $A$ the set of all actions.

Each turn $t$ , all agents make the same observation $o_{t} \in O$ , and each agent $A g^{i}$ responds with action $a_{t + 1}^{i} \in A$ . These actions form a vector $\overline{a}_{t + 1} = (a_{t + 1}^{1}, a_{t + 1}^{2}, \dots, a_{t + 1}^{n}) \in A^{n}$ , which is observed by all agents.

A history $h_{t}$ of length $t$ is a sequence of observations and action vectors^[1], ending with the observation $o_{t}$ . Let $H$ be the set of all histories.

A (deterministic) policy $π : H \to A$ is a map from a history to an action. Let $Π$ be the set of all policies.

Let $\mathcal{R}$ be a subset of the set $\{R \mid R : H \to \mathbb{R}\}$ of all reward functions (reward functions being maps from histories to the reals).

An agent with constant reward function $R$ would get reward $R (h_{t})$ at each timestep $t$ (the rewards of standard Markov decision processes are special cases where $R (h_{t})$ is a function of only the last two elements of $h_{t}$ ). Thus, over time, they would accumulate a total reward^[2] of $\sum_{t \geq 1} R (h_{t})$ .

All agents use the same probability function $P$ to model their environment.

Indifference-specific notations (and notions)

In the world there is a set $B$ of key background (Boolean) variables. These are the levers, or buttons, or bits, that determine what each agent’s reward function is supposed to be.

In the car example above, these could be the remote brakes that humans use to take control of the cars when they start skidding.

The values of these variables are given by elements $b$ of $2^{B}$ (the powerset of $B$ ). How these correspond to the reward functions of each agent is given by a map $r : 2^{B} \to \mathcal{R}^{n}$ :

If the variables have value $b_{t} \in 2^{B}$ on turn $t$ , then agent $A g^{i}$ has reward function $r (b_{t})^{i}$ , which is the $i$ -th component of $r (b_{t}) = (r (b_{t})^{1}, r (b_{t})^{2}, \dots r (b_{t})^{n})$ .

These $b_{t}$ are assumed to be always observed by all the agents; in fact $b_{t} \subset o_{t}$ , the $t$ -th observation. So writing $b_{t} \in h_{t^{'}}$ means that, on turn $t$ in history $h_{t^{'}}$ , the key variables were set to $b_{t}$ .

In the car example, the value of $b_{t}$ could tell us which cars are currently being remotely braked by the humans. When being braked, the car’s “win” reward function is replaced with a flat zero reward function, so that they have no incentive to try and resist the human’s actions.
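As a minimal sketch of the map $r$ for the car example: the encoding of a history as a dictionary with a `winner` field, and the names `win_reward` and `zero_reward`, are illustrative assumptions rather than part of the formalism.

```python
from typing import Callable, FrozenSet, Tuple

# Hypothetical encoding: a history is a dict that records the race winner.
History = dict
Reward = Callable[[History], float]

def win_reward(car: int) -> Reward:
    """Reward 1 if `car` has won the race in this history, else 0."""
    return lambda h: 1.0 if h.get("winner") == car else 0.0

def zero_reward(h: History) -> float:
    """Flat zero reward, used while a car is remotely braked."""
    return 0.0

def r(b: FrozenSet[int]) -> Tuple[Reward, ...]:
    """Map the set b of braked cars (an element of 2^B) to the reward vector."""
    return tuple(zero_reward if car in b else win_reward(car) for car in (1, 2))

finished = {"winner": 2}
rewards_no_brake = [R(finished) for R in r(frozenset())]
rewards_car2_braked = [R(finished) for R in r(frozenset({2}))]
```

Here `r(frozenset())` gives both cars their usual "win" reward, while `r(frozenset({2}))` swaps the second car's reward for the flat zero function.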

Optimality conditions

We can now define the optimal policies for each agent, for a specific collection of reward functions:

$π_{*} : \mathcal{R}^{n} \to Π^{n}$ .

This $π_{*}$ maps each vector $\overline{R}$ of reward functions to the vector of policies $\overline{π} = (π_{*}^{1} (\overline{R}), \dots, π_{*}^{n} (\overline{R}))$ . The policy $π_{*}^{i}$ is assumed to be optimal for the expected reward of the reward function $\overline{R}^{i}$ .

What do we mean by “optimal” here, especially in the multi-agent setting? Well, this notion of optimality can be defined in many different ways: Nash equilibrium, superrationality, bounded rationality, satisficing, various different notions of counterfactuals, and so on. All that is required is that, given the notion of optimality, the $\overline{R}$ , and $h \in H$ , each agent $A g^{i}$ is capable of computing $π_{*} (\overline{R})^{i}$ , and cannot improve the expectation of $\overline{R}^{i}$ by changing policies (within the given notion of optimality).

Note that we ask to define the optimal policies for the agents, ignoring the actual policies they do take. This includes an implicit definition of counterfactual (“if we say that action $a$ is optimal for $R$ , but action $a^{'}$ is what the agent actually takes, what do we mean by optimal?”), and there are many subtle issues with indifference and various definitions of counterfactuals.

Indifferent policies, and indifferent optimal agents

Indifferent policies

We’re finally ready to give a definition of the policies of indifferent agents.

The agents $\{A g^{i}\}$ are indifferent to changes of reward functions, relative to $r$ , if for all histories $h_{t}$ , their policies $\overline{π}_{ind}^{i}$ are such that for $b_{t} \in h_{t}$ :

$\overline{π}_{ind}^{i} (h_{t}) = π_{*}^{i} (r (b_{t})) (h_{t})$ .

To unpack this, $b_{t}$ is the value of the key background variables on turn $t$ in $h_{t}$ . So $r (b_{t})$ is the vector of reward functions that the agents should have at turn $t$ , given $h_{t}$ . Hence $π_{*} (r (b_{t}))$ is the optimal policy vector for maximising the reward functions $r (b_{t})$ .

So that equation is saying that each agent follows the optimal policy for the current estimate of the reward functions.
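The definition can be sketched as a lookup: read $b_{t}$ off the history, apply $r$ , ask $π_{*}$ for the optimal policy vector, and take agent $i$ ’s component. In this sketch `current_b`, the toy `pi_star`, and the action names `"race"` and `"coast"` are hypothetical stand-ins; the formalism leaves the notion of optimality open.

```python
from typing import Callable, FrozenSet, Sequence, Tuple

History = Tuple[FrozenSet[str], ...]  # toy encoding: one b_t per turn
Policy = Callable[[History], str]

def indifferent_action(
    i: int,
    history: History,
    current_b: Callable[[History], FrozenSet[str]],
    r: Callable[[FrozenSet[str]], tuple],
    pi_star: Callable[[tuple], Sequence[Policy]],
) -> str:
    """Return pi_ind^i(h_t) = pi_*^i(r(b_t))(h_t)."""
    b_t = current_b(history)            # key variables on the latest turn
    reward_vector = r(b_t)              # rewards the agents should have now
    policies = pi_star(reward_vector)   # optimal policy vector for those rewards
    return policies[i](history)

# Toy instantiation (all assumptions): rewards are labelled by name only.
def current_b(h: History) -> FrozenSet[str]:
    return h[-1]

def r(b: FrozenSet[str]) -> tuple:
    return tuple("zero" if car in b else "win" for car in ("car1", "car2"))

def pi_star(rewards: tuple) -> Sequence[Policy]:
    # A braked car (zero reward) coasts; a car with the win reward races.
    return [(lambda h, R=R: "coast" if R == "zero" else "race") for R in rewards]

h = (frozenset(), frozenset({"car1"}))  # car1 is braked on the latest turn
actions = [indifferent_action(i, h, current_b, r, pi_star) for i in range(2)]
```

The point of the sketch is that the policy depends only on the current $b_{t}$ , not on how likely the history makes a future change of reward function.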

Indifferent optimal agents

The above defines the policy of these indifferent agents, but what are they actually optimising? To define this, we need a few extra assumptions on the reward functions; essentially, we need to be able to define their expectations. So, either by assuming that there will be a finite number of turns, or that $\mathcal{R}$ is restricted to bounded reward functions (and a suitable discount rate is chosen), assume that:

For all $R \in \mathcal{R}$ , all histories $h_{t} \in H$ and all vectors of policies $\overline{π} \in Π^{n}$ , the expectation $E [R \mid h_{t}, \overline{π}] = \sum_{t' > t, \, h_{t'} \in H_{t'}} P (h_{t'} \mid h_{t}, \overline{π}) \, γ^{t' - (t + 1)} R (h_{t'})$ is defined and finite, where $H_{t'}$ is the set of histories of length $t'$ .

For discounted rewards and infinite timelines, we’ll set $0 \leq γ < 1$ , while non-discounted episodic settings will have $γ = 1$ .
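For a finite set of possible continuations, this expectation can be computed by grouping the sum by full trajectory rather than by history length. The sketch below assumes each continuation is given as a probability together with the list of rewards $R (h_{t + 1}), R (h_{t + 2}), \dots$ along it; that input format is an illustrative choice, not part of the formalism.

```python
from typing import List, Sequence, Tuple

def expected_discounted_reward(
    trajectories: Sequence[Tuple[float, List[float]]], gamma: float
) -> float:
    """Sum over continuations of P(continuation) * sum_k gamma^k * R_k.

    The k-th reward on a continuation is R(h_{t+1+k}), so it is weighted
    by gamma^k = gamma^(t' - (t + 1)) with t' = t + 1 + k.
    """
    return sum(
        p * sum(gamma ** k * reward for k, reward in enumerate(rewards))
        for p, rewards in trajectories
    )

# Two equally likely two-step continuations.
trajectories = [(0.5, [1.0, 0.0]), (0.5, [0.0, 2.0])]
value = expected_discounted_reward(trajectories, gamma=0.5)
```

With these numbers the first continuation contributes $0.5 \cdot 1$ and the second $0.5 \cdot (0.5 \cdot 2)$ , for a total of $1$ .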

Then we can define the reward function $\hat{R}^{i}$ , which is the delayed local reward function: equal to $r (b_{t - 1})^{i}$ at every timestep $t$ . Thus:

$\hat{R}^{i} (h_{t}) = r (b_{t - 1})^{i} (h_{t})$ .

We further need to define the “corrective rewards”, $C : H \to \mathbb{R}^{n}$ . This is a function of histories, but is dependent on the expectation operator:

$C^{i} (h_{t}) = E [r (b_{t - 1})^{i} \mid h_{t}, π_{*} (r (b_{t - 1}))] - E [r (b_{t})^{i} \mid h_{t}, π_{*} (r (b_{t}))]$ .

As above, in those conditional bars $∣$ , a whole theory of counterfactuals is encoded. Then:

If the agents are indifferent as defined above, then their policies $\overline{π}_{ind}^{i}$ are optimal for the “pseudo reward functions” $\hat{R} + γ C$ . The expected future reward for this, given history $h_{t}$ , is the expected reward for the reward functions given by $r (b_{t})$ and assuming optimal behaviour.

For this result to make sense, we need to extend the concept of optimality to pseudo reward functions, which should be fine for most definitions of optimality. We’ll need an extra lemma on how optimality works, which is presented in this footnote^[3]. Given that lemma, for a proof in the episodic case, see this footnote^[4].

To go back to the car example, consider $C a r^{2}$ deciding whether to force $C a r^{1}$ onto the ice. The cars’ usual reward functions give them $1$ if they win the race, and $0$ if they lose. $C a r^{2}$ currently has a $40\%$ chance of winning; if it forces $C a r^{1}$ onto the ice (and thus gets $C a r^{1}$ remotely braked), it will win with certainty. Thus it will go from an expected reward of $0.4$ to an expected reward of $1$ .

However, the correction term $C$ will be $(- 0 + 0.6, - 1 + 0.4) = (0.6, - 0.6)$ . This exactly corrects for both cars: $C a r^{1}$ will go from an expected reward of $0.6$ to an expected reward of $0$ plus a corrective reward of $0.6$ : no change. $C a r^{2}$ will go from an expected reward of $0.4$ to an expected reward of $1$ plus a corrective reward of $- 0.6$ : again, no change overall. So neither car will attempt to force either itself or the other onto or off the ice.
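The arithmetic of this example can be checked with a short sketch. It takes the expected rewards quoted in the text as given rather than deriving them from a model of the race, and the function name `corrective_reward` is illustrative.

```python
def corrective_reward(expected_old: float, expected_new: float) -> float:
    """C^i = E[old reward under old optimal play] - E[new reward under new optimal play]."""
    return expected_old - expected_new

# Before the interruption: win probabilities for (Car^1, Car^2).
expected_old = (0.6, 0.4)
# After Car^1 is braked: Car^1 has the flat zero reward, Car^2 wins for certain.
expected_new = (0.0, 1.0)

C = tuple(corrective_reward(o, n) for o, n in zip(expected_old, expected_new))
# Expected new reward plus corrective reward, per car.
totals = tuple(n + c for n, c in zip(expected_new, C))
```

Each entry of `totals` matches the corresponding entry of `expected_old` (up to floating-point rounding), which is the sense in which forcing the interruption gains $C a r^{2}$ nothing.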

More general conditions

We can easily generalise to each agent having a separate set of possible actions. There’s also no problem with having stochastic policies rather than deterministic ones; in that case, $\overline{π}_{ind}^{i} (h_{t}) = π_{*}^{i} (r (b_{t})) (h_{t})$ is not an equality of actions, but an equality of probability distributions over actions.

As long as the $b_{t}$ are fully observed, we can easily generalise to each agent having a separate set of observations, rather than a shared set of observations. In that case, all that we need to do is to define the notion of optimality so that $π_{*}^{i} (\overline{R}) (h_{t}^{i})$ is optimal for $\overline{R}^{i}$ for agent $A g^{i}$ , where $h_{t}^{i}$ is the agent’s personal history, and optimality is relative to the agent’s estimates as to what the other agents’ policies might be (it might estimate this, possibly, by estimating the other agents’ own personal histories $h_{t}^{j}$ , $j \neq i$ ).

It’s a little bit more complicated if the $b_{t}$ are not fully observed. In that case, the agent $A g^{i}$ can compute its “expected reward function”, which is simply:

$E R^{i} (h_{t}^{i}) = \sum_{b \in 2^{B}} r (b)^{i} P (b_{t} = b \mid h_{t}^{i})$ .

The value of $E R^{i}$ on a history is the expected value of $r (b_{t})^{i}$ on that history, so optimising the first is the same as optimising the second.
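A minimal sketch of this expected reward function, assuming the agent’s posterior over $b_{t}$ is available as a dictionary from values of $b$ to probabilities. The toy map `r`, the car names, and the history encoding are hypothetical.

```python
from typing import Callable, Dict, FrozenSet

def expected_reward(
    i: int,
    history: dict,
    posterior: Dict[FrozenSet[str], float],
    r: Callable[[FrozenSet[str]], tuple],
) -> float:
    """ER^i(h) = sum over b of P(b_t = b | h) * r(b)^i(h)."""
    return sum(p * r(b)[i](history) for b, p in posterior.items())

# Toy reward map: braked cars get zero reward, the others get 1 for winning.
def r(b: FrozenSet[str]) -> tuple:
    def win(car: str):
        return lambda h: 1.0 if h["winner"] == car else 0.0
    zero = lambda h: 0.0
    return tuple(zero if car in b else win(car) for car in ("car1", "car2"))

# car1 believes there is a 25% chance that it is currently being braked.
posterior = {frozenset(): 0.75, frozenset({"car1"}): 0.25}
value = expected_reward(0, {"winner": "car1"}, posterior, r)
```

In this toy case the expected reward for a win is $0.75 \cdot 1 + 0.25 \cdot 0 = 0.75$ .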

Then $A g^{i}$ will attempt to optimise $E R^{i}$ , using $h_{t}^{i}$ to, as above, estimate the policies of the other agents (it might estimate this, possibly, by estimating the other agents’ own personal histories $h_{t}^{j}$ and expected reward functions $E R^{j}$ , $j \neq i$ ).

The agents can also freely use their own probability estimates $P^{i}$ , rather than the joint $P$ ; in that case, it is important that the corrective reward $C^{i}$ is defined using $P^{i}$ : each agent must be indifferent under its own probability estimates.

[1] There are different conventions on whether the history should start with an observation (the MDP/POMDP convention) or an action (the AIXI convention). This article works with either convention, though it implicitly uses the AIXI convention in indexing (in that observation $o_{t}$ is followed by action $a_{t + 1}$ ).
[2] In order for this expression to always make sense, we have to add some extra assumptions, such as bounded rewards with a discount factor, or episodic settings.
[3] Lemma 1: Let $\overline{R}$ and $\overline{Q}$ be two vectors of (pseudo) reward functions, and let $h_{t}$ be a history. Assume that for all $i$ , all action vectors $\overline{a}_{t + 1}$ , and all observations $o_{t + 1}$ , we have:
- $\overline{R}^{i} (h_{t} \overline{a}_{t + 1} o_{t + 1}) + γ E [\overline{R}^{i} \mid h_{t} \overline{a}_{t + 1} o_{t + 1}, π_{*} (\overline{R})] = \overline{Q}^{i} (h_{t} \overline{a}_{t + 1} o_{t + 1}) + γ E [\overline{Q}^{i} \mid h_{t} \overline{a}_{t + 1} o_{t + 1}, π_{*} (\overline{Q})]$ .
In other words, the expected rewards of $\overline{R}$ and $\overline{Q}$ are the same, for whatever actions and observations follow immediately, and assuming the agents are then optimal.

Then the lemma asserts that, in this case, the optimal actions for $\overline{R}$ and $\overline{Q}$ are the same on $h_{t}$ . So for all $i$ :
- $π_{*} (\overline{R})^{i} (h_{t}) = π_{*} (\overline{Q})^{i} (h_{t})$ .
[4] The $γ$ is $1$ , so will be ignored. Assume the episode is of length $l$ . We’ll prove this result by reverse induction: proving it for $l - m$ and increasing $m$ .

Let $t = l - m$ . If $m = 0$ , then $h_{t}$ is a history of maximal length, so there are no longer histories. So expressions like $E [R \mid h_{t}, \overline{π}]$ are automatically zero, thus $C (h_{t}) = 0$ and $(\hat{R}^{i} + C^{i}) (h_{t}) = r (b_{t - 1})^{i} (h_{t})$ .

Hence, when faced with a history $h_{t - 1}$ , the optimal action is to choose an action $a_{t}^{i}$ to optimise the expectation of $r (b_{t - 1})^{i}$ . Thus, by definition, the optimal choice for all the agents is to pick $π_{*} (r (b_{t - 1})) (h_{t - 1})$ . The expected rewards for those actions are $E [r (b_{t - 1})^{i} (h_{t}) ∣ h_{t - 1}, π_{*} (r (b_{t - 1}))]$ .

So now assume that for general $m$ and $t = l - m$ , the expected reward for $A g^{i}$ is $E [r (b_{t - 1})^{i} (h_{t}) ∣ h_{t - 1}, π_{*} (r (b_{t - 1}))]$ , and the optimal actions are given by $π_{*} (r (b_{t - 1})) (h_{t - 1})$ .

Now consider $m + 1$ and $h_{t - 2}$ . By the induction assumption, the future discounted expectation of $\hat{R}^{i} + C^{i}$ , given optimal behaviour, simplifies to the expectation of $\hat{R}^{i} (h_{t - 1}) + E [r (b_{t - 2})^{i} \mid h_{t - 1}, π_{*} (r (b_{t - 2}))]$ , which is just $E [r (b_{t - 2})^{i} \mid h_{t - 2}, π_{*} (r (b_{t - 2}))]$ .

Therefore, given the assumptions of Lemma 1^[3] about optimality, in order to optimise reward at $h_{t - 2}$ , the agents should choose actions given by $π_{*} (r (b_{t - 2})) (h_{t - 2})$ .

This completes the induction step, and hence, the policies that optimise $\hat{R} + C$ are $(\overline{π}_{ind}^{1}, \dots, \overline{π}_{ind}^{n})$ , and the expected reward for any agent $A g^{i}$ , given history $h_{t}$ , is $E [r (b_{t})^{i} \mid h_{t}, π_{*} (r (b_{t}))]$ : the expected reward if the reward functions were never to change again.

Indifference: multiple changes, multiple agents