## Comparing reward learning/reward tampering formalisms

## Contrasting formalisms

Here I’ll contrast the approach we’re using in Pitfalls of Learning a Reward Online (summarised here) with that used by Tom Everitt and Marcus Hutter in the conceptually similar Reward Tampering Problems and Solutions in Reinforcement Learning. In the following, histories $h_i$ are sequences of actions $a$ and observations $o$; thus $h_i = a_1 o_1 a_2 o_2 \ldots a_i o_i$. The agent’s policy is given by $\pi$, the environment is given by $\mu$.

Then the causal graph for the “Pitfalls” approach is, in plate notation (which basically means that the graph inside the rectangle is repeated for every value of $j$ from $1$ to $n$):

Here $R$ is the set of reward functions (mapping “complete” histories $h_n$ of length $n$ to real numbers), $\rho$ tells you which reward function is correct, conditional on complete histories, and $r$ is the final reward.

In order to move to the reward tampering formalism, we’ll have to generalise $R$ and $\rho$, just a bit. We’ll allow $R$ to take partial histories ($h_j$ shorter than $h_n$) and return a reward. Similarly, we’ll generalise $\rho$ to a conditional distribution on $R$, conditional on all histories $h_j$, not just on complete histories.
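As a rough illustration of this generalisation, here is a minimal Python sketch in which both the reward functions and the learning process accept histories of any length. Everything specific in it is an assumption made for the example and does not come from either paper: the names `count_left`, `count_right`, and `rho`, and the "one vote per observation" rule that `rho` uses.

```python
from typing import Callable, Dict, Tuple

# A history h_j is a tuple of (action, observation) pairs; j may be less than n.
History = Tuple[Tuple[str, str], ...]
RewardFunction = Callable[[History], float]


def count_left(h: History) -> float:
    """Hypothetical reward function: number of 'left' actions so far."""
    return float(sum(1 for action, _ in h if action == "left"))


def count_right(h: History) -> float:
    """Hypothetical reward function: number of 'right' actions so far."""
    return float(sum(1 for action, _ in h if action == "right"))


# The set R of candidate reward functions, each defined on partial histories.
REWARD_FUNCTIONS: Dict[str, RewardFunction] = {
    "count_left": count_left,
    "count_right": count_right,
}


def rho(h: History) -> Dict[str, float]:
    """Toy generalised rho: a distribution over R given any history h_j.

    Assumed rule: each observation 'L' is a vote for count_left and each
    'R' a vote for count_right, on top of one prior vote apiece.
    """
    votes_left = 1 + sum(1 for _, obs in h if obs == "L")
    votes_right = 1 + sum(1 for _, obs in h if obs == "R")
    total = votes_left + votes_right
    return {"count_left": votes_left / total, "count_right": votes_right / total}


partial = (("left", "L"), ("left", "R"), ("right", "L"))  # a partial history h_3
print(rho(()))       # distribution over R before anything is observed
print(rho(partial))  # distribution over R after a partial history
print({name: f(partial) for name, f in REWARD_FUNCTIONS.items()})
```

The only point of the sketch is that neither `rho` nor the reward functions need to wait for a complete history before returning something.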

This leads to the following graph:

This graph is now general enough to include the reward tampering formalism.

## States, data, and actions

In the reward tampering formalism, “observations” ($o_j$) decompose into two pieces: states ($S_j$) and data ($D_j$). The idea is that data informs you about the reward function, while states get put into the reward function to get the actual reward.

So we can model this with the following causal graph (adapted from graph 10b, page 22; this is a slight generalisation, as I haven’t assumed Markovian conditions):

Inside the rectangle, the histories split into data ($D_{1:j}$), states ($S_{1:j}$), and actions ($a_{1:j}$). The reward function is defined by the data only, while the reward comes from this reward function and from the states only; actions don’t directly affect these (though they can indirectly affect them by deciding what states and data come up, of course). Note that in the reward tampering paper, the authors don’t distinguish explicitly between $R_j$ and $r_j$, but they seem to do so implicitly.

Finally, $\Theta_R^*$ is the “user’s reward function”, which the agent is estimating via $D_{1:j}$; this connects to the data only.

Almost all of the probability distributions at each node are “natural” ones that are easy to understand. For example, there are arrows into $r_j$ (the reward) from $R_j$ (the reward function) and $S_{1:j}$ (the states history); the “conditional distribution” of $r_j$ is just “apply $R_j$ to $S_{1:j}$”. The environment, action, and history naturally provide the next observations (state and data).
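To make the state/data split concrete, the following is a small Python sketch of the "natural" nodes: observations are split into states and data, a reward function is chosen from the data alone, and the reward is that function applied to the states alone. The `Observation` class, the function names, and the "most frequently named state" rule are all assumptions for illustration; in particular, the rule is a deterministic stand-in for what is in general a conditional distribution over reward functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

RewardFunction = Callable[[Sequence[str]], float]


@dataclass(frozen=True)
class Observation:
    state: str  # S_j: what the reward function is evaluated on
    data: str   # D_j: what informs the choice of reward function


def split_history(observations: Sequence[Observation]) -> Tuple[List[str], List[str]]:
    """Split o_{1:j} into the state history S_{1:j} and the data history D_{1:j}."""
    states = [o.state for o in observations]
    data = [o.data for o in observations]
    return states, data


def reward_function_from_data(data: Sequence[str]) -> RewardFunction:
    """Hypothetical rule for R_j: reward the state named most often in the data.

    Ties are broken alphabetically; `data` is assumed to be non-empty.
    """
    target = max(sorted(set(data)), key=data.count)
    return lambda states: float(sum(1 for s in states if s == target))


def reward(observations: Sequence[Observation]) -> float:
    """r_j is just R_j applied to S_{1:j}; actions never enter directly."""
    states, data = split_history(observations)
    r_func = reward_function_from_data(data)
    return r_func(states)


history = [
    Observation(state="clean", data="clean"),
    Observation(state="messy", data="clean"),
    Observation(state="clean", data="messy"),
]
print(reward(history))
```

Note that `reward` never looks at actions: they can matter only through which states and data show up in `history`.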

Two arrows point to more complicated relations: the arrow from $\Theta_R^*$ to $D_j$, and that from $D_{1:j}$ to $R$. The two are related; the data $D_j$ is supposed to tell us about the user’s true reward function, while this information informs the choice of $R$.

But the fact that the nodes and the probability distributions have been “designed” this way doesn’t affect the agent. It has a fixed process $P_{rt}(R \mid D_{1:j})$ for estimating $R$ from $D_{1:j}$ ($P_{rt}$ stands for the probability function for the reward tampering formalism). It has access to $a_j$, $D_j$, and $S_j$ (and their histories) as well as its own policy, but has no direct access to $\mu$ or $\Theta_R^*$.

In fact, from the agent’s perspective, $\Theta_R^*$ is essentially part of $\mu$, the environment, though focusing on the $D_j$ only.

## States and actions in “Pitfalls” formalism

Now, can we put this into the “Pitfalls” formalism? It seems we can, like so:

All conditional probability distributions in this graph are natural.

This graph looks very similar to the “reward tampering” one, with the exception of $\rho_j$ and $\Theta_R^*$, pointing at $R_j$ and $D_j$ respectively.

In fact, $\rho_j$ plays the role of $P_{rt}(R \mid D_{1:j})$ in that, where $P_{lp}$ is the probability distribution for the learning process,

$$P_{lp}(R \mid D_{1:j}, \rho_j) = P_{rt}(R \mid D_{1:j}).$$

Note that $P_{lp}$ in that expression is natural and simple, while $P_{rt}$ is complex; essentially $P_{rt}$ carries the same information as $\rho_j$.
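One way to picture this correspondence is the hedged Python sketch below: on the reward tampering side, the estimation logic lives inside `p_rt`, while on the learning process side the node `rho_j` holds that logic and `p_lp` merely applies it. The two-hypothesis Bayesian model with an 0.8 reliability figure is invented for the example; neither paper specifies it.

```python
from typing import Callable, Dict, Sequence

Distribution = Dict[str, float]  # reward-function name -> probability
Estimator = Callable[[Sequence[str]], Distribution]


def p_rt(data: Sequence[str]) -> Distribution:
    """Reward tampering side: P_rt(R | D_{1:j}) does the estimation itself.

    Assumed model: uniform prior over two hypothetical reward functions,
    and each datum names the true reward's preferred state with
    probability 0.8.
    """
    weights = {"R_clean": 1.0, "R_messy": 1.0}
    for d in data:
        weights["R_clean"] *= 0.8 if d == "clean" else 0.2
        weights["R_messy"] *= 0.8 if d == "messy" else 0.2
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def p_lp(data: Sequence[str], rho_j: Estimator) -> Distribution:
    """Learning process side: P_lp(R | D_{1:j}, rho_j) just applies rho_j."""
    return rho_j(data)


# Under the correspondence, the node rho_j carries the whole estimator.
rho_j: Estimator = p_rt

data = ["clean", "clean", "messy"]
print(p_lp(data, rho_j))
print(p_lp(data, rho_j) == p_rt(data))
```

All the complexity sits in whichever object is playing the role of the estimator; `p_lp` itself is a one-line "apply" step, which is the sense in which it is natural and simple.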

The environment $\mu_{lp}$ of the learning process plays the same role as the combined $\mu_{rt}$ and $\Theta_R^*$ from the reward tampering formalism.
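A toy way to see how two objects on one side can be folded into a single environment on the other is sketched below in Python. The deterministic dynamics, the `tidy` and `tamper` actions, and the names `data_channel` and `make_mu_lp` are all hypothetical; real environments in both formalisms are stochastic and depend on the full history.

```python
from typing import Callable, Sequence, Tuple

Observation = Tuple[str, str]  # (state S_j, data D_j)


def mu_rt(actions: Sequence[str]) -> str:
    """Hypothetical state dynamics: the room is clean iff the last action was 'tidy'."""
    return "clean" if actions and actions[-1] == "tidy" else "messy"


def data_channel(theta_star: str, actions: Sequence[str]) -> str:
    """Hypothetical data source: reports the user's preferred state theta_star,
    unless the agent's last action tampered with the channel."""
    if actions and actions[-1] == "tamper":
        return "tampered"
    return theta_star


def make_mu_lp(theta_star: str) -> Callable[[Sequence[str]], Observation]:
    """Fold theta_star into the environment, giving a single mu_lp."""
    def mu_lp(actions: Sequence[str]) -> Observation:
        return mu_rt(actions), data_channel(theta_star, actions)
    return mu_lp


mu_lp = make_mu_lp("clean")
print(mu_lp(["wait", "tidy"]))    # state and data after tidying
print(mu_lp(["tidy", "tamper"]))  # state and data after tampering
```

From the agent's side only `mu_lp` is visible: it returns a full observation, and the user's preference is hidden inside the closure.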

So the isomorphism between the two approaches is, informally speaking:

- On reward functions conditional on histories, $P_{rt} \leftrightarrow \rho$.
- $\mu_{lp} \leftrightarrow (\mu_{rt}, \Theta_R^*)$.

## Uninfluenceable similarities

If we make the processes uninfluenceable (a concept that exists for both formalisms), the causal graphs look even more similar:

Here the pair $(\mu_{lp}, \eta)$, for the learning process, plays exactly the same role as the pair[^1] $(\mu_{rt}, \Theta_R^*)$, for reward tampering: determining reward functions and observations.

[^1]: There is an equivalence between the pairs, but not between the individual elements; thus $\mu_{lp}$ carries more information than $\mu_{rt}$, while $\eta$ carries less information than $\Theta_R^*$.