## Probabilities, weights, sums: pretty much the same for reward functions

This is a more minor post that I’m putting up to reference in other posts.

## Probabilities, weights, and expectations

You’re an agent, with potential uncertainty over your reward function. You know you have to maximise

$$0.5R_1 + 0.5R_2$$

where $R_1$ and $R_2$ are reward functions. What do you do?

Well, how do we interpret the 0.5s? Are they probabilities for which reward function is right? Or are they weights, telling you the relative importance of each one? Well, in fact:

If you won’t be learning any more information to help you distinguish between reward functions, then weights and probabilities play the same role.

Thus, if you don’t expect to learn any more reward-function-relevant information, maximising expected reward given $P(R_1) = P(R_2) = 0.5$ is the same as maximising the single reward function $R_3 = 0.5R_1 + 0.5R_2$.
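A minimal numerical sketch of this claim, assuming a one-shot choice among three actions with made-up reward values (the names `actions`, `belief` and the numbers are illustrative only, not from the post). Reading the 0.5s as probabilities or as weights gives every action the same value, and so the same choice.

```python
# Illustrative, made-up rewards for a one-shot choice among three actions.
R1 = {"a": 1.0, "b": 0.0, "c": 0.6}
R2 = {"a": 0.0, "b": 0.9, "c": 0.6}
actions = list(R1)

# Reading 1: the 0.5s are probabilities over which reward function is correct.
belief = [(0.5, R1), (0.5, R2)]

def expected_reward(action):
    """Expected reward of an action under uncertainty about the reward function."""
    return sum(p * R[action] for p, R in belief)

# Reading 2: the 0.5s are weights defining one combined reward function R3.
R3 = {a: 0.5 * R1[a] + 0.5 * R2[a] for a in actions}

for a in actions:
    # With no further learning, both readings give every action the same value...
    assert abs(expected_reward(a) - R3[a]) < 1e-12

# ...so they also recommend the same action.
print(max(actions, key=expected_reward), max(actions, key=R3.get))  # c c
```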

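So, if we denote probabilities in bold, maximising any of the following (given no reward-function learning) is equivalent:

- $\mathbf{0.5}R_1 + \mathbf{0.5}R_2$
- $\mathbf{1}(0.5R_1 + 0.5R_2)$
- $\mathbf{0.25}R_1 + \mathbf{0.25}R_2 + \mathbf{0.5}(0.5R_1 + 0.5R_2)$
- $\mathbf{0.5}(1.5R_1 - 0.5R_2) + \mathbf{0.5}(1.5R_2 - 0.5R_1)$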
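A small exact-arithmetic sketch of why these four are interchangeable. As an assumption for illustration, each reward function is represented by its weight vector over the basis $(R_1, R_2)$, each expression is a list of (probability, reward function) pairs with the outer coefficients read as probabilities, and `combo` and `expectation` are hypothetical helper names. Turning every probability into a weight collapses all four to $0.5R_1 + 0.5R_2$.

```python
from fractions import Fraction as F

# Assumption: a reward function is a weight vector over the basis (R1, R2).
R1 = (F(1), F(0))
R2 = (F(0), F(1))

def combo(*terms):
    """Weighted sum of reward functions: combo((w1, Ra), (w2, Rb), ...)."""
    return tuple(sum(w * R[i] for w, R in terms) for i in range(2))

def expectation(dist):
    """Rewrite every probability as a weight: E(p_R) = sum of p * R."""
    assert sum(p for p, _ in dist) == 1, "probabilities must sum to 1"
    return combo(*dist)

# The single reward function 0.5R1 + 0.5R2.
half_half = combo((F(1, 2), R1), (F(1, 2), R2))

distributions = [
    [(F(1, 2), R1), (F(1, 2), R2)],
    [(F(1), half_half)],
    [(F(1, 4), R1), (F(1, 4), R2), (F(1, 2), half_half)],
    [(F(1, 2), combo((F(3, 2), R1), (F(-1, 2), R2))),
     (F(1, 2), combo((F(3, 2), R2), (F(-1, 2), R1)))],
]

for d in distributions:
    print(expectation(d))  # (Fraction(1, 2), Fraction(1, 2)) each time
```

Exact fractions are used so the equality checks are not disturbed by floating-point rounding.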

Now, given a probability distribution $p_R$ over reward functions, we can take its expectation $\mathbb{E}(p_R)$. You can define this by talking about affine spaces and so on, but the simple version is: to take an expectation, rewrite every probability as a weight. So the result becomes:

If you won’t be learning any more information to help you distinguish between reward functions, then distributions with the same expectation are equivalent.

## Expected evidence and unriggability

We’ve defined an unriggable learning process as one that respects conservation of expected evidence.

Now, conservation of expected evidence is about expectations. It basically says that, if $\pi_1$ and $\pi_2$ are two policies the agent could take, then for the probability distribution $p_R$,

$$\mathbb{E}(p_R \mid \pi_1) = \mathbb{E}(p_R \mid \pi_2).$$
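A minimal sketch of this condition as a check, assuming reward functions are weight vectors over $(R_1, R_2)$ and that $p_R \mid \pi$ is given as a list of (probability, reward function) pairs per policy. The helper `is_unriggable` and both example processes are hypothetical, with made-up numbers: the first has different posteriors but equal expectations, the second lets one policy push the expectation towards $R_1$.

```python
from fractions import Fraction as F

# Assumption: reward functions are weight vectors over the basis (R1, R2);
# a distribution p_R is a list of (probability, reward_vector) pairs.
R1, R2 = (F(1), F(0)), (F(0), F(1))

def expectation(dist):
    """E(p_R): treat each probability as a weight and add up the reward functions."""
    return tuple(sum(p * R[i] for p, R in dist) for i in range(2))

def is_unriggable(dist_given_policy):
    """E(p_R | pi) must be the same for every policy pi."""
    expectations = {expectation(d) for d in dist_given_policy.values()}
    return len(expectations) == 1

half_half = (F(1, 2), F(1, 2))  # the single reward function 0.5R1 + 0.5R2

# Hypothetical policies: different posteriors, same expectation.
unriggable = {"pi1": [(F(1, 2), R1), (F(1, 2), R2)],
              "pi2": [(F(1), half_half)]}
# Hypothetical policies where pi2 pushes the learning process towards R1.
riggable = {"pi1": [(F(1, 2), R1), (F(1, 2), R2)],
            "pi2": [(F(9, 10), R1), (F(1, 10), R2)]}

print(is_unriggable(unriggable), is_unriggable(riggable))  # True False
```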

Suppose that $p_R$ is in fact riggable, and that we wanted to “correct” it to make it unriggable. Then we would want to add a correction term for any policy $\pi$. If we took $\pi_0$ as a “default” policy, we could add a correction term to $p_R \mid \pi$:

$$(p_R \mid \pi) \to (p_R \mid \pi) - \mathbb{E}(p_R \mid \pi) + \mathbb{E}(p_R \mid \pi_0).$$

This would have the required unriggability properties. But how do you add to a probability distribution—and how do you subtract from it?

But recall that unriggability only cares about expectations, and expectations treat probabilities as weights. Adding weighted reward functions is perfectly fine. Generally there will be multiple ways of doing this, mixing probabilities and weights.
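A minimal sketch of one such way, assuming reward functions are weight vectors over $(R_1, R_2)$, distributions are lists of (probability, reward function) pairs, and a made-up riggable process. The hypothetical function `correct` keeps the probabilities of $p_R \mid \pi$ and adds the same weighted reward function $\mathbb{E}(p_R \mid \pi_0) - \mathbb{E}(p_R \mid \pi)$ to every reward function in its support. Since the probabilities sum to 1, this shifts the expectation by exactly that amount.

```python
from fractions import Fraction as F

# Assumption: reward functions are weight vectors over the basis (R1, R2);
# a distribution is a list of (probability, reward_vector) pairs.
R1, R2 = (F(1), F(0)), (F(0), F(1))

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def scale(c, v):
    return tuple(c * a for a in v)

def expectation(dist):
    total = (F(0), F(0))
    for p, R in dist:
        total = add(total, scale(p, R))
    return total

def correct(dist_pi, dist_default):
    """Map p_R|pi to a distribution whose expectation is E(p_R|pi0).

    One of many valid choices: keep the probabilities and add the same
    correction E(p_R|pi0) - E(p_R|pi) to every reward function.
    """
    shift = add(expectation(dist_default), scale(F(-1), expectation(dist_pi)))
    return [(p, add(R, shift)) for p, R in dist_pi]

# Hypothetical riggable process: the default policy pi0 and another policy pi
# lead to posteriors with different expectations.
dist_default = [(F(1, 2), R1), (F(1, 2), R2)]
dist_pi = [(F(9, 10), R1), (F(1, 10), R2)]

corrected = correct(dist_pi, dist_default)
print(expectation(dist_pi), expectation(corrected), expectation(dist_default))
```

Shifting each reward function, rather than changing the probabilities, avoids ever needing to subtract from a probability distribution.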

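For example, if $(p_R \mid \pi) = \mathbf{0.5}R_1 + \mathbf{0.5}R_2$ and $(p_R \mid \pi_0) = \mathbf{0.75}(R_1 - R_2) + \mathbf{0.25}R_2$, then we can map $(p_R \mid \pi)$ to any of the following, all of which have expectation $\mathbb{E}(p_R \mid \pi_0) = 0.75R_1 - 0.5R_2$:

- $\mathbf{1}(0.75R_1 - 0.5R_2)$,
- $\mathbf{0.75}(R_1 - R_2) + \mathbf{0.25}R_2$,
- $\mathbf{0.5}(R_1 + R) + \mathbf{0.5}(R_2 + R)$ with $R = 0.25R_1 - R_2$,
- $\mathbf{0.75}R_1 + \mathbf{0.25}(-2R_2)$,
- and many other options...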
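A quick exact-arithmetic check that each of these options lands on $\mathbb{E}(p_R \mid \pi_0)$. It assumes reward functions are weight vectors over $(R_1, R_2)$ and that the outer coefficients are probabilities while the inner ones are weights; `lin` and `expectation` are hypothetical helper names. It also recomputes the correction term $R = \mathbb{E}(p_R \mid \pi_0) - \mathbb{E}(p_R \mid \pi)$, which should come out as $0.25R_1 - R_2$.

```python
from fractions import Fraction as F

# Assumption: reward functions are weight vectors over (R1, R2); outer
# coefficients are probabilities, inner coefficients are weights.
R1, R2 = (F(1), F(0)), (F(0), F(1))

def lin(*terms):
    """Weighted sum of reward functions, e.g. lin((1, R1), (-1, R2))."""
    return tuple(sum(w * R[i] for w, R in terms) for i in range(2))

def expectation(dist):
    assert sum(p for p, _ in dist) == 1
    return lin(*dist)

dist_pi = [(F(1, 2), R1), (F(1, 2), R2)]
dist_pi0 = [(F(3, 4), lin((1, R1), (-1, R2))), (F(1, 4), R2)]
target = expectation(dist_pi0)  # 0.75R1 - 0.5R2

# Correction term R = E(p_R|pi0) - E(p_R|pi), i.e. 0.25R1 - R2.
R = lin((1, target), (-1, expectation(dist_pi)))

options = {
    "1(0.75R1 - 0.5R2)": [(F(1), target)],
    "0.75(R1 - R2) + 0.25R2": dist_pi0,
    "0.5(R1 + R) + 0.5(R2 + R)": [(F(1, 2), lin((1, R1), (1, R))),
                                  (F(1, 2), lin((1, R2), (1, R)))],
    "0.75R1 + 0.25(-2R2)": [(F(3, 4), R1), (F(1, 4), lin((-2, R2)))],
}

for name, dist in options.items():
    print(name, expectation(dist) == target)  # True for every option
```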

This multiplicity of possibilities is what I was trying to deal with in my old post about reward function translations.