# Utility versus Reward function: partial equivalence

A re­ward func­tion is defined over past se­quences of ac­tions and ob­ser­va­tions. When the agent chooses an ac­tion, and gets an ob­ser­va­tion, they re­ceive a re­ward that is a func­tion of that ob­ser­va­tion and all pre­vi­ous ob­ser­va­tions and ac­tions.

A util­ity func­tion is defined over states of the world. You can take ac­tions to in­crease or de­crease the prob­a­bil­ity of cer­tain states, thus in­creas­ing ex­pected util­ity, but you don’t ac­tu­ally “re­ceive” any util­ity.

Are these differ­ent ob­jects, or are they ac­tu­ally the same thing? This would be good to know, as most of the back­ground knowl­edge of MIRI and similar AI safety groups is for util­ity func­tions, while re­ward func­tions are preva­lent in re­in­force­ment learn­ing.

The sum­mary of this post is:

• For finite hori­zons, re­ward and util­ity func­tions are equiv­a­lent.

• For in­finite hori­zons, ev­ery bounded dis­counted re­ward func­tion is equiv­a­lent with a bounded util­ity func­tion. But not all bounded util­ity func­tions have a cor­re­spond­ing re­ward func­tion. Even if they do, the re­ward func­tion may not be bounded.

## Formalism

Let be the set of ac­tions an agent can take, and the set of ob­ser­va­tions. As­sume both sets are finite. Let be the set of his­to­ries (se­quences of ob­ser­va­tions and ac­tions) of an agent.

Let be the (pos­si­bly in­finite) set of wor­lds. Note that a world in­cludes the full set of ob­ser­va­tion his­tory for the agent (since the agent is part of the world). There­fore the wor­lds are strat­ified by his­to­ries; for any , there is a sub­set con­sist­ing of all wor­lds with his­tory .

Then a re­ward func­tion is a func­tion from his­to­ries to real num­bers, while a util­ity func­tion is a func­tion from wor­lds to real num­bers:

Re­wards and util­ity func­tions are bounded if their image is in a bounded sub­set of ; with­out loss of gen­er­al­ity, this means there ex­ists an such that the image of (or ) is con­tained in for all (or ).

A policy for an agent is a map from his­to­ries to a prob­a­bil­ity dis­tri­bu­tion over ac­tions; so , for the space of prob­a­bil­ity dis­tri­bu­tions over ac­tions.

Even for a util­ity-based agent, these are the only poli­cies available. This is be­cause all the in­for­ma­tion (apart from the prior) that the agent gets from the out­side world is given by its ob­ser­va­tions.

## The value func­tions

Let be the prob­a­bil­ity es­ti­ma­tor used by the agent. We’ll as­sume for the mo­ment that it’s log­i­cally om­ni­scient and Bayesian. This in­cor­po­rates a prior, by, for ex­am­ple, ap­ply­ing it un­con­di­tion­ally to wor­lds or his­to­ries - and .

Given a his­tory and a policy , we can com­pute the con­di­tional prob­a­bil­ity of a world or a his­tory; des­ig­nate these by and .

Given these prob­a­bil­ities, we can then com­pute ex­pected val­ues. For ex­am­ple, the ex­pected util­ity given and is:

We’ll also des­ig­nate this quan­tity by the ex­pected value func­tion . If is bounded in , then so is this value func­tion (though the con­verse need not be true).

For re­wards, let be the set of his­to­ries that have ac­tions and ob­ser­va­tions. We’ll say that has length . Let be the the first ac­tions and ob­ser­va­tions of the his­tory . If the agent knows it will only make ob­ser­va­tions and ac­tions, then the fu­ture re­ward has an ex­pected value func­tion. For and with a his­tory of length , this is:

If the agent ex­pects it could make ar­bi­trar­ily many ob­ser­va­tions, then given a dis­count func­tion , there is the ex­pected dis­counted fu­ture re­ward of:

## Equiv­alence for finite horizons

As­sume the agent knows it will only make ob­ser­va­tions and ac­tions. The form a par­ti­tion of the pos­si­ble wor­lds in : ev­ery pos­si­ble world is in ex­actly one of those sub­sets.

Then as­sume that is given, and define, for :

Then:

• The­o­rem 1: On his­tory , and differ by a con­stant that is a func­tion of only, not of . Con­se­quently, an -max­imiser and a -max­imiser will choose the same poli­cies.

All the proofs are given in the ap­pendix at the end of the post.

Now, con­versely, as­sume is given, but is not. If , then is in­de­pen­dent of , since the agent will never make any more ac­tions.

Then fix any policy , for a his­tory of length , define the re­ward by:

For , define by:
• The­o­rem 2: On his­tory , and differ by a con­stant that is a func­tion of and only, not of . Con­se­quently, an -max­imiser and a -max­imiser will choose the same poli­cies.

Both of these con­struc­tions are non-unique; for ex­am­ple, could vary across the differ­ent wor­lds of , as long as its ex­pec­ta­tion on that set is the same. And could have differ­ent re­wards added at one step, as long as it is sub­tracted at a later step, or could use a differ­ent .

## (In)Equiv­alence for in­finite horizons

If the agent ex­pects it could make ar­bi­trar­ily many ob­ser­va­tions, then we can still define a good from a given . Let be the pos­si­ble in­finite his­to­ries; then the sets form a par­ti­tion of . Then for any fixed define:

and we will get:

• The­o­rem 3: If is bounded, then is well-define and bounded, and on his­tory , and are re­lated by a pos­i­tive af­fine trans­for­ma­tion that is a func­tion of and only, not of .Con­se­quently, a -dis­counted -max­imiser and a -max­imiser will choose the same poli­cies.

The con­verse is more tricky. Fix any policy as be­fore, and, as be­fore, for a his­tory of length , define the re­ward by:

For , define by:
To make this work, we need to put some con­di­tions on . The con­di­tion that we need is that, even­tu­ally, the fu­ture ac­tions (hence the fu­ture policy) don’t mat­ter much. Then we say that asymp­tot­i­cally ig­nores the fu­ture, if there ex­ists a func­tion , for any , and any poli­cies and ,

and

Then, as be­fore, we can show that:
• The­o­rem 4: If asymp­tot­i­cally ig­nores the fu­ture, then on his­tory , and are re­lated by a pos­i­tive af­fine trans­for­ma­tion that is a func­tion of , , and only, not of . Con­se­quently, a -dis­counted -max­imiser and a -max­imiser will choose the same poli­cies. Even if is bounded, need not be.

## A util­ity counterexample

So, what kind of util­ity func­tion can­not be made into a re­ward func­tion in the above way?

Well, as­sume there are two ac­tions, and , and that is if the agent only chooses , and if it ever choose . Let be the policy that always chooses (all that’s needed, in fact, is that it even­tu­ally chooses with prob­a­bil­ity ).

Then is always zero, as is . Thus is also always zero. And this de­spite the fact that there ex­ists a policy that gets util­ity : namely the “always choose ” policy.

## Does it make a differ­ence in prac­tice?

In or­der to reach the equiv­alences above, the value func­tions need to be ex­actly calcu­lated, mean­ing that the prob­a­bil­ity es­ti­ma­tor needs to be perfect. Thus the equiv­alence is only es­tab­lished for log­i­cally om­ni­scient agents.

In prac­tice, util­ity func­tions are most use­ful when we know the ideal out­come states, and now the challenge is to de­sign an agent that gets to them. Re­ward func­tions are most use­ful when we know the best lo­cal moves for the agent to make, but not nec­es­sar­ily the best out­comes.

## Ap­pendix: Proofs

This gives the proofs of the four the­o­rem above. They will pro­ceed by ex­press­ing the re­la­tion­ship be­tween the two rele­vant value func­tions.

The­o­rem 1:

If , then is con­stant on , so we can talk about (which is ). Then if is of length :

since be­ing non-zero means that the ini­tial ac­tions and ob­ser­va­tions of are the same as those of , and is the same as : the prob­a­bil­ity that we are in a world with is the same as the prob­a­bil­ity of ob­serv­ing .

Be­cause is a func­tion purely of the past, this means that a max­imiser will be­have the same way as a max­imiser.

The­o­rem 2:

If is a his­tory of length , then:

be­cause if and , then , , and .

The­o­rem 3:

If is of length , then

similarly to the proof of The­o­rem 1. Here, we have used the fact that the ac­tions and ob­ser­va­tions be­yond the -th are ir­rele­vant to , in or­der to amalga­mate all that have the same first ac­tions and ob­ser­va­tions, and then in­ter­change the limit with the finite sum over all the his­to­ries in .

The­o­rem 4:

Be­cause asymp­tot­i­cally ig­nores the fu­ture, the term can be rewrit­ten as , a re-writ­ing that in­tro­duces an er­ror of norm at most . Since tends to zero in the limit,

Now we’ll show that this value func­tion need not be bounded. Imag­ine that the agent has two ac­tion, and . The util­ity is an inde weighted sum of the num­ber of times the agent chooses . For , let be a Boolean that is if the agent chose on his­tory at time , and oth­er­wise. Let , and define the util­ity:

Now let be the policy of always choos­ing , and the policy of always choos­ing . Then for any his­tory ,

This means that if , the value of will in­crease with­out limits as gets longer, even though it­self is bounded (by ). If we re­place with , we keep the fact that is bounded, and, since will even­tu­ally be greater that , for all , we can see that will in­crease with­out limits for any value of .

• I’m try­ing to wrap my head around the case where there are two wor­lds, w1 and w2; w2 is bet­ter than w1, but mov­ing from w1 to w2 is bad (ie. kill ev­ery­one and re­plac­ing them with differ­ent peo­ple who are hap­pier, and we think this is bad).

I think for the equiv­alence to work in this case, the util­ity func­tion U also needs to de­pend on your cur­rent state—if it’s the same for all states, then the agent would always pre­fer to move from w1 to w2 and erase it’s mem­ory of the past when max­i­miz­ing the util­ity func­tion, wheras it would act cor­rectly with the re­ward func­tion.

• >erase it’s mem­ory

That only works if the agent is mo­ti­vated by some­thing like “max­imise your be­lief in what the ex­pected value of U is”, rather than “max­imise the ex­pected value of U”. If you’ve got that prob­lem, then the agent is un­sal­vage­able—it could just edit its mem­ory to make it­self be­lieve U is max­imised.

• Say w2a is the world where the agent starts in w2 and w2b is the world that re­sults af­ter the agent moves from w1 to w2.

Without con­sid­er­ing the agent’s mem­ory part of the world, it seems like the prob­lem is worse: the only way to dis­t­in­guish be­tween w2a and w2b is the agent’s mem­ory of past events, so it seems that leav­ing the agent’s mem­ory over the past out of the util­ity func­tion re­quires U(w2a) = U(w2b)

• U could de­pend on the en­tire his­tory of states (rather than on the agent’s mem­ory of that his­tory).

• Ah, mi­s­un­der­stood that, thanks.