I was actually going to write something like this up, but you beat me to it! My idea was pretty similar. The main difference is that in my setting, a utility function only cares about outcomes (rather than intermediate events such as X). Here’s most of what I had written so far (feel free to just skim it, given that it’s mostly the same thing):
I am interested in the question of when we can substitute changes in beliefs for changes in
utility function and vice versa.
Here is a simple setting of utility maximization. There are n possible worlds, 2 actions, and m outcomes.
An agent has some prior p over possible worlds, and some utility function u over outcomes.
The agent can be exposed to different independent experiments. In an experiment, the agent
does not know the possible world ω, but observes it indirectly: it sees an observation O drawn with probability P(O|ω).
Then the agent chooses an action A∈{0,1}. Subsequently, some outcome R will occur
(stochastically) based on the possible world and action, and the agent values this outcome
according to u.
Let M be the “transition difference matrix” defined as
M_ij = P(R=j | ω=i, A=1) − P(R=j | ω=i, A=0).
It is not unreasonable to believe that M can be determined a priori: if possible worlds are
(say) probabilistic Turing machines accepting an action as input and
returning an outcome as output, then indeed M is known a priori.
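As a concrete sketch (all numbers hypothetical), M can be assembled in numpy from the two conditional outcome distributions:

```python
import numpy as np

# Hypothetical example: n = 3 possible worlds, m = 4 outcomes.
# P0[i, j] = P(R = j | world = i, A = 0), and similarly P1 for A = 1.
P0 = np.array([[0.70, 0.10, 0.10, 0.10],
               [0.20, 0.50, 0.20, 0.10],
               [0.10, 0.10, 0.10, 0.70]])
P1 = np.array([[0.10, 0.20, 0.30, 0.40],
               [0.40, 0.30, 0.20, 0.10],
               [0.25, 0.25, 0.25, 0.25]])

# Transition difference matrix M_ij = P(R=j | w=i, A=1) - P(R=j | w=i, A=0).
M = P1 - P0

# Each row of M is a difference of two probability distributions,
# so every row sums to zero.
assert np.allclose(M.sum(axis=1), 0.0)
```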
Let x∗y stand for the elementwise product between vectors x and y,
and let x÷y stand for the elementwise quotient.
Also, define the likelihood l(o) to be a vector with l_i(o) = P(O=o | ω=i). The
agent’s posterior distribution over the possible world after seeing the observation O=o will then be proportional to p∗l(o), by Bayes’ rule.
We can compute the expected utility difference between actions in an experiment as
E[u(R) | O=o, A=1] − E[u(R) | O=o, A=0] ∝ (p∗l(o))ᵀMu
The agent’s behavior across experiments is entirely determined by the function l ↦ (p∗l)ᵀMu.
Let us rewrite this as:
(p∗l)ᵀMu = (diag(p)l)ᵀMu = lᵀdiag(p)Mu = lᵀ(p∗Mu) = (p∗Mu)ᵀl
So in fact, the behavior is determined by the vector p∗Mu.
This vector has
n entries, and entry i is equal to the prior probability of possible world i times
the expected utility difference (between actions 1 and 0) if we are in
possible world i. We could call this vector p∗Mu the policy of (p,u).
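The identity is easy to check numerically; here is a self-contained sketch with randomly generated (hypothetical) ingredients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4  # worlds, outcomes

# Random outcome distributions for each action, one row per world.
P0 = rng.dirichlet(np.ones(m), size=n)
P1 = rng.dirichlet(np.ones(m), size=n)
M = P1 - P0                    # transition difference matrix

p = rng.dirichlet(np.ones(n))  # prior over possible worlds
u = rng.normal(size=m)         # utility over outcomes
l = rng.uniform(size=n)        # likelihood vector, l_i = P(O=o | world i)

policy = p * (M @ u)           # the "policy" vector p * Mu

# The (unnormalized) expected utility difference, computed both ways:
lhs = (p * l) @ (M @ u)        # (p * l)^T M u
rhs = policy @ l               # (p * Mu)^T l
assert np.isclose(lhs, rhs)
```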
Now we can ask two interesting questions:
For what (p,u,M,p′) does there exist u′ such that the policy of (p,u) is the policy of (p′,u′)? That is, can we replace changes in beliefs with changes in utility function?
For what (p,u,M,u′) does there exist p′ such that the policy of (p,u) is the policy of (p′,u′)? That is, can we replace changes in utility function with changes in beliefs?
For question 1:
p′∗Mu′ = p∗Mu ⇔ diag(p′)Mu′ = diag(p)Mu ⇔ Mu′ = diag(p′)⁻¹diag(p)Mu ⇔ Mu′ = diag(p÷p′)Mu ⇔ Mu′ = (Mu)∗(p÷p′)
We can certainly find u′ satisfying this if p′ assigns probability 0 only to
possible worlds to which p also assigns probability 0, and M is right-invertible. That is,
p′ must not have a smaller support than p (so that p÷p′ is well-defined on the support of p), and
the by-action differences in outcome distributions for each possible world (the rows of M)
must be linearly independent, so that the columns of M span Rⁿ. Since each row of M sums to zero,
this requires strictly more outcomes than possible worlds; when there are more possible worlds
than outcomes, M cannot be right-invertible, and u′ exists only when (p÷p′)∗Mu happens to lie
in the column space of M.
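A minimal numerical sketch of this construction (hypothetical numbers, chosen with more outcomes than worlds so that M can have full row rank): solve Mu′ = (p÷p′)∗Mu with the pseudoinverse and check that the two (prior, utility) pairs induce the same policy.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 6  # more outcomes than worlds, so rank-n M is possible

P0 = rng.dirichlet(np.ones(m), size=n)
P1 = rng.dirichlet(np.ones(m), size=n)
M = P1 - P0
assert np.linalg.matrix_rank(M) == n   # M is right-invertible

p = rng.dirichlet(np.ones(n))          # original prior
p_new = rng.dirichlet(np.ones(n))      # new prior (full support)
u = rng.normal(size=m)                 # original utility function

# Solve M u' = (p / p') * (M u) via the pseudoinverse.
target = (p / p_new) * (M @ u)
u_new = np.linalg.pinv(M) @ target

# (p', u') has the same policy as (p, u), so it behaves identically
# across all experiments.
assert np.allclose(p_new * (M @ u_new), p * (M @ u))
```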
For question 2: this is not possible in general. If for some possible world i, (Mu)_i has a different sign from (Mu′)_i (that is, u and u′ recommend different actions in possible world i), then the policies must differ unless p_i = 0. So in general we cannot replace changes in utility function with changes in beliefs. There may be a way to do this by making some possible worlds observationally equivalent, but I haven’t worked through the details.
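The obstruction is easy to see concretely (hypothetical numbers; taking u′ = −u flips the recommended action in every world, so no nonnegative prior can reproduce the policy in any world where it is nonzero):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 4

P0 = rng.dirichlet(np.ones(m), size=n)
P1 = rng.dirichlet(np.ones(m), size=n)
M = P1 - P0

p = np.full(n, 1.0 / n)   # full-support prior
u = rng.normal(size=m)
u_new = -u                # flips the recommended action in every world

policy = p * (M @ u)

# For any prior p', entry i of p' * (M @ u_new) is either zero or has the
# opposite sign of policy[i], so the policies cannot agree wherever
# policy[i] != 0.
i = int(np.argmax(np.abs(policy)))
assert policy[i] * (M @ u_new)[i] < 0
```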