I analyzed a general class of these problems here. Upshot: every optimal UDT solution is also an optimal CDT+SIA solution, but not vice versa.
Thanks, that was an interesting read! Could you guide me a bit through its applicability to my examples? If I understand correctly, the bits where I tried using Bayesian guesses to find an optimal policy were me applying the SIA, yet your theorem didn’t work out—those situations had only one optimum, and it was wrong. Did I actually apply something slightly different, make a mistake, or what?
THIS IS A SIMULATION
YOU ARE INSTANCE #4859
YOU ARE IN EITHER NODE X OR Y
YOU CAN CHOOSE A PROBABILITY TO [Advance] or [Exit]
THE UTILITY IS AS FOLLOWS:
* EXITING AT X IS WORTH 0 AND TERMINATES THE INSTANCE
* ADVANCING AT X REINITIALISES THE SIMULATION AT Y
* EXITING AT Y IS WORTH 4 AND TERMINATES THE INSTANCE
* ADVANCING AT Y IS WORTH 1 AND TERMINATES THE INSTANCE
PLEASE USE THE KEYPAD TO INPUT THE DESIRED PROBABILITY TO MAXIMISE YOUR UTILITY
GOOD LUCK
Set of states: {X, Y, XE, YE, YA}
Set of actions: {Advance, Exit}
Set of observations: {""}
Initial state: X
Transition function for non-terminal states:
t(X, Advance) = Y
t(X, Exit) = XE
t(Y, Advance) = YA
t(Y, Exit) = YE
Terminal states: {XE, YE, YA}
Utility function:
u(XE) = 0
u(YE) = 4
u(YA) = 1
Policy can be parameterized by p = exit probability.
Frequencies for non-terminal states (un-normalized):
$F_p(X) = 1$
$F_p(Y) = 1 - p$
SIA un-normalized probabilities:
$\mathrm{SIA}_p(X \mid \texttt{""}) = 1$
$\mathrm{SIA}_p(Y \mid \texttt{""}) = 1 - p$
Note we have only one possible observation, so SIA un-normalized probabilities match frequencies.
State values for non-terminals:
$V_p(Y) = 4p + (1 - p) = 3p + 1$
$V_p(X) = (1 - p)\,V_p(Y) = (1 - p)(3p + 1) = -3p^2 + 2p + 1$
Q values for non-terminals:
$Q_p(Y, \text{Exit}) = 4$
$Q_p(Y, \text{Advance}) = 1$
$Q_p(X, \text{Exit}) = 0$
$Q_p(X, \text{Advance}) = V_p(Y) = 3p + 1$
For local optimality we compute partial derivatives.
$d_p(\texttt{""}, \text{Exit}, \text{Advance}, Y) = Q_p(Y, \text{Exit}) - Q_p(Y, \text{Advance}) = 3$
$d_p(\texttt{""}, \text{Exit}, \text{Advance}, X) = (1 - p)\,d_p(\texttt{""}, \text{Exit}, \text{Advance}, Y) + Q_p(X, \text{Exit}) - Q_p(X, \text{Advance}) = 3(1 - p) - (3p + 1) = 2 - 6p$
By the post, this can be equivalently written:
$d_p(\texttt{""}, \text{Exit}, \text{Advance}, X) = F_p(X)\big(Q_p(X, \text{Exit}) - Q_p(X, \text{Advance})\big) + F_p(Y)\big(Q_p(Y, \text{Exit}) - Q_p(Y, \text{Advance})\big) = -(3p + 1) + (1 - p)(4 - 1) = -3p - 1 + 3 - 3p = 2 - 6p$
To optimize we set $2 - 6p = 0$, i.e. $p = 1/3$. Remember p is the probability of exit (sorry, it's reversed from your usage!). This matches what you computed as globally optimal.
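A quick numerical sanity check (a minimal Python sketch; the helper names are mine, not from the post):

```python
# Sketch: verify that the policy value V_p(X) = (1 - p)(3p + 1) peaks at p = 1/3,
# the same point where the local-optimality expression 2 - 6p crosses zero.
# (Here p is the exit probability, as above.)

def policy_value(p):
    # V_p(X) = p*0 + (1 - p) * V_p(Y), with V_p(Y) = 4p + (1 - p)*1 = 3p + 1
    return (1 - p) * (3 * p + 1)

def local_optimality_expression(p):
    # F_p(X)(Q_p(X,Exit) - Q_p(X,Advance)) + F_p(Y)(Q_p(Y,Exit) - Q_p(Y,Advance))
    return 1 * (0 - (3 * p + 1)) + (1 - p) * (4 - 1)

grid = [i / 1000 for i in range(1001)]
best_p = max(grid, key=policy_value)
print(best_p)                               # ~0.333
print(policy_value(best_p))                 # ~1.333
print(local_optimality_expression(1 / 3))   # 0.0 (up to float error)
```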
I’m not going to do this for all of the examples… Is there a specific example where you think the theorem fails?
Yeah, I follow that; my point is about the example where I use Bayesian inference. But really I think I've got the difference now: it's simply that you're not normalising the SIA probabilities, and I am. The $2 - p$ denominator is what ruins it. But I'm surprised unnormalised probabilities are actually the right way to go. Is there a mathematical intuition behind it? I'll have to go through your original post in more depth.
If we look at a formula like $F_p(X)\big(Q_p(X, \text{Exit}) - Q_p(X, \text{Advance})\big) + F_p(Y)\big(Q_p(Y, \text{Exit}) - Q_p(Y, \text{Advance})\big) = 0$, scaling it by a positive constant shouldn't make a difference.
The nice thing about un-normalized probabilities is that when they are 0 you don't get a division by zero; the condition just becomes a trivial 0 = 0. This helps specifically in the local optimality condition, handling the case of an observation that is encountered with probability 0 given the policy.
The problem is that the denominator isn't constant: it's a function of p, at least in the version in which your existence depends on your choices. That's the difference I highlighted between my 2nd and 3rd examples; simply adding a dummy node Z makes the denominator constant and fixes the issue. But when the denominator depends on the strategy (which also happens in the example with the notification), it affects the optimum.
I still think it shouldn’t matter...
We should have f(p) / g(p) = 0 in exactly the cases where f(p) = 0, as long as g(p) is always positive.
In the formula
$\forall o \in O,\ a \in A:\ \pi(a \mid o) > 0 \Rightarrow a \in \operatorname{argmax}_{a' \in A} \sum_s \mathrm{SIA}_\pi(s \mid o)\,Q_\pi(s, a')$
we could have multiplied $\mathrm{SIA}_\pi(s \mid o)$ for all $s$ by the same (always-positive) function of $\pi$ and $o$ and gotten the same result.
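Concretely, for the toy model above, here is a minimal sketch (helper names are my own) showing that dividing the un-normalized weights by the $2 - p$ normalizer never changes which action attains the argmax:

```python
# Sketch: at every exit probability p, compare the action preferred under the
# un-normalized SIA weights F_p with the one preferred after dividing by the
# normalizer 2 - p. Scaling both weights by the same positive factor can never
# flip the comparison, so the preferred action is identical everywhere.

def q_values(p):
    # Q_p(s, a) for the two non-terminal states
    return {
        ("X", "Exit"): 0.0,
        ("X", "Advance"): 3 * p + 1,
        ("Y", "Exit"): 4.0,
        ("Y", "Advance"): 1.0,
    }

def preferred_action(p, normalize):
    weights = {"X": 1.0, "Y": 1.0 - p}          # F_p
    if normalize:
        z = sum(weights.values())               # 2 - p
        weights = {s: w / z for s, w in weights.items()}
    q = q_values(p)
    def score(a):
        return sum(weights[s] * q[(s, a)] for s in weights)
    return max(["Exit", "Advance"], key=score)

assert all(
    preferred_action(p, normalize=False) == preferred_action(p, normalize=True)
    for p in [i / 100 for i in range(101)]
)
```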
Ah, I think I see the problem now. You're applying the wrong framework. You have your agent pick the a which maximizes the expected reward (or one of those that do, in case it's not unique, which I assume is what the ∈ operator accounts for here), but that's a pure strategy, not a mixed one. Given that in the original absent-minded driver problem there are essentially no observations, $\mathrm{SIA}_\pi(s \mid o) = F_\pi(s)$ and the resulting action would always be the same, meaning an agent following this framework would always end up with either 0 or 1, depending on the precise values of the rewards.
If you want a mixed strategy you ought to perform a functional optimization of π(a|o). In our case this means optimizing over p, because that's really the only parameter of the policy, which is a simple Bernoulli distribution; in an RL system this could instead be the set θ of all the weights of a neural network. And whether or not you normalize by a denominator that is itself a function of π (and thus of its parameters) obviously makes a ton of difference to where the maximum is.
No, check the post more carefully… π(a|o) in the above condition is a probability. The local optimality condition says that any action taken with nonzero probability must have maximal expected value. Which is standard for Nash equilibrium.
Critically, the agent can mix over strategies when the expected utility from each is equal and maximal. That’s very standard for Nash equilibrium e.g. in the matching pennies game.
But that is not the case for the absent-minded driver. The mix has higher expected utility than either individual pure strategy.
Given a policy you can evaluate the expected utility of any action. This depends on the policy.
In the absent minded driver problem, if the policy is to exit 10% of the time, then the ‘exit’ action has higher expected utility than the ‘advance’ action. Whereas if the policy is to exit 90% of the time, then the ‘advance’ action has higher expected utility.
This is because the policy affects the SIA probabilities and Q values. The higher your exit probability, the more likely you are at node X (and therefore should advance).
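To make that concrete for the toy model above, here is a small sketch (function names are my own) computing the SIA-weighted expected utility of each action as a function of the exit probability:

```python
# Sketch: SIA-weighted expected utility of Exit vs. Advance as a function of the
# policy's exit probability p. Exit looks better for small p, Advance for large p,
# and the two cross exactly at p = 1/3.

def action_values(p):
    f_x, f_y = 1.0, 1.0 - p                      # un-normalized SIA weights
    exit_value = f_x * 0 + f_y * 4               # Q_p(X,Exit)=0, Q_p(Y,Exit)=4
    advance_value = f_x * (3 * p + 1) + f_y * 1  # Q_p(X,Advance)=3p+1, Q_p(Y,Advance)=1
    return exit_value, advance_value

for p in [0.1, 1 / 3, 0.9]:
    exit_value, advance_value = action_values(p)
    print(f"p={p:.3f}  Exit: {exit_value:.3f}  Advance: {advance_value:.3f}")

# p=0.100  Exit: 3.600  Advance: 2.200   -> exit looks better
# p=0.333  Exit: 2.667  Advance: 2.667   -> indifferent
# p=0.900  Exit: 0.400  Advance: 3.800   -> advance looks better
```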
The local optimality condition for a policy is that each action the policy assigns non-zero probability to must have optimal expected utility relative to the policy. It’s reflective like Nash equilibrium.
This is clear from the formula:
$\forall o \in O,\ a \in A:\ \pi(a \mid o) > 0 \Rightarrow a \in \operatorname{argmax}_{a' \in A} \sum_s \mathrm{SIA}_\pi(s \mid o)\,Q_\pi(s, a')$
Note that SIA and Q depend on π. This is the condition for local optimality of π. It is about each action that π assigns non-zero probability to being optimal relative to π.
(That’s the local optimality condition; there’s also global optimality, where utility is directly a function of the policy, and is fairly obvious. The main theorem of the post is: Global optimality implies local optimality.)
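A minimal sketch (my own code, just to illustrate the predicate) checking the condition for a few candidate policies in the example above; only $p = 1/3$ passes:

```python
# Sketch: the local-optimality predicate for the toy problem. A policy (exit
# probability p) passes iff every action it assigns non-zero probability to is
# among the actions maximizing SIA-weighted expected utility relative to p.

def action_values(p):
    f = {"X": 1.0, "Y": 1.0 - p}    # un-normalized SIA weights
    q = {"Exit": {"X": 0.0, "Y": 4.0}, "Advance": {"X": 3 * p + 1, "Y": 1.0}}
    return {a: sum(f[s] * q[a][s] for s in f) for a in q}

def locally_optimal(p, tol=1e-9):
    values = action_values(p)
    best = max(values.values())
    support = [a for a, prob in [("Exit", p), ("Advance", 1 - p)] if prob > 0]
    return all(values[a] >= best - tol for a in support)

for p in [0.0, 0.1, 1 / 3, 0.9, 1.0]:
    print(p, locally_optimal(p))
# Only p = 1/3 prints True: there both actions have equal (maximal) expected
# utility; the pure policies fail because their single action is not the argmax.
```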
I am interpreting that formula as “compute this quantity in the sum and find the a′ from the set of all possible actions A that maximizes it, then do that”. Am I wrong? If that’s the interpretation, then that policy will always produce a pure strategy in the case of the absent-minded driver. I could actually write down all the functions for it since they are essentially simple lookup tables for such a small case.
The policy assigns non-zero probability to both exit and advance. But only one of the two has higher expected utility. Or is your point that the only self-consistent policy is the one where both have equal expected utility, and thus I can in fact choose either? Though then I have to choose according to the probabilities specified in the policy.
Think of it as a predicate on policies. The predicate (local optimality) is true when, for each action the policy assigns non-zero probability to, that action maximizes expected utility relative to the policy.
> I am interpreting that formula as "compute this quantity in the sum and find the a′ from the set of all possible actions A that maximizes it, then do that". Am I wrong?

Yes. It's a predicate on policies. If two different actions (given an observation) maximize expected utility, then either action can be taken. Your description doesn't allow that, because it assumes there is a single a′ that maximizes expected utility. Whereas, with a predicate on policies, we could potentially allow multiple actions.
> Or is your point that the only self-consistent policy is the one where both have equal expected utility, and thus I can in fact choose either? Though then I have to choose according to the probabilities specified in the policy.

Yes, exactly. Look up Nash equilibrium in matching pennies. It's pretty similar. (Except your expected utilities as a function of your action depend on the opponent's actions in matching pennies, and on your own action in the absent-minded driver.)
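A tiny illustration of the parallel (my own sketch; matching pennies uses the standard ±1 payoffs): at the opponent's 50/50 mix both of your actions in matching pennies have equal expected payoff, just as both actions here have equal SIA-weighted expected utility at $p = 1/3$:

```python
# Sketch of the analogy: in matching pennies, at the opponent's 50/50 mix both of
# your actions have equal expected payoff; in the absent-minded driver, at your own
# exit probability p = 1/3 both actions have equal SIA-weighted expected utility.

def pennies_values(q_heads):
    # Matcher's payoff: +1 if the coins match, -1 otherwise, vs. an opponent
    # playing Heads with probability q_heads.
    heads = q_heads * 1 + (1 - q_heads) * (-1)
    tails = q_heads * (-1) + (1 - q_heads) * 1
    return heads, tails

def driver_values(p):
    # SIA-weighted expected utility of Exit / Advance given exit probability p.
    return (1 - p) * 4, (3 * p + 1) + (1 - p) * 1

print(pennies_values(0.5))     # (0.0, 0.0)       -> indifferent at the equilibrium mix
print(driver_values(1 / 3))    # (~2.667, ~2.667) -> indifferent at p = 1/3
```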