Given a policy, you can evaluate the expected utility of any action; the result depends on the policy itself.
In the absent-minded driver problem, if the policy is to exit 10% of the time, then the ‘exit’ action has higher expected utility than the ‘advance’ action; whereas if the policy is to exit 90% of the time, then the ‘advance’ action has higher expected utility.
This is because the policy affects the SIA probabilities and Q values. The higher your exit probability, the more likely you are at node X (and therefore should advance).
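As a concrete check, here is a small Python sketch of these action values. The payoffs (0 for exiting at the first intersection X, 4 for exiting at the second intersection Y, 1 for advancing past both) are the standard ones from the absent-minded driver literature; they are my assumption, since the thread doesn't state them, and the function and variable names are my own:

```python
# Hedged sketch: expected utility of 'exit' vs 'advance' in the
# absent-minded driver problem, relative to a given policy.
# Assumed standard payoffs: exit at X -> 0, exit at Y -> 4, advance past both -> 1.

def action_values(q_exit):
    """Return (EU_exit, EU_advance) relative to the policy 'exit with prob q_exit'."""
    p = 1.0 - q_exit                      # probability of advancing
    # SIA probabilities of being at X vs Y, given the policy:
    # X is always reached; Y is reached with probability p.
    sia_x = 1.0 / (1.0 + p)
    sia_y = p / (1.0 + p)
    # Q values: after the chosen action, continuation follows the policy.
    q_x_exit, q_y_exit = 0.0, 4.0
    q_y_advance = 1.0
    q_x_advance = q_exit * 4.0 + p * 1.0  # advance to Y, then follow the policy
    eu_exit = sia_x * q_x_exit + sia_y * q_y_exit
    eu_advance = sia_x * q_x_advance + sia_y * q_y_advance
    return eu_exit, eu_advance
```

With an exit probability of 10%, `action_values` reports a higher expected utility for ‘exit’; with 90%, ‘advance’ comes out ahead; and at an exit probability of 1/3 the two actions tie (both 1.6).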
The local optimality condition for a policy is that each action the policy assigns non-zero probability to must have optimal expected utility relative to the policy. The condition is self-referential, in the same way Nash equilibrium is.
This is clear from the formula:
$$\forall o \in \mathcal{O},\, a \in \mathcal{A}:\ \pi(a \mid o) > 0 \Rightarrow a \in \operatorname*{argmax}_{a' \in \mathcal{A}} \sum_{s} \mathrm{SIA}_\pi(s \mid o)\, Q_\pi(s, a')$$
Note that SIA and Q depend on π. This is the condition for local optimality of π. It is about each action that π assigns non-zero probability to being optimal relative to π.
(That’s the local optimality condition; there’s also global optimality, where utility is directly a function of the policy, and is fairly obvious. The main theorem of the post is: Global optimality implies local optimality.)
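Under the same assumed standard payoffs (0 / 4 / 1), global optimality is easy to exhibit directly; a sketch, with names of my own choosing:

```python
# Hedged sketch: ex-ante ("global") utility of the policy
# "exit with probability q", under assumed standard payoffs (0 / 4 / 1).
def global_utility(q):
    p = 1.0 - q                             # probability of advancing
    # exit at X (payoff 0), advance then exit at Y (4), or advance twice (1)
    return q * 0.0 + p * q * 4.0 + p * p * 1.0

# Brute-force search for the globally optimal exit probability.
best_q = max((i / 1000 for i in range(1001)), key=global_utility)
# best_q comes out near 1/3; calculus gives exactly q = 1/3, and at that
# policy both actions tie in expected utility, consistent with the theorem.
```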
I am interpreting that formula as “compute this quantity in the sum and find the a′ from the set of all possible actions A that maximizes it, then do that”. Am I wrong? If that’s the interpretation, then that policy will always produce a pure strategy in the case of the absent-minded driver. I could actually write down all the functions for it since they are essentially simple lookup tables for such a small case.
The policy assigns non-zero probability to both exit and advance. But only one of the two has higher expected utility. Or is your point that the only self-consistent policy is the one where both have equal expected utility, and thus I can in fact choose either? Though then I have to choose according to the probabilities specified in the policy.
Think of it as a predicate on policies. The predicate (local optimality) is true when, for each action the policy assigns non-zero probability to, that action maximizes expected utility relative to the policy.
I am interpreting that formula as “compute this quantity in the sum and find the a′ from the set of all possible actions A that maximizes it, then do that”. Am I wrong?
Yes. It’s a predicate on policies. If two different actions (given an observation) maximize expected utility, then either action can be taken. Your description doesn’t allow that, because it assumes there is a single a′ that maximizes expected utility. Whereas, with a predicate on policies, we could potentially allow multiple actions.
Or is your point that the only self-consistent policy is the one where both have equal expected utility, and thus I can in fact choose either? Though then I have to choose according to the probabilities specified in the policy.
Yes, exactly. Look up Nash equilibrium in matching pennies; it’s pretty similar. (Except that in matching pennies your expected utilities as a function of your action depend on the opponent’s actions, while in the absent-minded driver they depend on your own policy.)
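To make the analogy concrete, here is a tiny sketch of matching pennies (my own illustration, with the standard win-1/lose-1 payoffs): against the 50/50 equilibrium mix, both of your actions have equal expected utility, so any mixture over them is a best response.

```python
# Matching pennies: you win 1 if the coins match, lose 1 otherwise
# (standard payoffs; illustrative sketch, names are my own).
def eu_of_action(my_action, opp_heads_prob):
    """Expected utility of playing 'heads' (0) or 'tails' (1) against
    an opponent who plays heads with probability opp_heads_prob."""
    p_match = opp_heads_prob if my_action == 0 else 1.0 - opp_heads_prob
    return p_match * 1.0 + (1.0 - p_match) * (-1.0)

# Against the equilibrium mix (50% heads), both actions tie at 0, so
# mixing is a best response -- like the locally optimal driver policy,
# where 'exit' and 'advance' tie in expected utility.
```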
But that is not the case for the absent-minded driver. The mix has higher expected utility than either individual pure strategy.
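A quick numeric check of this claim, again assuming the standard payoffs (0 for exiting at X, 4 for exiting at Y, 1 for never exiting), since the thread doesn't state them:

```python
# Hedged check with assumed standard payoffs (0 / 4 / 1).
def policy_utility(q_exit):
    p = 1.0 - q_exit                     # probability of advancing
    return p * q_exit * 4.0 + p * p * 1.0

mixed = policy_utility(1.0 / 3.0)      # optimal mix: exit 1/3 of the time
always_exit = policy_utility(1.0)      # pure 'exit'
always_advance = policy_utility(0.0)   # pure 'advance'
# mixed (4/3) beats always_advance (1) and always_exit (0)
```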