Given a policy, you can evaluate the expected utility of any action; the result depends on the policy itself.
In the absent-minded driver problem, if the policy is to exit 10% of the time, then the ‘exit’ action has higher expected utility than the ‘advance’ action; whereas if the policy is to exit 90% of the time, then the ‘advance’ action has higher expected utility.
This is because the policy affects the SIA probabilities and Q values. The higher your exit probability, the more likely you are at node X (and therefore should advance).
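As a concrete check, here is a small Python sketch of these action values. The payoffs (0 for exiting at the first intersection X, 4 for exiting at the second intersection Y, 1 for advancing past both) are the standard ones from the absent-minded driver literature; they are my assumption, since the thread doesn't state them, and the function and variable names are my own:

```python
# Hedged sketch: expected utility of 'exit' vs 'advance' in the
# absent-minded driver problem, relative to a given policy.
# Assumed standard payoffs: exit at X -> 0, exit at Y -> 4, advance past both -> 1.

def action_values(q_exit):
    """Return (EU_exit, EU_advance) relative to the policy 'exit with prob q_exit'."""
    p = 1.0 - q_exit                      # probability of advancing
    # SIA probabilities of being at X vs Y, given the policy:
    # X is always reached; Y is reached with probability p.
    sia_x = 1.0 / (1.0 + p)
    sia_y = p / (1.0 + p)
    # Q values: after the chosen action, continuation follows the policy.
    q_x_exit, q_y_exit = 0.0, 4.0
    q_y_advance = 1.0
    q_x_advance = q_exit * 4.0 + p * 1.0  # advance to Y, then follow the policy
    eu_exit = sia_x * q_x_exit + sia_y * q_y_exit
    eu_advance = sia_x * q_x_advance + sia_y * q_y_advance
    return eu_exit, eu_advance
```

With an exit probability of 10%, `action_values` reports a higher expected utility for ‘exit’; with 90%, ‘advance’ comes out ahead; and at an exit probability of 1/3 the two actions tie (both 1.6).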
The local optimality condition for a policy is that each action the policy assigns non-zero probability to must have optimal expected utility relative to the policy. The condition is self-referential, in the same way Nash equilibrium is.
This is clear from the formula:
$$\forall o \in \mathcal{O},\, a \in \mathcal{A}:\ \pi(a \mid o) > 0 \Rightarrow a \in \operatorname*{argmax}_{a' \in \mathcal{A}} \sum_{s} \mathrm{SIA}_\pi(s \mid o)\, Q_\pi(s, a')$$
Note that SIA and Q depend on π. This is the condition for local optimality of π. It is about each action that π assigns non-zero probability to being optimal relative to π.
(That’s the local optimality condition; there’s also global optimality, where utility is directly a function of the policy, and is fairly obvious. The main theorem of the post is: Global optimality implies local optimality.)
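Under the same assumed standard payoffs (0 / 4 / 1), global optimality is easy to exhibit directly; a sketch, with names of my own choosing:

```python
# Hedged sketch: ex-ante ("global") utility of the policy
# "exit with probability q", under assumed standard payoffs (0 / 4 / 1).
def global_utility(q):
    p = 1.0 - q                             # probability of advancing
    # exit at X (payoff 0), advance then exit at Y (4), or advance twice (1)
    return q * 0.0 + p * q * 4.0 + p * p * 1.0

# Brute-force search for the globally optimal exit probability.
best_q = max((i / 1000 for i in range(1001)), key=global_utility)
# best_q comes out near 1/3; calculus gives exactly q = 1/3, and at that
# policy both actions tie in expected utility, consistent with the theorem.
```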
I am interpreting that formula as “compute this quantity in the sum and find the a′ from the set of all possible actions A that maximizes it, then do that”. Am I wrong? If that’s the interpretation, then that policy will always produce a pure strategy in the case of the absent-minded driver. I could actually write down all the functions for it since they are essentially simple lookup tables for such a small case.
The policy assigns non-zero probability to both exit and advance. But only one of the two has higher expected utility. Or is your point that the only self-consistent policy is the one where both have equal expected utility, and thus I can in fact choose either? Though then I have to choose according to the probabilities specified in the policy.
Think of it as a predicate on policies. The predicate (local optimality) is true when, for each action the policy assigns non-zero probability to, that action maximizes expected utility relative to the policy.
I am interpreting that formula as “compute this quantity in the sum and find the a′ from the set of all possible actions A that maximizes it, then do that”. Am I wrong?
Yes. It’s a predicate on policies. If two different actions (given an observation) maximize expected utility, then either action can be taken. Your description doesn’t allow that, because it assumes there is a single a′ that maximizes expected utility. Whereas, with a predicate on policies, we could potentially allow multiple actions.
Or is your point that the only self-consistent policy is the one where both have equal expected utility, and thus I can in fact choose either? Though then I have to choose according to the probabilities specified in the policy.
Yes, exactly. Look up Nash equilibrium in matching pennies; it’s pretty similar. (Except that in matching pennies your expected utilities as a function of your action depend on the opponent’s actions, while in the absent-minded driver they depend on your own policy.)
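To make the analogy concrete, here is a tiny sketch of matching pennies (my own illustration, with the standard win-1/lose-1 payoffs): against the 50/50 equilibrium mix, both of your actions have equal expected utility, so any mixture over them is a best response.

```python
# Matching pennies: you win 1 if the coins match, lose 1 otherwise
# (standard payoffs; illustrative sketch, names are my own).
def eu_of_action(my_action, opp_heads_prob):
    """Expected utility of playing 'heads' (0) or 'tails' (1) against
    an opponent who plays heads with probability opp_heads_prob."""
    p_match = opp_heads_prob if my_action == 0 else 1.0 - opp_heads_prob
    return p_match * 1.0 + (1.0 - p_match) * (-1.0)

# Against the equilibrium mix (50% heads), both actions tie at 0, so
# mixing is a best response -- like the locally optimal driver policy,
# where 'exit' and 'advance' tie in expected utility.
```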
But that is not the case for the absent-minded driver. The mix has higher expected utility than either individual pure strategy.
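A quick numeric check of this claim, again assuming the standard payoffs (0 for exiting at X, 4 for exiting at Y, 1 for never exiting), since the thread doesn't state them:

```python
# Hedged check with assumed standard payoffs (0 / 4 / 1).
def policy_utility(q_exit):
    p = 1.0 - q_exit                     # probability of advancing
    return p * q_exit * 4.0 + p * p * 1.0

mixed = policy_utility(1.0 / 3.0)      # optimal mix: exit 1/3 of the time
always_exit = policy_utility(1.0)      # pure 'exit'
always_advance = policy_utility(0.0)   # pure 'advance'
# mixed (4/3) beats always_advance (1) and always_exit (0)
```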