In partially observable environments, stochastic policies can be optimal

I always had the informal impression that optimal policies were deterministic (choosing the best option, rather than some mix of options). Of course, this is not the case when facing other agents, but I had assumed it would hold when facing the environment rather than other players.

But stochastic policies can also be needed if the environment is partially observable, at least if the policy is Markov (memoryless). Consider the following POMDP (partially observable Markov decision process):

There are two states, 1a and 1b, and the agent cannot tell which one they’re in. Taking action A in state 1a, or action B in state 1b, gives a reward of -R and keeps the agent in the same state. Taking action B in state 1a, or action A in state 1b, gives a reward of R and moves the agent to the other state.

The two deterministic memoryless policies, always A and always B, earn -R every turn except possibly the first: whichever state the agent starts in, after at most one step it is stuck repeating the -R transition. The stochastic policy 0.5A + 0.5B, by contrast, has an expected return of 0 per turn.
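
To check this concretely, here’s a quick Python simulation (the state labels, the step function, and the choice R = 1 are just my own encoding of the setup above, not part of the original): it estimates the average per-turn reward of always A, always B, and the 0.5A + 0.5B policy.

```python
import random

R = 1.0  # reward magnitude (illustrative value; the post leaves R abstract)

def step(state, action):
    """One step of the two-state POMDP described above.

    A in 1a or B in 1b: reward -R, stay in the same state.
    B in 1a or A in 1b: reward +R, move to the other state.
    """
    if (state, action) in (("1a", "A"), ("1b", "B")):
        return state, -R
    other = "1b" if state == "1a" else "1a"
    return other, R

def average_reward(choose_action, steps=100_000, start="1a"):
    """Average per-turn reward of a memoryless policy: choose_action takes
    no arguments, since the agent cannot observe which state it is in."""
    state, total = start, 0.0
    for _ in range(steps):
        state, reward = step(state, choose_action())
        total += reward
    return total / steps

print("always A   :", average_reward(lambda: "A"))                  # about -R
print("always B   :", average_reward(lambda: "B"))                  # about -R
print("0.5A + 0.5B:", average_reward(lambda: random.choice("AB")))  # about 0
```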

Of course, if the agent can observe the reward, the environment is no longer partially observable (though we can imagine the reward is delayed until later). And the memory-based policy of “alternate A and B” is more effective than the 0.5A + 0.5B policy. Still, that stochastic policy is the best of the memoryless policies available in this POMDP.
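
For comparison, here’s a short sketch (same hypothetical encoding as above, repeated so it runs on its own) of the memory-based “alternate A and B” policy, which collects roughly R per turn after at most one bad first step:

```python
R = 1.0  # same illustrative reward magnitude as in the sketch above
steps = 100_000

state, action, total = "1a", "A", 0.0
for _ in range(steps):
    if (state, action) in (("1a", "A"), ("1b", "B")):
        reward = -R                             # wrong action for this state: stay, lose R
    else:
        reward = R                              # right action: collect R and switch state
        state = "1b" if state == "1a" else "1a"
    total += reward
    action = "B" if action == "A" else "A"      # alternate, ignoring observations entirely

print("alternate A and B:", total / steps)      # about +R
```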