Anders_H comments on In partially observable environments, stochastic policies can be optimal

Anders_H 19 Jul 2016 16:51 UTC
0 points
Since we are discussing sequential decisions, deterministic strategies are not limited to “A” and “B”: you can also choose a deterministic sequence, such as alternating between A and B. The expected value of this strategy is 0, which is equal to that of the random strategy.
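
To make the comparison concrete, here is a minimal sketch in a toy setup that I am assuming purely for illustration; it is not necessarily the environment from the post. The hidden state is either A or B and is never observed, and an action earns +1 if it matches the hidden state and -1 otherwise. The names `reward`, `average_reward` and `worst_case` are hypothetical helpers. Expected values are computed exactly rather than by sampling.

```python
from fractions import Fraction

ACTIONS = ("A", "B")

def reward(hidden_state, action):
    # Assumed toy payoff: +1 for matching the hidden state, -1 otherwise.
    return 1 if action == hidden_state else -1

def average_reward(policy, hidden_state, horizon):
    # policy(t) returns a dict mapping each action to its probability at step t.
    total = Fraction(0)
    for t in range(horizon):
        for action, prob in policy(t).items():
            total += prob * reward(hidden_state, action)
    return total / horizon

def always_a(t):
    # Deterministic and memoryless: the same action at every step.
    return {"A": Fraction(1)}

def alternate(t):
    # Deterministic but not memoryless: the action depends on the step count.
    return {ACTIONS[t % 2]: Fraction(1)}

def uniform_random(t):
    # Stochastic and memoryless: a fair coin flip at every step.
    return {"A": Fraction(1, 2), "B": Fraction(1, 2)}

def worst_case(policy, horizon=10):
    # Expected average reward under the least favourable hidden state.
    return min(average_reward(policy, h, horizon) for h in ACTIONS)

for name, policy in [("always A", always_a),
                     ("alternate", alternate),
                     ("uniform random", uniform_random)]:
    print(name, worst_case(policy))
```

Under these assumptions, and over an even horizon, the alternating sequence and the coin flip both have a worst-case expected value of 0, while "always A" has -1. The alternating policy only manages this because it keeps track of the step count, which is exactly the kind of memory a "memoryless" restriction would forbid.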

Possibly you intend to rule this out by specifying that the process is “memoryless”, but if I understand correctly, a process can be described as Markov as long as the current state carries all the information, regardless of whether the current state is observed. Correct me if I am wrong on this.