Computing an exact quantilal policy
Earlier we established that the quantilal policy can be computed in polynomial time to any given approximation (see “Proposition 5”). Now we show that an exact quantilal policy can be computed in polynomial time (in particular there is always a rational quantilal policy).
We assume geometric time discount throughout.
Consider . Define the linear operators and by
(Note that this is the transpose of the defined in “Proposition A.3” of the previous essay.)
Then, if and only if there is s.t.
(This is actually well known, but we spell out the proof to be self-contained.)
Suppose that . We already know that this implies that there is a stationary policy s.t. (we abuse notation in the obvious way): see the proofs of “Proposition 2” and “Proposition 3″. Define the linear operator by
It follows that
Also, obviously . We get
Conversely, suppose that is as above. Since , there is s.t. for any , if then
Again, we have
Also, for the same reason as before
By the assumption, the left hand side equals . We conclude
Assuming all parameters are rational like before, there is a polynomial time algorithm that computes a quantilal policy.
The algorithm starts by solving the following linear program. The indeterminates are and . The goal is maximizing . The constraints are
Then, the algorithm computes s.t. for any , if then
For s.t. , is arbitrary.
Now we need to explain why this algorithm is correct.
Observe that, the first 3 constraints mean that defined by lies in (by Lemma 1) and, moreover, for s.t. . It remains to show that the linear program amounts to maximizing inside . Indeed, the 4th constraint just means that . The last constraint implies that we are actually maximizing
The latter is indeed , since every corresponds to a pure strategy of the adversary in the corresponding zero-sum game: namely, this strategy is setting the penalty function to
(Strategies that place non-vanishing penalty on states outside of become irrelevant after imposing the 4th constraint. The remaining penalty functions form a simplex with vertices as above.)