Computing an exact quantilal policy

Ear­lier we es­tab­lished that the quan­tilal policy can be com­puted in polyno­mial time to any given ap­prox­i­ma­tion (see “Propo­si­tion 5”). Now we show that an ex­act quan­tilal policy can be com­puted in polyno­mial time (in par­tic­u­lar there is always a ra­tio­nal quan­tilal policy).

We as­sume ge­o­met­ric time dis­count through­out.

Lemma

Con­sider . Define the lin­ear op­er­a­tors and by

(Note that this is the trans­pose of the defined in “Propo­si­tion A.3” of the pre­vi­ous es­say.)

Then, if and only if there is s.t.

Proof

(This is ac­tu­ally well known, but we spell out the proof to be self-con­tained.)

Sup­pose that . We already know that this im­plies that there is a sta­tion­ary policy s.t. (we abuse no­ta­tion in the ob­vi­ous way): see the proofs of “Propo­si­tion 2” and “Propo­si­tion 3″. Define the lin­ear op­er­a­tor by

It fol­lows that

Define by

We have

Also, ob­vi­ously . We get

Con­versely, sup­pose that is as above. Since , there is s.t. for any , if then

Again, we have

Also, for the same rea­son as before

By the as­sump­tion, the left hand side equals . We conclude

Theorem

As­sum­ing all pa­ram­e­ters are ra­tio­nal like be­fore, there is a polyno­mial time al­gorithm that com­putes a quan­tilal policy.

Proof

The al­gorithm starts by solv­ing the fol­low­ing lin­ear pro­gram. The in­de­ter­mi­nates are and . The goal is max­i­miz­ing . The con­straints are

Then, the al­gorithm com­putes s.t. for any , if then

For s.t. , is ar­bi­trary.

Now we need to ex­plain why this al­gorithm is cor­rect.

Ob­serve that, the first 3 con­straints mean that defined by lies in (by Lemma 1) and, more­over, for s.t. . It re­mains to show that the lin­ear pro­gram amounts to max­i­miz­ing in­side . In­deed, the 4th con­straint just means that . The last con­straint im­plies that we are ac­tu­ally maximizing

The lat­ter is in­deed , since ev­ery cor­re­sponds to a pure strat­egy of the ad­ver­sary in the cor­re­spond­ing zero-sum game: namely, this strat­egy is set­ting the penalty func­tion to

(Strate­gies that place non-van­ish­ing penalty on states out­side of be­come ir­rele­vant af­ter im­pos­ing the 4th con­straint. The re­main­ing penalty func­tions form a sim­plex with ver­tices as above.)

No comments.