Quantilal control for finite MDPs

We introduce a variant of the concept of a “quantilizer” for the setting of choosing a policy for a finite Markov decision process (MDP), where the generic unknown cost is replaced by an unknown penalty term in the reward function. This is essentially a generalization of quantilization in repeated games with a cost independence assumption. We show that the “quantilal” policy shares some properties with the ordinary optimal policy, namely that (i) it can always be chosen to be Markov, (ii) it can be chosen to be stationary when time discount is geometric, and (iii) the “quantilum” value of an MDP with geometric time discount is a continuous piecewise rational function of the parameters, and it converges when the discount parameter approaches 1. Finally, we demonstrate a polynomial-time algorithm for computing the quantilal policy, showing that quantilization is not qualitatively harder than ordinary optimization.


Quantilization (introduced in Taylor 2015) is a method of dealing with “Extremal Goodhart’s Law”. According to Extremal Goodhart, when we attempt to optimize a utility function by aggressively optimizing a proxy, we are likely to land outside of the domain where the proxy is useful. Quantilization addresses this by assuming an unknown cost function whose expectation w.r.t. some reference probability measure is bounded by 1. The reference measure can be thought of as defining the “domain” within which the proxy is well-behaved (for example, it can be the probability measure of choices made by humans). We can then seek to maximize the proxy while constraining the expected cost by a fixed bound:

Alternatively, we can choose some penalty coefficient and maximize the minimal guaranteed expectation of the true utility:

These two formulations are almost equivalent, since both amount to finding a strategy that is Pareto efficient w.r.t. the two objectives (expected proxy utility and expected cost). In the first formulation the tradeoff is governed by the cost bound, and in the second by the penalty coefficient. Indeed, it is easy to see that any solution of the second criterion is also optimal for the first if we take the bound accordingly, and any solution of the first is also optimal for the latter for an appropriate choice of coefficient (it needs to be a subderivative of the Pareto frontier).

In the following, we will use as our starting point the second formulation, which can be thought of as a zero-sum game: the agent chooses a (randomized) strategy, the adversary chooses the cost function, and the agent’s payoff is the penalized expected utility. The quantilal strategy is the agent’s part of a Nash equilibrium of this game.

This formulation seems natural if we interpret the penalty coefficient as a measure of how “optimistic” we are about the proxy within the reference domain. In particular, the quantilum value (the value of the game) is a lower bound on the expectation of the true utility.
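To make the game concrete, here is a small numerical sketch (all numbers are made up, and the closed form for the adversary's best response is the standard one for this constraint set): a cost that is non-negative and has expectation at most 1 under the reference measure is, at worst, the whole budget spent where the likelihood ratio against the reference is largest, so the guaranteed value of a distribution is its expected proxy utility minus the penalty coefficient times that maximal ratio.

```python
# Illustrative one-shot quantilization with 4 outcomes (all numbers made up).
sigma = [0.4, 0.3, 0.2, 0.1]   # reference measure
U     = [1.0, 2.0, 3.0, 10.0]  # proxy utility; the last outcome looks great under the proxy
eta   = 1.0                    # penalty coefficient

def guaranteed_value(Z):
    """E_Z[proxy] minus eta times the worst-case expected cost: the adversary
    (cost >= 0, expectation <= 1 under sigma) spends its whole budget on the
    outcome where Z/sigma is largest."""
    eu = sum(z * u for z, u in zip(Z, U))
    worst = max(z / s for z, s in zip(Z, sigma))
    return eu - eta * worst

naive = [0.0, 0.0, 0.0, 1.0]   # deterministic proxy maximizer
print(guaranteed_value(naive)) # 10 - 1.0*10 = 0.0: no guarantee at all
print(guaranteed_value(sigma)) # 2.6 - 1.0*1 = 1.6: imitating the reference does better
```

The naive proxy maximizer concentrates exactly where the adversary can hurt it most, while imitating the reference caps the worst-case cost at its expectation bound.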

In principle, this formalism can be applied to sequential interaction between an agent and an environment, if we replace the strategy set by the set of policies. However, if it is possible to make structural assumptions about the utility and the cost, we can do better. Taylor explores one such structural assumption, namely a sequence of independent games in which both utility and cost are additive across the games. We consider a more general setting, namely that of a finite Markov decision process (MDP).


Given a set , the notation will denote the set of finite strings over alphabet , i.e.

denotes the space of infinite strings over alphabet , equipped with the product topology and the corresponding Borel sigma-algebra. Given and , is the -th symbol of the string (in our conventions, so the string begins from the 0th symbol). Given and , the notation means that is a prefix of .

Given a set , and , the notation will indicate the prefix of of length . That is, and .

Given a measurable space , we denote the space of probability measures on .

Given measurable spaces and , the notation means that is a Markov kernel from to . Given , is the corresponding probability measure on . Given measurable, . Given , . Given , is the composition of and , and when , is the -th composition power.
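As a concrete rendering of the last convention: finite Markov kernels can be represented as row-stochastic matrices, composition becomes matrix multiplication, and the n-th composition power is the matrix power (a minimal sketch; the kernel below is made up):

```python
# Finite Markov kernels as row-stochastic matrices: composition of kernels is
# matrix multiplication, and the n-th composition power is the matrix power.
def compose(K1, K2):
    rows, mids, cols = len(K1), len(K2), len(K2[0])
    return [[sum(K1[i][j] * K2[j][k] for j in range(mids)) for k in range(cols)]
            for i in range(rows)]

def kernel_power(K, n):
    out = [[float(i == j) for j in range(len(K))] for i in range(len(K))]  # identity kernel
    for _ in range(n):
        out = compose(out, K)
    return out

K = [[0.9, 0.1],
     [0.5, 0.5]]
two_step = kernel_power(K, 2)   # two-step transition probabilities
print(two_step)                 # each row still sums to 1
```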


A finite MDP is defined by a non-empty finite set of states , a non-empty finite set of actions , a transition kernel and a reward function . To specify the utility function, we also need to fix a time discount function . This allows defining by

We also fix an initial distribution over states . In “classical” MDP theory, it is sufficient to consider a deterministic initial state , since the optimal policy doesn’t depend on the initial state anyway. However, quantilization is different, since the worst-case cost function depends on the initial conditions.

We now assume that the cost function (or, the true utility function ) has the same form. That is, there is some penalty function s.t.

Given a policy (where the factor represents the past history and the factor represents the current state), we define in the usual way (the distribution over histories resulting from ). Finally, we fix . We are now ready to define quantilization in this setting.

Definition 1

is said to be quantilal relative to reference policy when

We also define the quantilum value by

(The quantilum value cannot be minus infinity, since taking the reference policy itself yields a finite lower bound.)

In the original quantilization formalism, the quantilal strategy can be described more explicitly, as sampling according to the reference measure from some top fraction of actions ranked by expected utility. Here, we don’t have an analogous description, but we can in some sense evaluate the infimum over penalty functions.
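For comparison, the explicit description from the original formalism can be sketched as follows (numbers are illustrative; the fraction parameter plays the role of the quantile): rank actions by expected utility and sample from the reference measure conditioned on its top fraction of mass.

```python
# Ordinary quantilizer sketch: sample from the reference measure sigma
# conditioned on the top-q fraction of actions ranked by utility.
def quantilizer(sigma, U, q):
    order = sorted(range(len(U)), key=lambda i: -U[i])  # best actions first
    Z, mass = [0.0] * len(U), 0.0
    for i in order:
        take = min(sigma[i], q - mass)  # fill up to total reference mass q
        if take <= 1e-12:               # guard against float residue
            break
        Z[i], mass = take, mass + take
    return [z / mass for z in Z]        # renormalize to a distribution

sigma = [0.4, 0.3, 0.2, 0.1]
U     = [1.0, 2.0, 3.0, 10.0]
print(quantilizer(sigma, U, 0.3))       # supported on the top 0.3 of sigma-mass
```

The resulting density ratio against sigma is bounded by 1/q, which is exactly the bounded-likelihood-ratio property that makes the worst-case cost controllable.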

For any , define by

For any , the notation signifies the Renyi divergence of order ∞:

In general, .
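For finite distributions this divergence has a direct transcription (assuming the standard convention: the log of the maximal likelihood ratio, infinite when the support condition fails):

```python
import math

def renyi_inf(mu, nu):
    """Renyi divergence of order infinity between finite distributions:
    log of the maximal ratio mu(x)/nu(x) over the support of mu,
    +infinity if mu puts mass where nu does not."""
    worst = 0.0
    for m, n in zip(mu, nu):
        if m > 0.0:
            if n == 0.0:
                return math.inf
            worst = max(worst, m / n)
    return math.log(worst)

print(renyi_inf([0.5, 0.5], [0.5, 0.5]))  # 0.0: identical distributions
print(renyi_inf([1.0, 0.0], [0.5, 0.5]))  # log 2: density ratio bounded by 2
```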

Proposition 1

is quantilal relative to reference policy if and only if

Also, we have

If the maximization in Proposition 1 were over arbitrary distributions rather than those of the given form, we would get ordinary quantilization, and sampling would be equivalent to sampling some top fraction of the reference measure. However, in general, the image of the operator is some closed convex set which is not the entire space of distributions.

So far we considered arbitrary (non-stationary) policies. From classical MDP theory, we know that an optimal policy can always be chosen to be Markov:

Definition 2

is said to be a Markov policy, when there is some s.t. .

Note that a priori it might be unclear whether there is even a non-stationary quantilal policy. However, we have

Proposition 2

For any , there exists a Markov policy which is quantilal relative to .

Now assume that our time discount function is geometric, i.e. there exists s.t. . Then it is known that an optimal policy can be chosen to be stationary:

Definition 3

is said to be a stationary policy, when there is some s.t. .

Once again, the situation for quantilal policies is analogous:

Proposition 3

If is geometric, then for any , there exists a stationary policy which is quantilal relative to .

What is not analogous is that an optimal policy can be chosen to be deterministic whereas, of course, this is not the case for quantilal policies.

It is known that the value of an optimal policy depends on the parameters as a piecewise rational function; in particular, it converges as the discount parameter approaches 1 and has a Taylor expansion there. The quantilum value has the same property.

Proposition 4

is a piecewise rational continuous function of , , the matrix elements of , the values of , the values of and the values of , with a finite number of “pieces”.

Corollary 1

Assume that is a stationary policy. Then, converges as , holding all other parameters fixed (in the sense that is fixed whereas changes as a function of ). It is analytic in on some interval and can therefore be described by a Taylor expansion around inside this interval.

Note that for optimal policies, Proposition 4 holds for a simpler reason. Specifically, the optimal policy is piecewise constant (since it’s deterministic), and there is a Blackwell policy, i.e. a fixed policy which is optimal for any discount sufficiently close to 1. And it is easy to see that for a constant policy, the value is a rational function of the discount parameter. On the other hand, the quantilal policy is not guaranteed to be locally constant anywhere. Nevertheless, the quantilum value is still piecewise rational.
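The rational-function claim for a constant policy can be checked directly: with geometric discount and the discount-averaged normalization, the value solves a linear system whose coefficients are affine in the discount parameter, so it is a ratio of polynomials in that parameter. A sketch with exact rational arithmetic (the two-state chain below is made up):

```python
from fractions import Fraction as F

# Value of a fixed policy on a made-up 2-state chain under geometric discount:
# solve (I - lam*P) v = r, return (1 - lam) * zeta . v  (discount-averaged).
P    = [[F(1, 2), F(1, 2)],
        [F(1, 4), F(3, 4)]]   # transition matrix of the fixed policy
r    = [F(1), F(0)]           # reward per state
zeta = [F(1), F(0)]           # initial distribution

def value(lam):
    a, b = 1 - lam * P[0][0], -lam * P[0][1]
    c, d = -lam * P[1][0], 1 - lam * P[1][1]
    det = a * d - b * c                  # a polynomial in lam
    v0 = (r[0] * d - b * r[1]) / det     # Cramer's rule for the 2x2 system
    v1 = (a * r[1] - r[0] * c) / det
    return (1 - lam) * (zeta[0] * v0 + zeta[1] * v1)

print(value(F(1, 2)))     # 5/7 -- exactly rational for rational inputs
print(value(F(99, 100)))  # near the stationary average reward 1/3 as the discount nears 1
```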

Finally, we address the question of the computational complexity of quantilization. We prove the following:

Proposition 5

Assume geometric time discount. Assume further that , , , , and are rational numbers. Then:

a. There is an algorithm for computing which runs in time polynomial in the size of the input , , , , and . Also, if is stationary and are rational, then are also rational and can be computed from , , and in polynomial time.

b. Given an additional rational input parameter , there is an algorithm for computing a stationary policy which is an -equilibrium in the zero-sum game associated with quantilization, which runs in time polynomial in the size of the input and .

EDIT: In fact, it is possible to do better and compute an exact quantilal policy in polynomial time.

Future Directions

To tease a little, here are some developments of this work that I’m planning:

  • Apply quantilization to reinforcement learning, i.e. when we don’t know the MDP in advance. In particular, I believe that the principles of quantilization can be used not only to deal with misspecified rewards, but also to deal with traps to some extent (assuming it is a priori known that the reference policy has at most a small probability of falling into a trap). This has some philosophical implications for how humans avoid traps.

  • Unify this formalism with DRL. The role of the reference policy will be played by the advisor (thus the reference policy is not known in advance but is learned online). This means we can drop the sanity condition for the advisor, at the price of (i) having a regret bound defined w.r.t. some kind of quantilum value rather than the optimal value, and (ii) having a term in the regret bound proportional to the (presumably small) rate of falling into traps when following the reference (advisor) policy. It should be possible to develop this further by unifying it with the ideas of catastrophe DRL.

  • Deal with more general environments, e.g. POMDPs and continuous state spaces.


Proposition A.1

Proof of Proposition A.1

The definition of implies that


Observe that

It follows that for any that satisfies the constraint , we have and therefore

To show that the inequality can be arbitrarily close to equality, choose s.t. and . If , we can define by

Clearly and . We get

In the case , we can take any and define by

Clearly and . We get

Since , we can make this expression arbitrarily low, and therefore the infimum is , which is what we need since in this case .

Proposition 1 now follows immediately from Proposition A.1.

We will use the notation (the space of all policies). We also have (the space of Markov policies) and (the space of stationary policies). Mildly abusing notation, we will view as a subspace of and as a subspace of .

Proposition A.2

For any , there exists some policy which is quantilal relative to .

Proof of Proposition A.2

is the product of a countable number of copies of (indexed by ). is naturally a topological space (a simplex), and we can thereby equip with the product topology. By Tychonoff’s theorem, is compact. Moreover, it is easy to see that is a continuous mapping. Finally, observe that is lower semicontinuous in (since it is the maximum of a number of continuous functions) and therefore is upper semicontinuous in . By the extreme value theorem, it follows that a quantilal policy exists.

For any , we define by

Proof of Proposition 2

By Proposition A.2, there is a quantilal policy . We define by

We now prove by induction that for any , .

For , we have .

For any and any , we have

To complete the induction step, observe that, by definition of

Now, for any , and therefore . We conclude that is also quantilal.

Proof of Proposition 3

By Proposition 2, there is which is a Markov quantilal policy. We have

Taking the expected value of this equation w.r.t. we get

Also, we have

It follows that

Define by

We get

Define the linear operator by the matrix

Viewing as a subset of , we get

On the other hand, we have

Therefore, and is also quantilal.

Proposition A.3

Assume geometric time discount. Consider any . Define the linear operators and by the matrices

Define the linear operator by

Define by

Define as follows

Here, vector inequalities are understood to be componentwise.


In particular, if , the above expression describes

Proof of Proposition A.3

Denote . As usual in extensive-form games, behavioral strategies are equivalent to mixed strategies, and therefore the image of the pushforward operator is the same as the image of . It follows that

is a compact Polish space (in the sense of the product topology) and therefore is compact in the weak topology. By Sion’s minimax theorem

Now consider any . implies (looking at the component) that

Therefore, there is some s.t.

The above is the Bellman equation for a modified MDP with reward function . Denote the version of the operator for the initial state (instead of ). We get

Observing that , we get

On the other hand, for any s.t. and (these inequalities correspond to the component of ), there is s.t.

Namely, this is the solution of the Bellman equation for the reward function . Therefore, we have

Taking the infimum of both sides over inside the domain, we get

Proof of Proposition 4

Consider the characterization of in Proposition A.3. By general properties of systems of linear inequalities, there is some s.t.

Here, the notation means taking the submatrix of consisting of the rows corresponding to . Similarly, is the subvector of consisting of the components corresponding to .

(To see this, use the fact that the minimum of a linear function on a convex set is always attained on the boundary, and apply induction by dimension.)

Moreover, has to be a single point . Indeed, if it had more than one point, then it would contain a straight line . The projection of on the second direct summand has to be a single point , otherwise some point in would violate the inequality , which would contradict . Therefore, the projection of on the first direct summand is also a straight line . As in the proof of Proposition A.3, for any , is an upper bound on the value of state in the MDP with reward function . In particular, if then . However, since is a line, there must be some s.t. can be any real number for some . This is a contradiction.

Denote the projection operator on the first direct summand. It follows that we can always choose s.t. , and we have

For each , the expression on the right hand side is a rational function of and the matrix elements of and , which, in turn, are polynomials in the parameters the dependence on which we are trying to establish. Therefore, this expression is a rational function in those parameters (unless the denominator vanishes identically, in which case this choice can be ignored). So, always equals one of a finite set of rational functions (corresponding to different choices of ). The continuity of also easily follows from its characterization as .

Proposition A.4

Assume geometric time discount and stationary . Then, is a rational function of , , and with rational coefficients.

Proof of Proposition A.4

Define the linear operator by the matrix

We have

Proof of Corollary 1

By Propositions 4 and A.4, is a continuous piecewise rational function of with a finite number of pieces. Two rational functions can only coincide at a finite number of points (since a polynomial can only have a finite number of roots), therefore there is only a finite number of values of at which “switches” from one rational function to another. It follows that there is some s.t. is a fixed rational function of in the interval .

Moreover, it always holds that

The first inequality holds since the guaranteed performance of the quantilal policy is at least as good as the guaranteed performance of . The second inequality is a consequence of the requirement that is non-negative.

It follows that cannot have a pole at and therefore must converge to a finite value there.

Proof of Proposition 5

The algorithm for is obtained from Proposition A.3 using linear programming.

The claim regarding for stationary follows from Proposition A.4.

We now describe the algorithm for computing an -equilibrium.

For any , define by

Consider any . We define by

Let be the -operator for the MDP with transition kernel , and define

It is easy to see that if and only if there is quantilal s.t. . Indeed, the MDP with kernel is equivalent to the original MDP under the constraint that, when in state , any action has to be taken with the minimal probability . In particular (since we constrained the possible policies but not the possible penalties). So, if as above exists, then we can use it to construct a stationary policy for the new MDP with guaranteed value . Conversely, if the new MDP has a stationary policy with guaranteed value , then it can be used to construct the necessary .

Using Proposition A.3, we can compute for any rational by linear programming. This allows us to find the desired policy by binary search on , one (state, action) pair at a time.
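The outer loop of that search can be sketched as follows. The linear-programming oracle from Proposition A.3 is abstracted into a hypothetical `qv_oracle` and stubbed with a made-up monotone function purely to exercise the search, so this shows only the shape of the procedure, not the actual algorithm:

```python
def max_feasible_prob(qv_oracle, target, lo=0.0, hi=1.0, tol=1e-6):
    """Binary search for the largest minimal-probability constraint delta such
    that the constrained MDP still achieves the target value.
    qv_oracle(delta) is assumed monotone non-increasing in delta."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if qv_oracle(mid) >= target:
            lo = mid   # constraint still feasible: tighten it
        else:
            hi = mid   # too tight: back off
    return lo

# Stub oracle (illustrative only): pretend the value degrades linearly.
delta = max_feasible_prob(lambda d: 1.0 - 2.0 * d, target=0.5)
print(round(delta, 4))   # -> 0.25
```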
