When to use quantilization

In 2015, Jessica introduced quantilization as a countermeasure for Goodhart's Law and specification-gaming. Since these are such central problems in AI safety, I consider quantilization to be one of the best innovations in AI safety so far, but it has received little attention from the AI safety field. I think one reason for this is that researchers aren't quite clear what problem, formally, quantilization solves, that other algorithms don't. So in this piece, I define a robust reward problem, and then discuss when quantilization solves this problem well, and when it doesn't.

Definition 1 (Robust reward problem)

We can define a robust reward problem as a tuple: $\langle A, U, I, \gamma, c \rangle$.

$A$ is the action space

$U : A \to \mathbb{R}$ is the explicit reward

$I$ is the space of implicit rewards, a family of functions $A \to \mathbb{R}$ (below, we will take $I$ to be the set of all functions whose values lie in an interval $[I_{\min}, I_{\max}]$)

$\gamma$ is a distribution over actions

$c \geq 0$ is the maximum implicit loss in the demonstrations

The goal of the agent is to maximize $\mathbb{E}_{a\sim p}[U(a) + i(a)]$ for any $i \in I$. If that were the end of the definition, the task would be too difficult, because an adversarial reward could thwart any strategy. So we need to assume that $i$ is pretty well-behaved on the region covered by $\gamma$.

Formally, the goal is to select a strategy $p$ that maximizes the worst-case value

$$V(p) \;=\; \min_{i \in I} \; \mathbb{E}_{a\sim p}[U(a) + i(a)],$$

where every implicit reward in $I$ is assumed to satisfy

$$\mathbb{E}_{a\sim\gamma}[i(a)] \;\ge\; -c. \tag{1}$$

The intuition behind (1) is that a teacher may forget to include some possible failures in their reward function, but they ought not to leave out the same mistakes that they themselves frequently make. And at any rate, without (1), assuring good performance is impossible.
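
To make this concrete, here is a minimal sketch (not from Jessica's paper; the toy numbers, and the choice to represent $I$ as all functions into a box $[I_{\min}, I_{\max}]$, are illustrative assumptions) that computes the worst-case value of a strategy over a finite action space, with the adversary choosing $i$ by linear programming:

```python
import numpy as np
from scipy.optimize import linprog

# Toy robust reward problem over a finite action space (all numbers invented).
U = np.array([0.0, 1.0, 3.0])        # explicit reward U(a)
gamma = np.array([0.6, 0.39, 0.01])  # demonstration distribution gamma
c = 0.1                              # bound on expected implicit loss under gamma
I_min, I_max = -10.0, 0.0            # implicit rewards take values in [I_min, I_max]

def worst_case_value(p):
    """Worst-case V(p) = E_p[U + i], minimized over implicit rewards i that take
    values in [I_min, I_max] and satisfy constraint (1): E_gamma[i] >= -c."""
    # The adversary solves a linear program: minimize sum_a p(a) * i(a)
    # subject to -sum_a gamma(a) * i(a) <= c and I_min <= i(a) <= I_max.
    res = linprog(p, A_ub=[-gamma], b_ub=[c],
                  bounds=[(I_min, I_max)] * len(p), method="highs")
    return float(p @ U) + res.fun

print(worst_case_value(gamma))                      # imitation: E_gamma[U] - c
print(worst_case_value(np.array([0.0, 0.0, 1.0])))  # always pick the argmax of U
```

With these particular numbers, imitation's worst case is $0.32$, while always picking the $U$-maximizing action has a worst case of $-7$, because the adversary can pile loss onto the rarely-demonstrated action.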

When quantilization works

You can skip this section if you're familiar with Jessica's work.

If we set $I = [-\infty, 0]$, we recover the original setting in which quantilizers were described. Then (as V Kosoy has argued) the ordinary quantilizer theorems mean that we get the best lower bound for $V$.

We will need to reuse the definitions, so to briefly recap:

Definition 2 (Quantilizer). A $q$-quantilizer is an agent that, when faced with a decision problem, returns a random action in the top $q$ proportion of the base distribution $\gamma$, sorted by the explicit expected utility achieved if that action is executed.
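
As a concrete (and merely illustrative) sketch of Definition 2 over a finite action space, a $q$-quantilizer can be built by keeping the top $q$ of $\gamma$'s probability mass, ranked by $U$, and renormalizing; the helper below and its toy values are assumptions of mine, not code from Jessica's paper:

```python
import numpy as np

def quantilize(U, gamma, q):
    """Return p_q: gamma restricted to the top q proportion of its probability
    mass when actions are sorted by explicit reward U, then renormalized."""
    p = np.zeros_like(gamma, dtype=float)
    mass = 0.0
    for a in np.argsort(-U):            # actions from best to worst under U
        if mass >= q:
            break
        take = min(gamma[a], q - mass)  # use up base-distribution mass until q
        p[a] = take
        mass += take
    return p / q                        # renormalize so p_q sums to one

U = np.array([0.0, 1.0, 3.0])           # toy explicit reward
gamma = np.array([0.6, 0.39, 0.01])     # toy base distribution
p_q = quantilize(U, gamma, q=0.25)
action = np.random.default_rng(0).choice(len(U), p=p_q)  # act by sampling p_q
print(p_q, action)
```

By construction, no action's probability is scaled up by more than a factor of $1/q$ relative to $\gamma$, which is what drives the loss bounds below.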

Jessica proves three things. First, that a quantilizer does not have much worse implicit loss than the base distribution $\gamma$:

$$\mathbb{E}_{a\sim p_q}[i(a)] \;\ge\; -\frac{c}{q},$$

where $p_q$ is the distribution over actions selected by the $q$-quantilizer.

Second, she proves that no other strategy $p$ can get a lower bound better than:

$$\mathbb{E}_{a\sim p}[i(a)] \;\ge\; -c \cdot \max_{a \in A} \frac{p(a)}{\gamma(a)}$$

(There is no lower bound at all if there exists an $a$ such that $p(a) > 0$ but $\gamma(a) = 0$.)

This means the quantilizer has the best guarantee that can be obtained. Basically, any strategy that puts much more weight than $\gamma$ on any action can be defeated by an adversarial loss function that places all of the loss there. In a continuous action space, this is really bad. It means simply optimizing $U$ can lead to arbitrarily bad values of $i$ and hence of $U + i$, because the action you choose might contain unlimited counterfeit utility. More surprisingly, the same is true for agents that maximize a mixed reward, such as $U(a) + \log \gamma(a)$, because these mixed-reward agents also tend to land on one fixed action. In the literature, there are many variations on this theme. One can start with imitating, and then optimize and imitate in alternating steps, and so on. But any such scheme, and any remotely similar scheme, can land on a fixed action and have unbounded loss.

Anyway, the third thing that Jessica proves is that if you can't let $p(a)/\gamma(a)$ be too high anywhere, then the best thing you can do from the point of view of obtaining explicit reward $\mathbb{E}_{a\sim p}[U(a)]$ is to select the top actions according to $U$ with probability $\gamma(a)/q$, and to place zero probability on all other actions, in other words, to quantilize. The important thing to notice is that our loss guarantee is in terms of $\max_a p(a)/\gamma(a)$, and given a certain $\max_a p(a)/\gamma(a)$, the most explicit reward is obtained by quantilizing. As V Kosoy has said, quantilizing with different values of $q$ carves out the Pareto frontier of explicit reward and guaranteed implicit loss, and so the maximum guaranteed $V$ is also obtained by quantilizing. Specifically, one is guaranteed

$$V(p_q) \;\ge\; \mathbb{E}_{a\sim p_q}[U(a)] - \frac{c}{q}.$$
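
Continuing the toy sketch from above (still illustrative, and reusing the quantilize helper along with U and gamma from the earlier block), we can trace out this frontier by sweeping $q$ and picking the value with the best guaranteed $V$:

```python
import numpy as np

# Assumes U, gamma and quantilize() from the earlier sketch; c as in constraint (1).
c = 0.1
qs = np.linspace(0.05, 1.0, 20)
guarantees = [float(quantilize(U, gamma, q) @ U) - c / q for q in qs]
best = int(np.argmax(guarantees))
print("best q:", round(qs[best], 2), "guaranteed V:", round(guarantees[best], 2))
print("imitation (q = 1):", float(gamma @ U) - c)  # the q = 1 end of the frontier
```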

For this kind of robust reward problem, quantilization does best. Imitation, which is the subtype of quantilization with $q = 1$, obtains a guarantee of $\mathbb{E}_{a\sim\gamma}[U(a)] - c$ on $V$, whereas optimization has no bound on $\max_a p(a)/\gamma(a)$, and so is doomed!

So Jessica tells a compelling story about one particular situation (with $I = [-\infty, 0]$), where actions cannot be (much) better than expected, but can be much worse. But what if we remove that assumption by choosing a value of $I_{\max} > 0$?

When optimization and imitation work better than quantilization

Suppose that $I = [I_{\min}, \infty]$ with $I_{\min}$ finite. Actions can be arbitrarily good, but losses are bounded. In this situation, any strategy has a bound on its expected implicit loss, which is $-I_{\min}$. Given this, you might as well pick the best action every time, giving the following guarantee:

$$V \;\ge\; \max_{a \in A} U(a) + I_{\min}$$

Alternatively, you can imitate, and get the guarantee:

$$V \;\ge\; \mathbb{E}_{a\sim\gamma}[U(a)] - c$$

Which course of action is preferable depends on the values of $I_{\min}$ and $c$.
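
As a toy illustration (numbers invented, continuing the same style of sketch as above), the comparison comes down to which of the two guarantees is larger:

```python
import numpy as np

U = np.array([0.0, 1.0, 3.0])        # toy explicit reward
gamma = np.array([0.6, 0.39, 0.01])  # toy demonstration distribution
c, I_min = 0.1, -0.5                 # loss budget under gamma; lower bound on i

opt_guarantee = float(U.max()) + I_min  # optimizer: hidden loss is at most -I_min
imit_guarantee = float(gamma @ U) - c   # imitator: guarantee from constraint (1)
print("optimization:", opt_guarantee, "imitation:", imit_guarantee)
# With I_min = -0.5 optimization wins here; a larger possible loss (say I_min = -5) flips it.
```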

Suppose alternatively that $I = [-\infty, \infty]$. Then the losses of an optimizer are unbounded, but the losses of any $q$-quantilizer with $q < 1$ are unbounded too (an adversarial $i$ can offset a huge loss on a rare, quantilizer-favored action with large positive rewards elsewhere, while still satisfying (1)). In this case, the only way to get any lower bound on your losses is to just imitate the given distribution, having your strategy be equal to the base distribution $\gamma$. Then you obtain the bound

$$V(\gamma) \;\ge\; \mathbb{E}_{a\sim\gamma}[U(a)] - c.$$

How it can depend on U

Now that we've covered a few values of $I$, let's double down on considering $I = [-\infty, 0]$. In this scenario, we said that quantilization is optimal, but we didn't yet say whether the best form of quantilization might have $q \to 0$ (optimization) or $q = 1$ (imitation).

Intuitively, if $U$ is completely flat, it makes sense to perform imitation, because diverging from $\gamma$ has something to lose but nothing to gain. Conversely, if $c$ is zero (there is no hidden loss), then one can optimize, so long as one is constrained to the support of $\gamma$, because the hidden loss is zero for that set of actions. But can we say more?

The case of optimization is pretty clear-cut, because for any $c > 0$, an infinite amount of loss is incurred as $q \to 0$. Optimization would only maximize $V$ if $\mathbb{E}_{a\sim p_q}[U(a)]$ increased faster than hyperbolically (faster than $c/q$) as $q \to 0$. Basically, this would require that $U$ diverged, which would be a really pathological situation. So we can basically rule out optimization for $I = [-\infty, 0]$, $c > 0$.

What about imitation? Imitation will be optimal for many reward functions. Basically, decreasing $q$ just increases $\mathbb{E}_{a\sim p_q}[U(a)]$ while decreasing the guaranteed hidden reward $-c/q$. If there exists some sweet spot of actions that are pretty common but substantially outperform imitation, then quantilization will be best, and otherwise, the best approach is imitation.

Which set of assumptions best matches reality?

In general, our actions, or those of an AI, can bring astronomical benefits or harms, so $I_{\max} = 0$ or $I_{\min} > -\infty$ is unrealistic.

When training for a fully general, autonomous task, it is apt to model the scenario as $I = [-\infty, \infty]$, because the demonstrated actions could have complex downstream effects (see Taylor on “butterfly effects”) that bear out on the whole light cone. But at least, in this setting, we can take consolation that imitation is theoretically safe, and try to advance projects like brain emulation and factored cognition that would imitate human reasoning. The disadvantage of these proposals is that they can basically only yield speed superintelligence, rather than quality superintelligence.

The question of whether Jessica's assumption of $I = [-\infty, 0]$ is a reasonable model for some tasks is interesting. Following Jessica, we need (1) it to be sufficient to perform some small factor better than a human demonstrator, (2) for the human to encode all important information either in the explicit utility function or in the demonstrations, and (3) for the AI system not to decrease the frequency of astronomical, unseen positive impacts. For quantilization to do any better than imitation, we also need (4) for $U$ to have sufficient slope that it is worthwhile to deviate from imitation. It would certainly be nice if some realistic, and potentially pivotal, tasks could be quantilized, but I think the jury is still out, and the question now primarily awaits experimental investigation.