Stable Pointers to Value III: Recursive Quantilization

This is a very loose idea.

In Stable Pointers to Value II, I pointed at a loose hierarchy of approaches in which you try to get rid of the wireheading problem by revising the feedback loop to remove the incentive to wirehead. Each revision seems to change the nature of the problem (perhaps to the point where we don’t want to call it a wireheading problem, and instead would put it in a more general perverse-instantiation category), but not eliminate problems entirely.

When I talked with Lawrence Chan today, he described a way of solving problems by “going meta” (a strategy which he was mostly suspicious of, in the conversation). His example was: you can’t extract human values by specifying this as a learning problem, because of severe identifiability problems. However, it is not entirely implausible that we can “learn to learn human values”: have humans label examples of other humans trying to do things, indicating what values are being expressed in the scenario.

If this goes wrong, you can try to iterate the operation again, learning to learn to learn...

This struck me as similar to the hierarchy I had constructed in my older post.

My interpretation of what Lawrence meant by “going meta” here is this: machine learning research “eats” other research fields by using automated learning to solve problems which were previously being solved by the process of science, i.e., hand-crafting hypotheses and testing them. AI alignment research is full of cases where this doesn’t seem like a very good approach. However, one attitude we can take to such cases is to do the operation again: propose to learn how humans would solve this sticky problem.

This is not at all like other learning-to-learn approaches which merely seek to speed up normal learning. The idea is that our object-level loss function is insufficient to point out the behavior we really want. We want new normative feedback to come in at the meta-level, telling us more about which ways of solving the object-level problem are desirable and which are undesirable.

The idea I’m about to describe seems like a fairly hopeless idea, but I’m interested in seeing how it would go regardless.

What is the fixed point of this particular “go meta” operation?

The intuition is this: any utility function we try to write down has perverse instantiations, so that we don’t really want to optimize it fully. Searching over a big space leads to Goodhart and optimization daemons. Unfortunately, search is more or less the only way to produce intelligent behavior that we know of.

However, it seems like we can often improve on this situation by providing more human input to check what was really wanted. Furthermore, it seems like we generally get more by doing this on the meta level: we don’t just want to refine the estimated utility function; we want to refine our notion of safely searching for good options (avoiding searches which Goodhart on looks-good-to-humans by manipulating human psychology, for example), refine our notion of what learning the utility function even means, and so on.

Every stage of going meta introduces a need for yet another search, which brings back the problems all over again. But, maybe we can do something interesting by jumping up all the meta levels here, so that each search is itself governed by some feedback, except when we bottom out in extremely simple operations which we trust.

(This feels conceptually similar to some intuitions in HCH/IDA, but I don’t see that it is exactly the same.)

Recursive Quantilization

“Recursive quantilization” is an attempt to make the idea a little more formal. I don’t think it quite captures everything I would want from the “fixed point of the meta operation Lawrence Chan was suspicious of”, but it has the advantage of being slightly more concrete.

Quantilizers are a way of optimizing a utility function when you’re suspicious that it isn’t the “true” utility function you should be optimizing, but you do think that the average difference between the two is low when sampling things from a known background distribution. Intuitively, you don’t want to move too far from the background distribution where your utility estimates are accurate, but you do want to optimize in the direction of high utility somewhat.
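
As a minimal sketch of that intuition, assuming we can sample from the background distribution and evaluate the suspect utility function: draw many samples, keep the top q fraction according to the utility estimate, and pick one of those at random. The names `quantilize`, `sample_base`, `utility`, `q`, and `n_samples` are hypothetical, and this is a sampling-based approximation rather than a definitive implementation.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def quantilize(
    sample_base: Callable[[random.Random], T],
    utility: Callable[[T], float],
    q: float,
    n_samples: int = 1000,
    rng: Optional[random.Random] = None,
) -> T:
    """Approximate q-quantilizer (illustrative sketch only).

    sample_base: draws one option from the trusted background distribution.
    utility:     the proxy utility we only partly trust.
    q:           fraction of top-scoring samples to keep, in (0, 1].
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    rng = rng if rng is not None else random.Random()

    # Stay close to the background distribution by only ever
    # considering options that were actually sampled from it.
    candidates = [sample_base(rng) for _ in range(n_samples)]

    # Optimize "somewhat": rank by the proxy utility and keep the top q fraction.
    candidates.sort(key=utility, reverse=True)
    keep = max(1, int(q * n_samples))

    # Choose uniformly among the kept options instead of taking the argmax.
    return rng.choice(candidates[:keep])
```

With q = 1 this just samples from the background distribution; as q shrinks toward 0 it approaches picking the best of the sampled options, which is where the trust in the utility estimate runs out.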

What if we want to quantilize, and we expect that there is some background distribution which would make us have a decent amount of trust in the accuracy of the given utility function, but we don’t know what that background distribution is?

We have to learn the “safe” background distribution.

Learning is going to require a search for hypotheses matching whatever feedback we get, which re-introduces Goodhart, etc. So, we quantilize that search. But we need a background distribution which we expect to be safe. And so on.

  • You start with very broad priors on what background distributions might be safe, so you barely optimize at all, but have some default (human-programmed) strategy of asking humans questions.

  • You engage in active learning, steering toward questions which resolve the most important ambiguities (to the extent that you’re willing to steer).

  • Because we are taking “all the meta levels” here, we can do some amount of generalization across meta levels so that the stack of meta doesn’t get out of hand. In other words, we’re actually learning one safe background distribution for all the meta levels, which encodes something like a human concept of “non-fishy” ways of going about things.
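
A very rough toy sketch of the stack described above, under strong simplifying assumptions that are mine and not part of the idea itself: plans are single numbers, a background distribution is a zero-mean Gaussian described only by its spread, human feedback is stood in for by a hypothetical scoring function `safety_feedback`, and the same feedback function is reused at every meta level as a crude stand-in for generalizing across levels. The stack has a finite `depth` and bottoms out in a hand-set `trusted_scale`. All names here are hypothetical, and the active-learning part (choosing which questions to ask) is left out entirely.

```python
import random
from typing import Callable, Optional


def quantilize(sample_base, utility, q, n_samples, rng):
    """Sample n_samples options, keep the top q fraction by utility, pick one at random."""
    candidates = sorted(
        (sample_base(rng) for _ in range(n_samples)), key=utility, reverse=True
    )
    return rng.choice(candidates[: max(1, int(q * n_samples))])


def recursive_quantilize(
    depth: int,
    proxy_utility: Callable[[float], float],
    safety_feedback: Callable[[float], float],
    q: float = 0.1,
    n_samples: int = 200,
    trusted_scale: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Toy stack of quantilized searches (assumption-laden illustration).

    Each meta level mildly optimizes the spread of the background
    distribution used one level down, scored by safety_feedback
    (higher means "looks less fishy"). The object level then mildly
    optimizes a plan against proxy_utility.
    """
    rng = rng if rng is not None else random.Random()

    # Top of the stack: a simple, hand-set choice that we just trust.
    scale = trusted_scale

    for _ in range(depth):
        # Meta level: candidate spreads are drawn from a background
        # distribution governed by the spread chosen one level up,
        # and the search over them is itself quantilized.
        scale = quantilize(
            lambda r, s=scale: abs(r.gauss(0.0, s)),
            safety_feedback,
            q,
            n_samples,
            rng,
        )

    # Object level: quantilize over plans drawn from the learned background.
    return quantilize(
        lambda r, s=scale: r.gauss(0.0, s), proxy_utility, q, n_samples, rng
    )
```

For example, calling `recursive_quantilize(2, lambda x: x, lambda s: -abs(s - 0.5), rng=random.Random(0))` runs two meta levels whose feedback prefers a spread near 0.5 before picking a plan. The sketch only shows the shape of the recursion; it says nothing about whether the feedback at higher levels can actually be obtained.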


There are a lot of potential concerns here, but the one which is most salient to me is that humans will have a lot of trouble providing feedback about non-fishy ways of solving the problems at even slightly high meta levels.

Object level: Plans for achieving high utility.

Meta 1: Distributions containing only plans which the utility function evaluates correctly.

Meta 2: Distributions containing only distributions-on-plans which the first-meta-level learning algorithm can be expected to evaluate correctly.

Et cetera.

How do you analyze a distribution? Presumably you have to get a good picture of its shape in the highly multidimensional space: look at examples of more and less typical members, and be convinced that the examples you looked at were representative. It’s also important that you go into its code and check that it isn’t intelligently optimizing for some misaligned goal.

It seems to me that a massive advance in transparency or informed oversight would be needed in order for humans to give helpful feedback at higher meta-levels.
