# Stable Pointers to Value III: Recursive Quantilization

This is a very loose idea.

In Stable Pointers to Value II, I pointed at a loose hierarchy of approaches in which you try to get rid of the wireheading problem by revising the feedback loop to remove the incentive to wirehead. Each revision seems to change the nature of the problem (perhaps to the point where we don’t want to call it a wireheading problem, and instead would put it in a more general perverse-instantiation category), but not eliminate problems entirely.

Talking with Lawrence Chan today, he described a way of solving problems by “going meta” (a strategy he was mostly suspicious of, in the conversation). His example was: you can’t extract human values by framing the task as a learning problem, because of severe identifiability problems. However, it is not entirely implausible that we can “learn to learn human values”: have humans label examples of other humans trying to do things, indicating what values are being expressed in the scenario.

If this goes wrong, you can iterate the operation again, learning to learn to learn...

This struck me as similar to the hierarchy I had constructed in my older post.

My interpretation of what Lawrence meant by “going meta” here is this: machine learning research “eats” other research fields by using automated learning to solve problems which were previously being solved by the process of science, i.e., hand-crafting hypotheses and testing them. AI alignment research is full of cases where this doesn’t seem like a very good approach. However, one attitude we can take to such cases is to do the operation again: propose to learn how humans would solve this sticky problem.

This is not at all like other learning-to-learn approaches, which merely seek to speed up normal learning. The idea is that our object-level loss function is insufficient to point out the behavior we really want. We want new normative feedback to come in at the meta level, telling us more about which ways of solving the object-level problem are desirable and which are undesirable.

The idea I’m about to describe seems like a fairly hopeless idea, but I’m interested in seeing how it would go regardless.

What is the fixed point of this particular “go meta” operation?

The intuition is this: any utility function we try to write down has perverse instantiations, so that we don’t really want to optimize it fully. Searching over a big space leads to Goodhart and optimization daemons. Unfortunately, search is more or less the only way to produce intelligent behavior that we know of.

However, it seems like we can often improve on this situation by providing more human input to check what was really wanted. Furthermore, it seems like we generally get more by doing this on the meta level: we don’t just want to refine the estimated utility function; we want to refine our notion of safely searching for good options (avoiding searches which Goodhart on looks-good-to-humans by manipulating human psychology, for example), refine our notion of what learning the utility function even means, and so on.

Every stage of going meta introduces a need for yet another search, which brings back the problems all over again. But maybe we can do something interesting by jumping up all the meta levels here, so that each search is itself governed by some feedback, except when we bottom out in extremely simple operations which we trust.

(This feels conceptually similar to some intuitions in HCH/IDA, but I don’t see that it is exactly the same.)

## Recursive Quantilization

“Recursive quantilization” is an attempt to make the idea a little more formal. I don’t think it quite captures everything I would want from the “fixed point of the meta operation Lawrence Chan was suspicious of”, but it has the advantage of being slightly more concrete.

Quantilizers are a way of optimizing a utility function when you’re suspicious that it isn’t the “true” utility function you should be optimizing, but you do think that the average difference is low when sampling things from a known background distribution. Intuitively, you don’t want to move too far from the background distribution where your utility estimates are accurate, but you do want to optimize in the direction of high utility somewhat.

What if we want to quantilize, and we expect that there is some background distribution which would make us have a decent amount of trust in the accuracy of the given utility function, but we don’t know what that background distribution is?
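To make this concrete, here is a minimal sketch of a quantilizer, not taken from the post; the names `base_sample` and `utility` are placeholders. It draws many samples from the trusted background distribution, then picks uniformly at random from the top q-fraction by estimated utility:

```python
import random

def quantilize(base_sample, utility, q=0.1, n=1000, rng=None):
    """Draw n options from the background distribution, then return a
    uniformly random pick from the top q-fraction by estimated utility.
    Lower q optimizes harder; q = 1 just samples from the background."""
    rng = rng or random.Random(0)
    samples = sorted((base_sample(rng) for _ in range(n)),
                     key=utility, reverse=True)
    return rng.choice(samples[:max(1, int(q * n))])
```

For instance, quantilizing `utility(x) = x` over uniform samples on [0, 1] with q = 0.1 yields a draw from roughly the top decile, rather than hunting for the single extreme point where the utility estimate is least trustworthy.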

Learning is going to require a search for hypotheses matching whatever feedback we get, which re-introduces Goodhart, etc. So, we quantilize that search. But we need a background distribution which we expect to be safe. And so on.

• You start with very broad priors on what background distributions might be safe, so you barely optimize at all, but have some default (human-programmed) strategy of asking humans questions.

• You engage in active learning, steering toward questions which resolve the most important ambiguities (to the extent that you’re willing to steer).

• Because we are taking “all the meta levels” here, we can do some amount of generalization across meta levels so that the stack of meta doesn’t get out of hand. In other words, we’re actually learning one safe background distribution for all the meta levels, which encodes something like a human concept of “non-fishy” ways of going about things.
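The tower described above can be caricatured in code. This is a toy sketch under heavy assumptions (all names are invented, and real human feedback is replaced by a scoring function): each quantilization step selects the background distribution for the level below, bottoming out in a simple, barely-optimizing prior at the top.

```python
import random

def quantilize(sampler, score, q, n, rng):
    """Return one sample from the top q-fraction of n draws from sampler."""
    samples = sorted((sampler(rng) for _ in range(n)),
                     key=score, reverse=True)
    return rng.choice(samples[:max(1, int(q * n))])

def recursive_quantilize(score_stack, top_prior, q=0.1, n=200, rng=None):
    """score_stack[0] stands in for human feedback at the highest meta
    level; score_stack[-1] is the object-level utility estimate.
    top_prior is the trusted, human-programmed default distribution at
    the top of the tower. Each quantilization yields the background
    distribution one level down; the last yields an object-level plan."""
    rng = rng or random.Random(0)
    sampler = top_prior
    for score in score_stack:
        sampler = quantilize(sampler, score, q, n, rng)
    return sampler
```

With two levels, `top_prior` samples candidate plan-distributions, the meta-level score judges how “non-fishy” a candidate distribution looks, and the object-level score is the utility estimate. Generalizing across levels, as in the last bullet, would amount to reusing one learned score for every meta step.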

## Issues

There are a lot of potential concerns here, but the one which is most salient to me is that humans will have a lot of trouble providing feedback about non-fishy ways of solving the problems at even slightly high meta levels.

Object level: Plans for achieving high utility.

Meta 1: Distributions containing only plans which the utility function evaluates correctly.

Meta 2: Distributions containing only distributions-on-plans which the first-meta-level learning algorithm can be expected to evaluate correctly.

Et cetera.

How do you analyze a distribution? Presumably you have to get a good picture of its shape in the highly multidimensional space: look at examples of more and less typical members, and be convinced that the examples you looked at were representative. It’s also important that you go into its code and check that it isn’t intelligently optimizing for some misaligned goal.

It seems to me that a massive advance in transparency or informed oversight would be needed in order for humans to give helpful feedback at higher meta levels.

• Ultimately I think you’ll encounter a difficulty here due to epistemic circularity: you’ll eventually need to know about a distribution you can’t go more meta on, because doing so would be functionally equivalent to solving the problem of the criterion, completely discovering the universal prior, grounding induction in general, etc. Not that we don’t always have to deal with this; it’s just that in particular I don’t expect going meta to help much beyond reducing the number of free variables you have to consider. Getting that number down is helpful, but you’ll still be left with them.

• I’m not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances.

On the other hand, it does seem to me that the higher up you are in meta levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can’t depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.

• I share both of these intuitions.

That being said, I’m not convinced that the space of concepts is smaller as you get more meta. (Naively speaking, there are ~exponentially more distributions over distributions than distributions, though some strong simplicity biases can cut this down a lot.) I suspect that one reason the space of concepts seems “smaller” is that we’re worse at differentiating concepts at higher levels of meta-ness. For example, it’s often easier to figure out the consequences of concrete action X than the consequences of adopting a particular ethical system, and a lot of philosophy on metaethics seems more confused than philosophy on ethics. I think this is related to the “it’s more difficult to get feedback” intuition, where we have fewer distinct buckets because it’s too hard to distinguish between similar theories at sufficiently high meta levels.

• Yeah, I think I agree with all of that. Perhaps the better way to state my position is: conditional on there being good normative feedback on the meta level, I would expect the space of concepts on the meta level to be smaller than on the object level.