If I were a well-intentioned AI… III: Extremal Goodhart

In this post, I’ll be looking at a more extreme version of the problem in the previous post.

The Extremal Goodhart problem is the fact that “Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.”

And one of the easiest ways to reach extreme values, and very different worlds, is to have a powerful optimiser at work. So forget about navigating in contained mazes: imagine me as an AI that has a great deal of power over the world.

Running example: curing cancer

Suppose I was supposed to cure cancer. To that end, I’ve been shown examples of surgeons cutting out cancer growths. After the operation, I got a reward score, indicating, let’s say, roughly how many cancerous cells remained in the patient.

Now I’m deployed, and have to infer the best course of action from this dataset. Assume I have three main courses of action open to me: cutting out the cancerous growths with a scalpel, cutting them out much more effectively with a laser, or dissolving the patient in acid.

Apprenticeship learning

Cutting out the growth with a scalpel is what I would do under apprenticeship learning. I’d just be trying to imitate the human surgeons as best I can, reproducing their actions and their outcomes.

This is a relatively well-defined problem, but the issue is that I can’t do much better than a human. Since the score grades the surgeries, I can use that information as well, allowing me to imitate the best surgeries.

But I can’t do much better than that. I can only do as well as the mix of the best features of the surgeries I’ve seen. Nevertheless, apprenticeship learning is the “safe” base from which to explore more effective approaches.

Going out of distribution, tentatively

Laser vs acid

Laser surgery is something that gives me a high score in a new and more effective way. Unfortunately, dissolving the patient in acid also fulfils the requirement of getting rid of the cancer cells, in a new and (much) more effective way. Both methods involve me going off-distribution compared with the training examples. Is there any way of distinguishing between the two?

First, I could get the follow-up data on the patients (and the pre-operation data, though that is less relevant). I want to get a distribution of the outcomes of the operations, and make sure that the features of the outcomes of my own operations are similar.

So, I’d note some things that correlate with a high operation score versus a low operation score. The following are plausible features I might find correlating positively with a high operation score (they vary in how desirable each correlate actually is, as discussed below):

- the patient surviving the operation
- the patient surviving for some years afterwards
- the patient thanking the surgeon
- the patient paying more taxes
- the patient being more prone to dementia
- the patient complaining about pain

This is a mix of human-desirable features (surviving the operation; surviving for some years after), some features only incidentally correlated with desirable features (thanking the surgeon; paying more taxes, because they survived longer, so paid more) and some actually anti-correlated (being more prone to dementia, again, because they survived longer; complaining about pain, because those who are told the operation was unsuccessful have other things to complain about).
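The “noting correlations” step can be sketched as a simple computation over follow-up data. Everything here is a hypothetical stand-in: a numeric matrix of post-operation outcome features and a vector of operation scores, with toy data in place of real patient records.

```python
import numpy as np

def feature_score_correlations(features, scores):
    """Pearson correlation of each outcome feature with the operation score.

    features: (n_patients, n_features) array of post-operation outcome features
    scores:   (n_patients,) array of operation scores
    """
    # Standardise both, then a dot product gives the correlations.
    f = (features - features.mean(axis=0)) / features.std(axis=0)
    s = (scores - scores.mean()) / scores.std()
    return f.T @ s / len(s)  # one correlation per feature

# Hypothetical toy data: the first feature tracks the score
# (like "survived the operation"), the second is unrelated noise.
rng = np.random.default_rng(0)
scores = rng.normal(size=500)
features = np.stack([scores + 0.1 * rng.normal(size=500),
                     rng.normal(size=500)], axis=1)
corr = feature_score_correlations(features, scores)
```

In practice the features a real system finds would not come pre-labelled as “desirable” or “anti-correlated”; that is exactly the gap the rest of the post is about.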

Preserving the feature distribution

If I aim to maximise the reward while preserving this feature distribution, that is enough to show that laser surgery is better than dissolving the patient in acid. This can be seen as maintaining the “web of connotations” around successful cancer surgery. This web of connotations/feature distribution acts as an impact measure for me to minimise, while I also try to maximise the surgery score.
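The “reward minus impact measure” trade-off can be sketched as follows. The penalty used here, the squared shift in the mean of the outcome features, is a deliberately crude stand-in for a distance between feature distributions, and all the numbers (rewards, feature values) are hypothetical:

```python
import numpy as np

def penalised_value(reward, new_features, baseline_features, strength=1.0):
    """Score a candidate policy: raw reward minus an impact penalty.

    The penalty is a simple proxy for 'distance between feature
    distributions': squared difference of the feature means, summed.
    """
    shift = new_features.mean(axis=0) - baseline_features.mean(axis=0)
    return reward - strength * float(shift @ shift)

# Hypothetical outcome features under human surgery (baseline),
# under laser surgery (close to baseline), and under acid (wildly different).
rng = np.random.default_rng(1)
baseline = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(200, 2))
laser    = rng.normal(loc=[1.1, 1.0], scale=0.1, size=(200, 2))
acid     = rng.normal(loc=[5.0, -3.0], scale=0.1, size=(200, 2))

# Acid scores slightly higher on the raw proxy, but its outcome
# distribution is so far from baseline that the penalty dominates.
v_laser = penalised_value(reward=0.9, new_features=laser, baseline_features=baseline)
v_acid  = penalised_value(reward=1.0, new_features=acid, baseline_features=baseline)
```

The `strength` parameter is the knob the next paragraph worries about: set it too high and the agent collapses back to pure imitation.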

Of course, if the impact measure is too strong, I’m back to apprenticeship learning with a scalpel. Even if I do opt for laser surgery, I’ll be doing some other things to maintain the negative parts of the correlation: making sure to cause some excessive pain, ensuring their risk of dementia is not reduced, prodding them to thank me, and getting them to fill out their tax returns.

Like quantilisers, with a meaningful “quant” level

Quantilisers aim to avoid Goodhart problems by not choosing the reward-optimising policy, but one that optimises the reward only to a certain extent.

The problem is that there is no obvious level to set that “certain extent” to. Humans are flying blind, trying to estimate both how much I can maximise the proxy in a pernicious way, and how low my reward objective needs to be, compared with that, so that I’m likely to find a non-pernicious policy.
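For concreteness, a quantiliser samples from the top fraction q of a safe base distribution, ranked by proxy reward, instead of taking the argmax. A minimal sketch, with the actions, rewards and base weights all hypothetical:

```python
import random

def quantilise(actions, reward, base_weights, q=0.1, rng=None):
    """Sample from the top-q fraction (by reward) of a safe base distribution.

    actions:      candidate actions
    reward:       function action -> proxy reward
    base_weights: probability of each action under the base distribution
                  (e.g. imitation of human surgeons)
    q:            fraction of base-distribution mass to keep
    """
    rng = rng or random.Random(0)
    # Rank actions by proxy reward, best first.
    ranked = sorted(zip(actions, base_weights),
                    key=lambda aw: reward(aw[0]), reverse=True)
    # Keep the highest-reward actions until q of the base mass is covered.
    kept, mass = [], 0.0
    for action, weight in ranked:
        kept.append((action, weight))
        mass += weight
        if mass >= q:
            break
    # Sample within that set, proportionally to base-distribution weight.
    acts, weights = zip(*kept)
    return rng.choices(acts, weights=weights, k=1)[0]

# Hypothetical setup: acid scores highest on the proxy, but human surgeons
# essentially never do it, so it has no base-distribution mass.
proxy = {"acid": 10.0, "laser": 2.0, "scalpel": 1.0}
base = [0.0, 0.05, 0.95]  # weights for ["acid", "laser", "scalpel"]
pick = quantilise(["acid", "laser", "scalpel"], proxy.get, base, q=0.1)
```

A reward-maximiser would pick acid every time; the quantiliser never does, because acid has no mass under the base distribution. The whole difficulty, as above, is that nothing in the formalism says whether q = 0.1 is the right level.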

Here, what I am trying to preserve is a certain distribution of outcome features, rather than a certain percentage of the optimal outcome. This is easier to calibrate, and to improve on, if I can get human feedback.

Adding human conservatism, and human permissions

The obvious thing is to ask humans which features are correlated with their true preferences, and which are not. Now, the features I find are unlikely to be expressible as neatly in human terms as they are written above. But there are plausible methods for me to figure out valuable features on my own, and the more experience I have with humans, the more I know the features they think in terms of.

Then I just need to figure out the right queries to get that information, and I might be able to work out that surviving is a positive, while pain and dementia are not.

But I’m still aware of the Goodhart problem, and the perennial issue of humans not knowing their own preferences well. So I won’t maximise survival or painlessness blindly; I’ll just aim to increase them. Increase them how much? Well, increase them until their web of connotations/feature distribution starts to break down. “Not in pain” does not correlate with “blindly happy and cheerful and drugged out every moment of their life”, so I’ll stop maximising the first before it reaches the second. Especially since “not in pain” does correlate with various measures of “having a good life” which would vanish in the drugged-out scenario.
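One way to read this stopping rule: keep pushing a target feature only while its usual correlates keep moving with it. The sketch below is a toy version of that idea; the `good_life` curve standing in for the web of connotations is entirely made up (rising with painlessness up to a point, then falling in the drugged-out regime):

```python
import numpy as np

def push_feature(values, correlate, step=0.1, max_steps=100):
    """Increase a target feature until its usual correlate stops rising.

    values:    observed values of the target feature (e.g. 'not in pain')
    correlate: function mapping a feature value to the expected value of a
               correlated 'good life' measure (a stand-in for the web of
               connotations around the feature)
    """
    x = float(np.mean(values))
    while max_steps:
        # The connotation is breaking down: pushing further no longer
        # improves the correlate, so stop here.
        if correlate(x + step) <= correlate(x):
            break
        x += step
        max_steps -= 1
    return x

# Hypothetical correlate: 'having a good life' rises with painlessness
# up to 3.0, then falls off (the drugged-out regime).
good_life = lambda p: -(p - 3.0) ** 2
stop = push_feature(values=[0.9, 1.0, 1.1], correlate=good_life)
```

So the feature is increased well past its observed range, but the push stops around the point where the correlated “good life” measure would start to vanish, rather than at an arbitrary pre-set quantile.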

What would be especially useful would be human examples of “it’s ok to ignore/minimise that feature when you act”. So not only examples of human surgeries with scores, but descriptions of hypothetical operations (descriptions provided by them or by me) that are even better. Thus, I could learn that “no cancer, no pain, quickly discharged from hospital, cheap operation” is a desirable outcome. Then I can start putting pressure on those features as well, pushing until the web of connotations/feature distribution gets too far out of whack.

Conservatism, and outer loop optimisation

Of course, I still know that humans will underestimate uncertainty, even when trying not to. So I should add an extra layer of conservatism on top of what they think they require. The best way of doing this is to maximise the opportunities humans have to articulate their preferences: making small changes initially, gradually increasing them, and getting feedback both on the changes and on how humans imagine future changes.

But even given that, I have to take into account that I have a superior ability to find unexpected ways of maximising rewards, and an implicit pressure to describe these in more human-seductive ways (this pressure can be implicit, just because humans would naturally choose these options more often). So I can treat my interactions with humans as a noisy, biased channel (even if I’ve eliminated all noise and bias from my perspective), and be cautious about this too.

However, unlike the maximal conservatism of the Inverse Reward Design paper, I can learn to diminish my conservatism gradually. Since I’m also aware of issues like “outer loop optimisation” (the fact that humans tuning models can add a selection pressure), I can take that into account too. If I have access to the code of my predecessors, and knowledge of how it came about that I replaced them, I can try to estimate this effect as well. As always, the more I know about human research on outer loop optimisation, the better I can account for it, because this research gives me an idea of what humans consider impermissible outer loop optimisation.
