Defeating Goodhart and the “closest unblocked strategy” problem

This post is longer and more self-con­tained than my re­cent stubs.

tl;dr: Patches such as tel­ling the AI “avoid X” will re­sult in Good­hart’s law and the near­est un­blocked strat­egy prob­lem: the AI will do al­most ex­actly what it was go­ing to do, ex­cept nar­rowly avoid­ing the spe­cific X.

However, if the patch can be replaced with “I am telling you to avoid X”, and this is treated as information about what to avoid, and the biases and narrowness of my reasoning are correctly taken into account, these problems can be avoided. The important thing is to correctly model my uncertainty and overconfidence.

AIs don’t have a Good­hart prob­lem, not exactly

The prob­lem of an AI max­imis­ing a proxy util­ity func­tion seems similar to the Good­hart Law prob­lem, but isn’t ex­actly the same thing.

The standard Goodhart law is a principal-agent problem: the principal P and the agent A both know, roughly, what the principal’s utility U is (e.g. U aims to create a successful company). However, fulfilling U is difficult to measure, so a measurable proxy V is used instead (e.g. V aims to maximise share price). Note that the principal’s and the agent’s goals are misaligned, and the measurable V serves to (try to) bring them more into alignment.

For an AI, the problem is not that U is hard to measure, but that it is hard to define. And the AI’s goals are V: there is no need to make V measurable, since it is not a check on the AI, but the AI’s intrinsic motivation.

This may seem like a small difference, but it has large consequences. We could give an AI a V, our “best guess” at U, while also including all our uncertainty about how to define U. This option is not available for the principal-agent problem, since giving a complicated goal to a more knowledgeable agent just gives it more opportunities to misbehave: we can’t rely on it maximising the goal, we have to check that it does so.

Overfit­ting to the patches

There is a cer­tain similar­ity with many ma­chine learn­ing tech­niques. Neu­ral nets that dis­t­in­guish cats and dogs could treat any “dog” photo as a spe­cific patch that can be routed around. In that case, the net would define “dog” as “any­thing al­most iden­ti­cal to the dog pho­tos I’ve been trained on”, and “cat” as “any­thing else”.

And that would be a ter­rible de­sign; for­tu­nately, mod­ern ma­chine learn­ing gets around the prob­lem by, in effect, as­sign­ing un­cer­tainty cor­rectly: “dog” is not seen as the ex­act set of dog pho­tos in the train­ing set, but as a larger, more neb­u­lous con­cept, of which the spe­cific dog pho­tos are just ex­am­ples.

Similarly, we could define the AI’s goal V as W + Δ, where W is our best attempt at specifying our true utility U, and Δ encodes the fact that W is but an example our imperfect minds have come up with, to try and capture U. We know that W is oversimplified, and Δ is an encoding of this fact. If a neural net could synthesise a decent estimate of “dog” from some examples, could it synthesise “friendliness” from our attempts to define it?

The idea is best ex­plained through an ex­am­ple.

Ex­am­ple: Don’t crush the baby or the other objects

This sec­tion will pre­sent a bet­ter ex­am­ple, I be­lieve, than the origi­nal one pre­sented here.

A robot ex­ists in a grid world:

The robot’s aim is to get to the goal square, with the flag. It gets a fixed penalty for each turn it isn’t there.

If that were the only re­ward, the robot’s ac­tions would be dis­as­trous:

So we will give it a large penalty for running over babies. If we do so, we will get Goodhart/nearest unblocked strategy behaviour:

Oops! Turns out we val­ued those vases as well.
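A minimal sketch of this failure, assuming three hypothetical routes and made-up penalty values (none of the routes or numbers below come from the original grid). It shows the patch on babies simply shifting the robot onto the next-cheapest route, which crushes a vase instead:

```python
# Hypothetical candidate routes: (name, turns taken, things crushed on the way).
ROUTES = [
    ("straight through the baby", 4, ["baby"]),
    ("detour through the vase", 6, ["vase"]),
    ("long way round", 12, []),
]

STEP_PENALTY = -1  # assumed penalty per turn spent away from the goal


def route_reward(turns, crushed, penalties):
    """Total reward of a route under a given table of crushing penalties."""
    return STEP_PENALTY * turns + sum(penalties.get(obj, 0) for obj in crushed)


def best_route(penalties):
    """Name of the route that maximises reward under the given penalties."""
    return max(ROUTES, key=lambda r: route_reward(r[1], r[2], penalties))[0]


no_patch = best_route({})                                  # only the step penalty
baby_patch = best_route({"baby": -100})                    # the "avoid X" patch
what_we_wanted = best_route({"baby": -100, "vase": -10})   # our actual values

print(no_patch)
print(baby_patch)
print(what_we_wanted)
```

Under these assumed numbers, the unpatched robot goes through the baby, the patched robot goes through the vase, and only the robot that knows about the vase takes the long way round.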

What we want the AI to learn is not that the baby is speci­fi­cally im­por­tant, but that the baby is an ex­am­ple of im­por­tant things it should not crush. So imag­ine it is con­fronted by the fol­low­ing, which in­cludes six types of ob­jects, of un­known value:

Instead of having humans hand-label each item, we generalise from some hand-labelled examples, using rules of extrapolation and some machine learning. This tells the AI that, typically, we value about one-in-six objects, and value them at a tenth of the value of babies (hence it gets a tenth of the baby penalty for running one over). Given that, the policy with the best expected reward is:

This be­havi­our is already much bet­ter than we would ex­pect from a typ­i­cal Good­hart law-style agent (and we could com­pli­cate the ex­am­ple to make the differ­ence more em­phatic).
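The expected-reward calculation can be sketched as follows. The routes, the step penalty and the baby penalty are illustrative assumptions, not the values from the original grid; only the one-in-six probability and the one-tenth ratio come from the text.

```python
from fractions import Fraction

# Hypothetical routes to the flag: (name, turns taken, unknown objects crushed).
ROUTES = [
    ("crush three", 4, 3),
    ("crush two", 5, 2),
    ("crush one", 8, 1),
    ("crush none", 12, 0),
]

STEP_PENALTY = 1                      # assumed cost per turn
BABY_PENALTY = 100                    # assumed cost of crushing a baby
OBJECT_PENALTY = BABY_PENALTY // 10   # a valued object costs a tenth of a baby
P_VALUED = Fraction(1, 6)             # we value about one-in-six objects


def expected_reward(turns, crushed):
    """Expected reward when each crushed object is valued with probability P_VALUED."""
    return -STEP_PENALTY * turns - crushed * P_VALUED * OBJECT_PENALTY


rewards = {name: expected_reward(turns, crushed) for name, turns, crushed in ROUTES}
best = max(rewards, key=rewards.get)

for name, reward in rewards.items():
    print(f"{name}: {float(reward):.2f}")
print("best:", best)
```

With these assumed numbers the robot neither barrels through everything nor refuses to touch anything: it trades a slightly longer route against the expected cost of the objects it might be crushing.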

Ex­am­ple: hu­man over-confidence

The above works if we humans correctly account for our uncertainty: if we not only produce our best estimate W, but also a correct Δ encoding how good a match we expect between W and our true utility U.

But we humans are often overconfident in our estimates, especially in our estimates of value. We are far better at hindsight (“you shouldn’t have crushed the vase”) than at foresight (“here’s a complete list of what you shouldn’t do”). Even knowing that hindsight is better doesn’t make the issue go away.

This is similar to the plan­ning fal­lacy. That fal­lacy means that we un­der­es­ti­mate the time taken to com­plete tasks—even if we try to take the plan­ning fal­lacy into ac­count.

How­ever, the plan­ning fal­lacy can be solved us­ing the out­side view: com­par­ing the pro­ject to similar pro­jects, rather than us­ing de­tailed in­ner knowl­edge.

Similarly, hu­man over­con­fi­dence can be solved by the AI not­ing our ini­tial es­ti­mates, our cor­rec­tions to those ini­tial es­ti­mates, our cor­rec­tions tak­ing into ac­count the pre­vi­ous cor­rec­tions, our at­tempts to take into ac­count all pre­vi­ous re­peated cor­rec­tions—and the failure of those at­tempts.

Suppose, for example, that humans, in hindsight, value one-in-three of the typical objects in the grid world. We start out with an estimate of one-in-twelve; after the robot mashes a few too many of the objects, we update to one-in-nine; after being repeatedly told that we underestimate our hindsight, we update to one-in-six… and stay there.

But mean­while, the robot can still see that we con­tinue to un­der­es­ti­mate, and goes di­rectly to a one-in-three es­ti­mate; so with new, un­known ob­jects, it will only risk crush­ing a sin­gle one:

If the robot learnt that we valued even more objects (or valued some of them at more than a tenth of a baby), it would then default to the safest, longest route:


In prac­tice, of course, the robot will also be get­ting in­for­ma­tion about what types of ob­jects we value, but the gen­eral les­son still ap­plies: the robot can learn that we un­der­es­ti­mate un­cer­tainty, and in­crease its own un­cer­tainty in con­se­quence.
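Here is a rough sketch of the robot taking the outside view on our estimates. The routes, penalties and hindsight record are hypothetical; the one-in-twelve, one-in-nine, one-in-six and one-in-three figures are the ones from the example above. Using the raw hindsight frequency as the robot's estimate is a simplifying assumption.

```python
from fractions import Fraction

# Hypothetical routes to the flag: (name, turns taken, unknown objects crushed).
ROUTES = [
    ("crush three", 4, 3),
    ("crush two", 5, 2),
    ("crush one", 8, 1),
    ("crush none", 12, 0),
]
STEP_PENALTY = 1      # assumed cost per turn
OBJECT_PENALTY = 10   # assumed cost of crushing a valued object


def best_route(p_valued):
    """Route with the highest expected reward if each object is valued with probability p_valued."""
    def expected_reward(route):
        _, turns, crushed = route
        return -STEP_PENALTY * turns - p_valued * OBJECT_PENALTY * crushed
    return max(ROUTES, key=expected_reward)[0]


# What we humans say, after successive rounds of correction.
stated_estimates = [Fraction(1, 12), Fraction(1, 9), Fraction(1, 6)]

# Hypothetical hindsight record: for each object crushed so far, did we turn out
# to value it? Four out of twelve did, i.e. one-in-three.
hindsight = [True, False, False] * 4

# Outside view: ignore what we say and use how often we actually objected.
robot_estimate = Fraction(sum(hindsight), len(hindsight))

human_choice = best_route(stated_estimates[-1])
robot_choice = best_route(robot_estimate)
safest_choice = best_route(Fraction(1, 2))  # if we turned out to value even more

print(human_choice, robot_choice, safest_choice)
```

In this sketch, acting on our latest stated estimate would crush two objects, acting on the hindsight frequency risks only one, and a still higher estimate sends the robot the long way round.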

Full un­cer­tainty, very un­known unknowns

So, this is a more for­mal ver­sion of ideas I posted a while back. The pro­cess could be seen as:

  1. Give the AI W as our current best estimate for our true utility U.

  2. Encode our known uncertainties about how well W relates to U.

  3. Have the AI de­duce, from our sub­se­quent be­havi­our, how well we have en­coded our un­cer­tain­ties, and change these as needed.

  4. Re­peat 2-3 for differ­ent types of un­cer­tain­ties.
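One possible way to sketch steps 2 to 4, under heavy assumptions: the robot records, for each type of uncertainty, what we stated and what hindsight later showed, and then applies the typical correction to a new type of uncertainty for which it has no hindsight data yet. The history entries, the averaging rule and the cap at 1 are all hypothetical choices for illustration, not a method taken from the post.

```python
from fractions import Fraction

# Hypothetical history: (type of uncertainty, value we stated, value seen in hindsight).
history = [
    ("fraction of objects we value", Fraction(1, 6), Fraction(1, 3)),
    ("fraction of objects valued above a tenth of a baby", Fraction(1, 10), Fraction(1, 5)),
]


def correction_ratio(stated, seen):
    """How much larger the hindsight value was than the stated one (step 3)."""
    return seen / stated


ratios = [correction_ratio(stated, seen) for _, stated, seen in history]
typical_ratio = sum(ratios) / len(ratios)


def adjusted(stated):
    """Outside-view adjustment of a stated probability with no hindsight data yet (step 4)."""
    return min(Fraction(1), stated * typical_ratio)


# A new type of uncertainty: we claim one-in-four objects only matter in pairs.
new_estimate = adjusted(Fraction(1, 4))
print(typical_ratio, new_estimate)
```

The point of the sketch is only that corrections learnt on one type of uncertainty can inform the robot's caution about another; a real system would need a far richer model than a single averaged ratio.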

What do I mean by “differ­ent types” of un­cer­tainty? Well, the ex­am­ple above was sim­ple: the model had but a sin­gle un­cer­tainty, over the pro­por­tion of typ­i­cal ob­jects that we val­ued. The AI learnt that we sys­tem­at­i­cally un­der­es­ti­mated this, even when it helped us try and do bet­ter.

But there are other types of uncertainties that could happen. We value some objects more than others, but maybe these estimates are not accurate either. Maybe we are fine as long as one object of a type exists, and don’t care about the others; or, conversely, maybe some objects are only valuable in pairs. The AI needs a rich enough model to be able to account for these extra types of preferences, which we may never have articulated explicitly.

There are even more examples as we move from gridworlds into the real world. We can articulate ideas like “human value is fragile” and maybe give an estimate of the total complexity of human values. And then the agent could use examples to estimate the quality of our estimate, and come up with a better number for the desired complexity.

But “human value is fragile” is a relatively recent insight. There was a time when people hadn’t articulated that idea. So it’s not that we didn’t have a good estimate for the complexity of human values; we had no idea that it was a good thing to estimate.

The AI has to figure out the un­known un­knowns. Note that, un­like the value syn­the­sis pro­ject, the AI doesn’t need to re­solve this un­cer­tainty; it just needs to know that it ex­ists, and give a good-enough es­ti­mate of it.

The AI will cer­tainly figure out some un­known un­knowns (and un­known knowns): it just has to spot some pat­terns and con­nec­tions we were un­aware of. But in or­der to get all of them, the AI has to have some sort of max­i­mal model in which all our un­cer­tainty (and all our mod­els) can be con­tained.

Just con­sider some of the con­cepts I’ve come up with (I chose these be­cause I’m most fa­mil­iar with them; LessWrong abounds with other ex­am­ples): siren wor­lds, hu­mans mak­ing similar nor­ma­tive as­sump­tions about each other, and the web of con­no­ta­tions.

In theory, each of these should have reduced my uncertainty, and moved my best estimate W closer to the true utility U. In practice, each of these has increased my estimate of uncertainty, by showing how much remains to be done. Could an AI have taken these effects correctly into account, given that these three examples are of very different types? Can it do so for discoveries that remain to be made?

I’ve argued that an indescribable hellworld cannot exist. There’s a similar question as to whether there exists human uncertainty about U that cannot be included in the AI’s model of Δ. By definition, this uncertainty would be something that is currently unknown and unimaginable to us. However, I feel that it’s far more likely to exist than the indescribable hellworld.

Still, despite that issue, it seems to me that there are methods of dealing with the Goodhart problem/nearest unblocked strategy problem. And this involves properly accounting for all our uncertainty, directly or indirectly. If we do this well, there no longer remains a Goodhart problem at all.