Does Bayes Beat Goodhart?

Stuart Armstrong has claimed to beat Goodhart with Bayesian uncertainty—rather than assuming some particular objective function (which you try to make as correct as possible), you represent some uncertainty. A similar claim was made in The Optimizer’s Curse and How to Beat It, the essay which introduced a lot of us to … well, not Goodhart’s Law itself (the post doesn’t mention Goodhart), but that kind of failure. I myself claimed that Bayes beats regressional Goodhart, in Robust Delegation.

I now think this isn’t true—Bayes’ Law doesn’t beat Goodhart fully. It doesn’t even beat regressional Goodhart fully. (I’ll probably edit Robust Delegation to change the claim at some point.)

(Stuart makes some more detailed claims about AI and the nearest-unblocked-strategy problem which aren’t exactly claims about Goodhart, at least according to him. I don’t fully understand Stuart’s perspective, and don’t claim to directly address it here. I am mostly only addressing the question in the title of this post: does Bayes beat Goodhart?)

If approximate solutions are concerning, why would mixtures of them be unconcerning?

My first argument is a loose intuition: Goodhartian phenomena suggest that somewhat-correct-but-not-quite-right proxy functions are not safe to optimize (and in some sense, the more optimization pressure is applied, the less safe we expect it to be). Assigning weights to a bunch of somewhat-but-not-quite-right possibilities just gets us another somewhat-but-not-quite-right possibility. Why would we expect this to fundamentally solve the problem?

  • Perhaps the Bayesian mixture across hypotheses is closer to being correct, and therefore gives us an approximation which is able to stand up to more optimization pressure before it breaks down. But this is a quantitative distinction, not a qualitative one. How big of a difference do we expect that to make? Wouldn’t it still break down about as badly when put under tremendous optimization pressure?

  • Perhaps the point of the Bayesian mixture is that, by quantifying uncertainty about the various hypotheses, it encourages strategies which hedge their bets—satisfying a broad range of possible utility functions by avoiding doing something terrible for one utility function in order to get a few more points for another. But this incentive to hedge bets is fairly weak; the optimization is still encouraged to do something really terrible for one function if it leads to a moderate increase for many other utility functions.
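To make this worry concrete, here is a toy sketch (the setup and all numbers are hypothetical, not from any of the posts discussed): a weighted mixture of candidate utility functions is itself just one more function over outcomes, and its argmax can still be very bad for an individual candidate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 candidate outcomes scored by 10 candidate
# utility functions (imagine one of them is the "true" one).
n_utils, n_outcomes = 10, 1000
utils = rng.normal(size=(n_utils, n_outcomes))

# A Bayesian mixture over the candidates is just another
# somewhat-but-not-quite-right utility function over outcomes.
weights = rng.dirichlet(np.ones(n_utils))
mixture = weights @ utils

# Heavy optimization pressure: take the argmax of the mixture.
best = int(np.argmax(mixture))

# Weak hedging incentive: the chosen outcome may still score badly on
# some individual candidate, as long as the weighted sum makes up for it.
print(utils[:, best].min(), utils[:, best].mean())
```

Nothing in the mixture construction forbids the selected outcome from being terrible according to one of the candidates; the weights only set the exchange rate at which that sacrifice trades off against gains elsewhere.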

My intuition there doesn’t address the gears of the situation adequately, though. Let’s get into it.

Overcoming regressional Goodhart requires calibrated learning.

In Robust Delegation, I defined regressional Goodhart through the predictable-disappointment idea. Does Bayesian reasoning eliminate predictable disappointment?

Well, it depends on what is meant by “predictable”. You could define it as predictable-by-Bayes, in which case it follows that Bayes solves the problem. However, I think it is reasonable to at least add a calibration requirement: there should be no way to systematically correct estimates up or down as a function of the expected value.

Calibration seems like it does, in fact, significantly address regressional Goodhart. You can’t have seen a lot of instances of an estimate being too high, and still accept that too-high estimate. It doesn’t address extremal Goodhart, because calibrated learning can only guarantee that you eventually become calibrated, or converge at some rate, or something like that—extreme values that you’ve rarely encountered would remain a concern.
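A quick simulation (hypothetical numbers) of the predictable-disappointment phenomenon: if you pick the option with the highest noisy estimate, the winner’s estimate systematically overstates its true value, and this is exactly the kind of bias a calibrated learner would detect and correct.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: in each trial, 20 options with true values drawn
# from N(0, 1), observed through independent N(0, 1) estimation noise.
n_trials, n_options = 10_000, 20
true_vals = rng.normal(size=(n_trials, n_options))
estimates = true_vals + rng.normal(size=(n_trials, n_options))

# Optimize the proxy: select the option with the highest estimate.
pick = np.argmax(estimates, axis=1)
rows = np.arange(n_trials)

# Predictable disappointment: on average, the selected option's estimate
# exceeds its true value, so estimates of this size are systematically
# correctable downward.
gap = (estimates[rows, pick] - true_vals[rows, pick]).mean()
print(gap)  # positive: the winner's estimate overstates its value
```

The estimates here are unbiased before selection; it is the act of selecting on them that produces the systematic bias.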

(Stuart’s “one-in-three” example in the Defeating Goodhart post, and his discussion of human overconfidence more generally, are somewhat suggestive of calibration.)

Bayesian methods are not always calibrated. Calibrated learning is not always Bayesian. (For example, logical induction has good calibration properties, and so far hasn’t gotten a really satisfying Bayesian treatment.)

This might be confusing if you’re used to thinking in Bayesian terms. If you think in terms of the diagram I copied from Robust Delegation, above: you have a prior which stipulates the probability of the true utility U given an observation V; your expectation E[U|V=v] is the expected value of U for a particular value v of the observation; E[U|V] is not predictably correctable with respect to your prior. What’s the problem?

The problem is that this line of reasoning assumes that your prior is objectively correct. This doesn’t generally make sense (especially from a Bayesian perspective). So, it is perfectly consistent for you to collect many observations, and see that E[U|V] has some systematic bias. This may remain true even as you update on those observations (because Bayesian learning doesn’t guarantee any calibration property in general!).

The faulty assumption that your probability distribution is correct is often replaced with the (weaker, but still problematic) assumption that at least one hypothesis within your distribution is objectively correct—the realizability assumption.

Bayesian solutions assume realizability.

As discussed in Embedded World Models, the realizability assumption is the assumption that (at least) one of your hypotheses represents the true state of affairs. Bayesian methods often (though not always) require a realizability assumption in order to get strong guarantees. Frequentist methods rarely require such an assumption (whatever else you may say about frequentist methods). Calibration is an example of this—a Bayesian can get calibration under the assumption of realizability, but we might want a stronger guarantee of calibration which holds even in the absence of realizability.

“We quantified our uncertainty as best we could!”

One possible Bayes-beats-Goodhart argument is: “Once we quantify our uncertainty with a probability distribution over possible utility functions, the best we can possibly do is to choose whatever maximizes expected value. Anything else is decision-theoretically sub-optimal.”

Do you think that the true utility function is really sampled from the given distribution, in some objective sense? And the probability distribution also quantifies all the things which can count as evidence? If so, fine. Maximizing expectation is the objectively best strategy. This eliminates all types of Goodhart by positing that we’ve already modeled the possibilities sufficiently well: extremal cases are modeled correctly; adversarial effects are already accounted for; etc.

However, this is unrealistic due to embeddedness: the outside world is much more complicated than any probability distribution which we can explicitly use, since we are ourselves a small part of that world.

Alternatively, do you think the probability distribution really codifies your precise subjective uncertainty? OK, sure, that would also justify the argument.

Realistically, though, an implementation of this isn’t going to be representing your precise subjective beliefs (to the extent that you even have precise subjective beliefs). It has to hope to have a prior which is “good enough”.

In what sense might it be “good enough”?

An obvious problem is that a distribution might be overconfident in a wrong conclusion, which will obviously be bad. The fix for this appears to be: make sure that the distribution is “sufficiently broad”, expressing a fairly high amount of uncertainty. But why would this be good?

Well, one might argue: it can only be worse than our true uncertainty to the extent that it ends up assigning too little weight to the correct option. So, if the probability isn’t too small for any of the possibilities which we intuitively assign non-negligible weight, things should be fine.

“The True Utility Function Has Enough Weight”

First, even assuming the framing of a “true utility function” makes sense, it isn’t obvious to me that the argument holds.

If there’s a true utility function U* which is assigned weight w, and we apply a whole lot of optimization pressure to the overall mixture distribution, then it is perfectly possible that U* gets compromised for the sake of satisfying a large number of the other hypotheses u_i. The weight w determines a ratio at which trade-offs can occur, not a ratio of the overall resources which we will get or anything like that.

A first-pass analysis is that w has to be more than 1/2 to guarantee U* any consideration; with any weight less than that, it’s possible that U* is as low as it can go in the optimized solution, because some outcome was sufficiently good for all the other potential utility functions that it made sense to “take the hit” with respect to U*. We can’t formally say “this probably won’t happen, because the odds that the best-looking option is specifically terrible for U* are low” without assuming something about the distribution of highly optimized solutions.
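A minimal worked example of that threshold (all numbers hypothetical): with utilities normalized to [0, 1], an outcome that is perfect for every wrong hypothesis beats the best outcome for the true function U* exactly when U*’s weight w falls below 1/2.

```python
def mixture_value(scores, weights):
    """Expected utility of an outcome under the weighted mixture."""
    return sum(w * s for w, s in zip(weights, scores))

# Weight on the true utility U*, plus three wrong candidates.
w_true = 0.4
weights = [w_true, 0.2, 0.2, 0.2]

# Each outcome's scores are [U*, u1, u2, u3], normalized to [0, 1].
outcome_a = [1.0, 0.0, 0.0, 0.0]  # the best possible outcome for U*
outcome_b = [0.0, 1.0, 1.0, 1.0]  # minimal U*, perfect for the rest

# With w_true < 1/2, the optimizer "takes the hit" on U*:
print(mixture_value(outcome_a, weights))  # w_true = 0.4
print(mixture_value(outcome_b, weights))  # 1 - w_true, about 0.6, so b wins
```

This is only the worst case, of course; the “overlap” point below is about why realistic hypothesis mixtures shouldn’t contain anything as adversarial as outcome_b.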

(Such an analysis might be interesting; I don’t know if anyone has investigated from that angle. But it seems somewhat unlikely to do us good, since it doesn’t seem like we can make very nice assumptions about what highly-optimized solutions look like.)

In reality, the worst-case analysis is better than this, because many of the more-plausible u_i should have a lot of “overlap” with U*; after all, they were given high weight because they appeared plausible somehow (they agreed with human intuitions, or predicted human behavior, etc.). We could try to formally define “overlap” and see what assumptions we need to guarantee better-than-worst-case outcomes. (This might even have some interesting learning-theoretic implications for value learning.)

However, this whole framing, where we assume that there’s a U* and think about its weight, is suspect. Why should we think that there’s a “true” utility function which captures our preferences? And, if there is, why should we assume that it has an explicit representation in the hypothesis space?

If we drop this assumption, we get the classical problems associated with non-realizability in Bayesian learning. Beliefs may not converge at all as evidence accumulates; they could keep oscillating due to inconsistent evidence. Under the interpretation where we still assume a “true” utility function but don’t assume that it is explicitly representable within the hypothesis space, there isn’t a clear guarantee we can get (although perhaps the “overlap” analysis can help here). If we don’t assume a true utility function at all, then it isn’t clear how to even ask questions about how well we do (although I’m not saying there isn’t a useful analysis—I’m just saying that it is unclear to me right now).

Stuart does address this question, in the end:

I’ve argued that an indescribable hellworld cannot exist. There’s a similar question as to whether there exists human uncertainty about U that cannot be included in the AI’s model of Δ. By definition, this uncertainty would be something that is currently unknown and unimaginable to us. However, I feel that it’s far more likely to exist, than the indescribable hellworld.
Still despite that issue, it seems to me that there are methods of dealing with the Goodhart problem/nearest unblocked strategy problem. And this involves properly accounting for all our uncertainty, directly or indirectly. If we do this well, there no longer remains a Goodhart problem at all.

Perhaps I agree, if “properly accounting for all our uncertainty” includes robustness properties such as calibrated learning, and if we restrict our attention to regressional Goodhart, ignoring the other three.

Well… what about the others, then?

Overcoming adversarial Goodhart seems to require randomization.

The argument here is pretty simple: adversarial Goodhart enters the domain of game theory, in which mixed strategies tend to be very useful. Quantilization is one such mixed strategy, and it seems to usefully address Goodhart to a certain extent. I’m not saying that quantilization is the ultimate solution here. But it does seem to me that quantilization is significant enough that a solution to Goodhart should say something about the class of problems which quantilization solves.

In particular, a property of quantilization which I find appealing is the way more certainty about the utility function implies that more optimization power can be safely applied to making decisions. This informs my intuition that applying arbitrarily high optimization power does not become safe simply because you’ve explicitly represented uncertainty about utility functions—no matter how accurately you’ve done so, short of “perfectly accurately” (which isn’t even a meaningful concept), it only seems to justify a limited amount of optimization pressure. This story may be an incorrect one, but if so, I’d like to really understand why it is incorrect.
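For reference, a minimal quantilizer sketch (my own illustration, not from Taylor’s original formulation): instead of taking the argmax of the proxy, sample from the base distribution conditioned on landing in the top q fraction of actions by proxy value. A smaller q means more optimization pressure, so more confidence in the proxy is needed to justify it.

```python
import numpy as np

def quantilize(proxy, base_probs, q, rng):
    """Sample an action index from the base distribution, conditioned on
    being in (roughly) the top-q fraction of actions by proxy utility."""
    order = np.argsort(proxy)[::-1]           # best proxy value first
    cum = np.cumsum(base_probs[order])
    k = max(1, int(np.searchsorted(cum, q)))  # prefix with base mass <= q
    top = order[:k]
    p = base_probs[top] / base_probs[top].sum()
    return int(rng.choice(top, p=p))

rng = np.random.default_rng(3)
proxy = rng.normal(size=100)   # possibly-Goodhartable proxy scores
base = np.full(100, 0.01)      # uniform base distribution over 100 actions

pick = quantilize(proxy, base, q=0.1, rng=rng)
# With q = 0.1 the pick is a random draw from roughly the ten
# best-by-proxy actions, rather than the single argmax.
print(pick)
```

The randomization is what limits the damage: an adversarially planted action that looks great on the proxy only gets selected with probability proportional to its base mass divided by q, rather than with certainty.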

Unlike the previous sections, this doesn’t necessarily step outside of typical Bayesian thought, since this kind of game-theoretic thinking is more or less within the purview of Bayesianism. However, the simple “Bayes solves Goodhart” story doesn’t explicitly address this.

(I haven’t addressed causal Goodhart anywhere in this essay, since it opens up the whole decision-theoretic can of worms, which seems somewhat beside the main point. (I suppose, arguably, game-theoretic concerns could be beside the point as well—but they feel more directly relevant to me, since quantilization is fairly directly about solving Goodhart.))

In summary:

  • If optimizing an arbitrary somewhat-but-not-perfectly-right utility function gives rise to serious Goodhart-related concerns, then why does a mixture distribution over such functions alleviate such concerns? Aren’t they just averaging together to yield yet another somewhat-but-not-quite-right function?

  • Regressional Goodhart seems better addressed by calibrated learning than by Bayesian learning.

  • Bayesian learning tends to require a realizability assumption in order to have good properties (including calibration).

  • Even assuming realizability, heavily optimizing a mixture distribution over possible utility functions seems dicey—it can end up throwing away all the real value if it finds a way to jointly satisfy a lot of the wrong ones. (It is possible that we can find reasonable assumptions under which this doesn’t happen, however.)

  • Overcoming adversarial Goodhart seems to require mixed strategies, which the simple “Bayesian uncertainty” story doesn’t explicitly address.