Goodhart Taxonomy: Agreement

“When a mea­sure be­comes a tar­get, it ceases to be a good mea­sure.”
-Good­hart’s Law

If you spend a while talk­ing with some­one you dis­agree with and end up agree­ing, this is a sign you are both rea­son­ing and com­mu­ni­cat­ing well—one of the pri­mary uses of good rea­son­ing is re­solv­ing dis­agree­ment. How­ever, if you use agree­ment as your main proxy for good rea­son­ing, some bad things might hap­pen.

Scott Garrabrant has helpfully laid out four different models of how things can go wrong if you optimise for the proxy really hard, a phenomenon known as ‘goodharting’ (after Goodhart’s Law: any proxy, when optimised for hard enough, stops being a good proxy). I want to take a look at each model and see what it predicts for the real world, in the domain of agreement.


First, you can fall prey to regressional goodharting. This is when the proxy you’re optimising for is a good measure of the thing you actually care about plus some noise (i.e. other uncorrelated variables), and the examples that maximise the sum of these tend to be the examples that maximise the noise. I can think of three ways this could happen: misunderstanding, spurious correlation, and shared background models.

Misunderstanding is the simple idea that, among the times when you most agree with someone, many are times when you misunderstood each other (e.g. were using words differently). Especially if it’s an important topic and most people disagree with you, suddenly the one person who gets you seems to be the best reasoner you know (if you’re regressional goodharting).

Spurious correlation is what I described in A Sketch of Good Communication: two people who have different AI timelines can keep providing each other with new evidence until they have the same 50th percentile date, but it may turn out that they have wholly different causal models behind those dates, and thus don’t meaningfully agree about AI x-risk. This is different from misunderstanding, because you heard the person correctly when they stated their belief.

And shared background models happens like this: You decide that the people who are good reasoners and communicators are those who agree with you a lot on complex issues after disagreeing initially. Often this heuristic ends up finding people who are good at understanding your point of view, and updating when you make good arguments. But if you look at those among them who agree with you the most, you’ll tend to start finding people who share a lot of background models. “Oh, you’ve got a PhD in economics too? Well obviously you can see that these two elastic goods are on the pareto frontier if there’s an efficient market. Exactly. We’re so good at communicating!”
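The regressional picture above (proxy = thing you care about + uncorrelated noise) can be made concrete with a small simulation. This is only a sketch under invented assumptions: `skill` stands in for real reasoning quality, `noise` for everything else that produces agreement (misunderstanding, shared background models and so on), and both are drawn from the same normal distribution purely for illustration.

```python
import random


def top_agreers(n_people=10_000, top_k=100, seed=0):
    """Pick the top_k people by 'agreement' and report what drives their score.

    Assumption (not from the essay): agreement = skill + noise, with both
    terms independent standard normal draws.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        skill = rng.gauss(0, 1)  # hypothetical true reasoning quality
        noise = rng.gauss(0, 1)  # misunderstanding, shared background, etc.
        people.append((skill + noise, skill, noise))

    # Optimise hard for the proxy: keep only the highest-agreement people.
    people.sort(reverse=True)
    top = people[:top_k]

    def mean(values):
        return sum(values) / len(values)

    return {
        "agreement": mean([p[0] for p in top]),
        "skill": mean([p[1] for p in top]),
        "noise": mean([p[2] for p in top]),
    }


if __name__ == "__main__":
    for name, value in top_agreers().items():
        print(f"mean {name} among top agreers: {value:.2f}")
```

Under these assumptions, the people you agree with most are selected about as strongly for noise as for skill, so their average agreement score overstates their average reasoning quality. The noisier the proxy relative to the thing you care about, the worse this gets.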


Second, you can fall prey to extremal goodharting. This is where the peaks of your heuristic are actually falling out of an entirely different process, and have no bearing at all on the thing you cared about. Here are some things you might actually get if you followed the heuristic ‘agrees with me’ to its extreme:

  • A mirror

  • A ser­vice sec­tor worker whose rule is ‘the cus­tomer is always right’

  • Some­one who trusts you a lot per­son­ally and so be­lieves what you say is true

  • A part­ner who likes the sound of your voice and knows say­ing ‘yes, go on’ causes you to talk a lot

  • An iden­ti­cal copy of you (e.g. an em­u­lated mind)

While these don’t seem like prac­ti­cal mis­takes any of us would make, I sup­pose it’s a good skill to be able to know the literal max­i­mum of the func­tion you wrote down. It can help you to not build the wrong AGI, for ex­am­ple.


But there is one par­tic­u­larly com­mon type of pro­cess that can end up be­ing spu­ri­ously high in your proxy: our third type, ad­ver­sar­ial good­hart­ing. This is where some­one no­tices that you’ve con­nected your proxy to a de­ci­sion over a large amount of re­sources, thus cre­at­ing in them an in­cen­tive to dis­guise them­selves as max­imis­ing your proxy.

You’ll of­ten in­cen­tivise the peo­ple around you to find ways to agree with you more than find­ing ways to suc­cess­fully com­mu­ni­cate. If you have a per­son who you’ve not suc­cess­fully com­mu­ni­cated with who says so, and an­other who is in the same state but pre­tends oth­er­wise, then you’ll pre­fer the liar.

People who are very flexible with their beliefs (i.e. don’t really have models) and good at sounding like they agree with you will be rewarded the most. These are yes-men: they aren’t actually people who know how to update their beliefs on a fundamental level, and their qualities do not correlate with the goal of ‘good communicator’ at all.

Ad­ver­sar­ial good­hart­ing can hap­pen even if hu­mans aren’t ex­plic­itly try­ing to be hos­tile. Sure, the liars look­ing for power will try to agree with you more, but a perfectly well-in­ten­tioned man­ager good­hart­ing on agree­ment will try to get more power, en­tirely be­cause they ob­serve it leads to them agree­ing with peo­ple more, and that this is a sign of good rea­son­ing.

This is most clear if you are a man­ager. If you’re the boss, peo­ple have to agree with you more. If you’re us­ing agree­ment as your main proxy for good com­mu­ni­ca­tion and hon­estly not at­tempt­ing to grab power, you’ll nonethe­less learn the pat­tern that tak­ing power causes you to be a good com­mu­ni­ca­tor and rea­soner. And I don’t think this is at all un­likely. I can well imag­ine this hap­pen­ing by ac­ci­dent at the level of so­cial moves. “Huh, I no­tice when I stand up and speak force­fully, peo­ple agree with me more. This must be mak­ing me a bet­ter com­mu­ni­ca­tor—I’ll do this more!”


Fourthly and finally, we come to the juiciest type of goodharting in this domain, causal goodharting. This is where you have the causal model the wrong way around: you notice that basketball players are taller than other people, so you start playing basketball in an effort to get taller.

If you causal goodhart on agreement, you don’t believe that good reasoning and communication cause agreement, but the opposite. You believe that agreement causes good reasoning. And so you try to directly increase the amount of agreement in an attempt to reason better.

It seems to me that causal good­hart­ing is best un­der­stood by the be­liefs it leads to. Here are three, fol­lowed by some bul­leted ex­am­ples of what the be­liefs can lead to.

The first be­lief is if a rea­son­ing pro­cess doesn’t lead to agree­ment, that’s a bad pro­cess.

  • You’ll con­sider an ex­tended ses­sion of dis­cus­sion (e.g. two hours, two days) to be a failure if you don’t agree at the end, a suc­cess if you do, and not mea­sure things like “I learned a bunch more about how Tom thinks about man­age­ment” as pos­i­tive hits or “It turned out we’d been mak­ing a ba­sic mis­take for hours” as nega­tive hits.

The sec­ond be­lief is if I dis­agree with some­one, I’m bad at rea­son­ing.

  • If some­one has ex­pressed un­cer­tainty about a ques­tion, I’ll be hes­i­tant to ex­press a con­fi­dent opinion, be­cause then we’ll not be in agree­ment and that means I’m a bad com­mu­ni­ca­tor.

  • If it hap­pens the other way around, and you ex­press a con­fi­dent opinion af­ter I’ve ex­pressed un­cer­tainty, I’ll feel an im­pulse to say “Well that’s cer­tainly a rea­son­able opinion” (as op­posed to “That seems like the wrong prob­a­bil­ity to have”) be­cause then it sounds like we agree at least a bit. In gen­eral, when causal good­hart­ing, peo­ple will feel un­com­fortable hav­ing opinions—if you dis­agree with some­one it’s a sig­nal you are a bad com­mu­ni­ca­tor.

  • You’ll only have opinions ei­ther when you think the trade-off is worth it (“I see that I might look silly, but no, I ac­tu­ally care that we check the ex­haust is not about to catch fire”) or when you have a so­cial stand­ing such that peo­ple will defer to you (“Ac­tu­ally, if you are an ex­pert, then your opinion in this do­main gets to be right and we will agree with you”) - that way you can be free from the worry of sig­nal­ling you’re bad at com­mu­ni­ca­tion and co­or­di­na­tion.

In my own life I’ve found that treating someone as an ‘expert’ (whether it’s someone treating me that way, or me treating someone else that way) lets that person express their opinions more, and without obfuscation or fear. It’s a social move that helps people have opinions: “Please meet my friend Jeff, who has a PhD in / has thought very carefully about X.” If I can signal that I defer to Jeff on this topic, then the social atmosphere can make similar moves, so that Jeff stops being afraid and actually thinks.

(My friend notes that sometimes this goes the other way around: people can be much more cautious when they feel they’re being tested on knowledge they’re supposed to be an expert on. This is true. I was talking about the opposite effect, which happens much more when the person is around a group of laymen without their expertise.)

The third be­lief is if some­one isn’t try­ing to agree, they’re bad at rea­son­ing.

  • I’m of­ten in situ­a­tions where, if at the end of the con­ver­sa­tion peo­ple don’t say things like “Well you made good points, I’ll have to think about them” or “I sup­pose I’ll up­date to­ward your po­si­tion then” they’re called ‘bad at com­mu­ni­cat­ing’.

(An umeshism I have used many times in the past: if you never end hangouts with friends saying “Well you explained your view many times and I asked clarifying questions but I still don’t understand your perspective”, then you’re not trying hard enough to understand the world, and are instead caring too much about signalling agreement.)

  • If you dis­agree with some­one who sig­nals that they are con­fi­dent in their be­lief and are un­likely to change their mind, you’ll con­sider this a sign of be­ing bad at com­mu­ni­ca­tion, even if they send other sig­nals of hav­ing a good model of why you be­lieve what you be­lieve. Ba­si­cally, peo­ple who are right and con­fi­dent that they’re right, can end up look­ing like bad rea­son­ers. The nor­mal word I see used for these peo­ple is ‘over­con­fi­dent’.

(It’s really weird to me how often people judge others as overconfident after having one disagreement with them. Overconfidence is surely something you can only tell about a person after observing tens of judgements.)

A general counterargument to many of these points is that all of these are genuine signs of bad reasoning or bad communication. They are more likely to be seen in a world where you or I are a bad reasoner than in one where we’re not, so they are Bayesian evidence. But the problem I’m pointing to is that, if your model only uses this heuristic, or if it takes it as the most important heuristic that accounts for 99% of the variance, then it will fail hard on these edge cases.

To take the last example in the scattershot list of causal goodharting, you might label someone who is reasoning perfectly correctly as overconfident. To jump back to regressional goodharting, you’ll put someone who just has the same background models as you at the top of your list of good reasoners.
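The point that a disagreement is Bayesian evidence of bad reasoning, but should not dominate your judgement, can be illustrated with a single update. This is a sketch only: the function name `posterior_bad_reasoner` and every probability below are invented for illustration and are not estimates of anything.

```python
def posterior_bad_reasoner(prior, p_sign_given_bad, p_sign_given_good):
    """Bayes' rule for P(bad reasoner | observed sign).

    prior:             P(bad reasoner) before observing anything
    p_sign_given_bad:  P(sign | bad reasoner), e.g. "we failed to agree"
    p_sign_given_good: P(sign | good reasoner)
    """
    numerator = prior * p_sign_given_bad
    evidence = numerator + (1 - prior) * p_sign_given_good
    return numerator / evidence


if __name__ == "__main__":
    # Hypothetical numbers: failing to agree is somewhat more likely with a
    # bad reasoner (0.6) than with a good one (0.4).
    updated = posterior_bad_reasoner(
        prior=0.2, p_sign_given_bad=0.6, p_sign_given_good=0.4
    )
    print(f"P(bad reasoner | one disagreement) = {updated:.3f}")
```

With these made-up numbers the sign moves the estimate, but only modestly. Treating one disagreement as near-conclusive amounts to assuming a far larger likelihood ratio than a single conversation can support.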

Overall, I have many worries about the prospect of using agreement as a proxy for good reasoning. I’m not sure of the exact remedy, though one rule I often follow is: Respect people by giving them your time, not your deference. If I think someone is a good reasoner, I will spend hours or days trying to understand the gears of their model, and disagreeing with them fiercely if we’re in person. Then at the end, after learning as much as I can, I’ll use whatever moving-parts model I eventually understand, using all the evidence I’ve learned from them and from other sources. But I won’t just repeat something because someone said it.

My thanks to Mah­moud Ghanem for read­ing drafts.
