Goodhart’s Law states that when a proxy for some value becomes the target of optimization pressure, it ceases to be a good proxy. One form of Goodhart is demonstrated by the Soviet story of a factory graded on how many shoes it produced (ordinarily a good proxy for productivity) – it soon began churning out huge numbers of tiny shoes. Useless, but the numbers look good.
Goodhart’s Law is of particular relevance to AI alignment. Suppose you have something that is generally a good proxy for “the stuff that humans care about”. It would still be dangerous to have a powerful AI optimize for that proxy: in accordance with Goodhart’s Law, the proxy will break down under the optimization pressure.
In Goodhart Taxonomy, Scott Garrabrant identifies four kinds of Goodharting:
Regressional Goodhart—When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal.
Causal Goodhart—When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal.
Extremal Goodhart—Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed.
Adversarial Goodhart—When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal.
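Regressional Goodhart, the mildest of the four, can be seen in a small simulation. The sketch below (illustrative only; the variable names and parameters are my own, not from Garrabrant’s taxonomy) models the proxy as the true goal plus independent noise: selecting the items with the highest proxy values also selects for high noise, so the selected items’ true value predictably falls short of their proxy value.

```python
import random

random.seed(0)

# Model: proxy = goal + independent noise.
# Selecting hard on the proxy selects for both the goal AND the noise,
# so the winners' true goal value regresses toward the mean.
n = 100_000
goals = [random.gauss(0, 1) for _ in range(n)]
proxies = [g + random.gauss(0, 1) for g in goals]

# Select the top 1% of items by proxy value.
top = sorted(range(n), key=lambda i: proxies[i], reverse=True)[: n // 100]

mean_proxy = sum(proxies[i] for i in top) / len(top)
mean_goal = sum(goals[i] for i in top) / len(top)

print(f"mean proxy of selected items: {mean_proxy:.2f}")
print(f"mean goal  of selected items: {mean_goal:.2f}")
```

Running this, the selected items score well on the true goal (selection still helps), but systematically worse than their proxy scores suggest: the gap is exactly the “difference between the proxy and the goal” that the selection pressure also optimized for.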