nostalgebraist: Recursive Goodhart’s Law

Kaj_Sotala 26 Aug 2020 11:07 UTC

53 points

Outer Alignment AI Goodhart's Law Planning & Decision-Making Rationality

Would kind of like to excerpt the whole post, but that feels impolite, so I’ll just quote the first four paragraphs and then suggest reading the whole thing:

There’s this funny thing about Goodhart’s Law, where it’s easy to say “being affected by Goodhart’s Law is bad” and “it’s better to behave in ways that aren’t as subject to Goodhart’s Law,” but it can be very hard to explain why these things are true to someone who doesn’t already agree.

Why? Because any such explanation is going to involve some step where you say, “see, if you do that, the results are worse.” But this requires some standard by which we can judge results … and any such standard, when examined closely enough, has Goodhart problems of its own.

There are times when you can’t convince someone without a formal example or something that amounts to one, something where you can say “see, Alice’s cautious heuristic strategy wins her $X while Bob’s strategy of computing the global optimum under his world model only wins him the smaller $Y, which is objectively worse!”

But if you’ve gotten to this point, you’ve conceded that there’s some function whose global optimum is the one true target. It’s hard to talk about Goodhart at all without something like this in the background – how can “the metric fails to capture the true target” be the problem unless there is some true target?

27 comments, 1 min read


Dagon 26 Aug 2020 17:42 UTC
20 points
0
There are two related, but distinct problems under the heading of Goodhart, both caused by the hard-to-deny fact that any practical metric is only a proxy/estimate/correlate of the actual true goal.
1) anti-inductive behavior. When a metric gets well-known as a control input, mis-aligned agents can spoof their behavior to take advantage.
2) divergence of legible metric from true desire. Typically this increases over time, or over the experienced range of the metric.
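A minimal sketch of the second point, assuming (hypothetically) that the legible metric equals the true value plus independent Gaussian noise. The names `true_value` and `proxy` and the noise scale are illustrative and not from the thread. Under that assumption, candidates with the highest proxy scores tend to overstate their true value by much more than candidates in the middle of the range.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical setup: each candidate has a hidden true value, and the
# legible metric (proxy) is that value plus independent noise.
candidates = []
for _ in range(10_000):
    true_value = random.gauss(0.0, 1.0)
    proxy = true_value + random.gauss(0.0, 1.0)
    candidates.append((proxy, true_value))

def mean_gap(rows):
    """Average of (proxy - true value) over the given candidates."""
    return sum(p - t for p, t in rows) / len(rows)

candidates.sort(key=lambda row: row[0])  # ascending by proxy score
middle = candidates[4_500:5_500]         # ordinary proxy scores
top = candidates[-100:]                  # the 100 highest proxy scores

gap_middle = mean_gap(middle)
gap_top = mean_gap(top)
print(f"mean gap, middle of the range: {gap_middle:.2f}")
print(f"mean gap, top 1% by proxy:     {gap_top:.2f}")
```

This only illustrates the noise-driven form of divergence; the anti-inductive case in point 1 would need a model of agents reacting to the metric.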
I think you’re right that there is an underlying assumption that such a true goal exists. I don’t think there’s any reason to believe that it’s an understandable/legible function for human brains or technology. It could be polynomial or not, and it could have millions or billions of terms. In any actual human (and possibly in any embodied agent), it’s only partly specified, and even that partial specification isn’t fully accessible to introspection.
“it’s better to behave in ways that aren’t as subject to Goodhart’s Law,”
Specifics matter. It’s better to behave in ways that give better outcomes, but it’s not obvious at all what those ways are. Even for ways known to be affected by Goodhart’s law, there is SOME reason to believe they’re beneficial: Goodhart isn’t (necessarily) a reversal of sign, only a loss of correlation with actual desires.
it can be very hard to explain why these things are true
Again, specifics matter. “Goodhart exists, and here’s how it might apply to your proposed metrics” has been an _easy_ discussion every time I’ve had it. I literally have never had a serious objection to the concept. What’s hard is figuring out checksum metrics or triggers to re-evaluate the control inputs. Defining the range and context inside of which a given measurement is “good enough” takes work, but is generally achievable (in the real world; perhaps not for theoretical AI work).
- Davidmanheim 27 Aug 2020 10:36 UTC
  6 points
  0
  Parent
  Strongly agree—and Goodhart’s law is at least 4 things. Though I’d note that anti-inductive behavior / metric gaming is hard to separate from goal mis-specification, for exactly the reasons outlined in the post.
  But saying there is a goal too complex to be understandable and legible implies that it’s really complex, but coherent. I don’t think that’s the case for individuals, and I’m certain it isn’t true of groups. (Arrow’s theorem, etc.)
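The group case can be made concrete with a Condorcet cycle, the standard small example behind results like Arrow's theorem. A minimal sketch, assuming three hypothetical voters with the rankings below (not from the thread): each voter is individually coherent, yet pairwise majority voting produces a cycle, so no single ranking represents the group's preferences.

```python
from itertools import permutations

# Hypothetical voters, each with a coherent ranking (best option first).
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]

def majority_prefers(x, y):
    """True if a strict majority of voters rank x above y."""
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

def consistent_orderings():
    """All total orders that agree with every pairwise majority verdict."""
    found = []
    for order in permutations("ABC"):
        pairs = [(order[i], order[j]) for i in range(3) for j in range(i + 1, 3)]
        if all(majority_prefers(x, y) for x, y in pairs):
            found.append(order)
    return found

print(majority_prefers("A", "B"))  # True: voters 1 and 3 rank A above B
print(majority_prefers("B", "C"))  # True: voters 1 and 2 rank B above C
print(majority_prefers("C", "A"))  # True: voters 2 and 3 rank C above A
print(consistent_orderings())      # []: no ranking fits all three verdicts
```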
  - Dagon 27 Aug 2020 14:15 UTC
    1 point
    0
    Parent
    But saying there is a goal too complex to be understandable and legible implies that it’s really complex, but coherent
    I’m not sure it’s possible to distinguish between chaotically-complex and incoherent. Once you add reference class problems in (you can’t step in the same river twice; no two decisions are exactly identical), there’s no difference between “inconsistent” and “unknown terms with large exponents on unmeasured variables”.
    But in any case, even without coherence/consistency across agents or over time, any given decision can be an optimization of something.
    [ I should probably add an epistemic status: not sure this is a useful model, but I do suspect there are areas it maps to the territory well. ]
    - Davidmanheim 28 Aug 2020 7:46 UTC
      2 points
      0
      Parent
      I’d agree with the epistemic warning ;)
      I don’t think the model is useful, since it’s non-predictive. And we have good reasons to think that human brains are actually incoherent. Which means I’m skeptical that there is something useful to find by fitting a complex model to find a coherent fit for an incoherent system.
      - abramdemski 28 Aug 2020 18:02 UTC
        6 points
        0
        Parent
        I think (1) Dagon is right that if we consider a purely behavioral perspective the distinction gets meaningless at the boundaries, trying to distinguish between highly complex values vs incoherence; any set of actions can be justified via some values; (2) humans are incoherent, in the sense that there are strong candidate partial specifications of our values (most of us like food and sex) and we’re not always the most sensible in how we go about achieving them; (3) also, to the extent that humans can be said to have values, they’re highly complex.
        The thing that makes these three statements consistent is that we use more than just a behavioral lens to judge “human values”.
abramdemski 26 Aug 2020 18:03 UTC
12 points
0
I really like this.
And I want to advise this person to sort of “chill out” in a way closely analogous to the way the hypothetical AI should “chill out” – a little more formally, I want to advise them to use a strategy that works better under model mis-specification, a strategy that doesn’t try to “jump to the end” and do the best thing without a continuous traversal of the intervening territory.
I like the analogy of “traversal of the intervening territory”, even though, like the author, I don’t know what it formally means.
Unlike the author, I do have some models of what it means to lack a utility function, but optimize anyway. Within such a model, I would say: it’s perfectly fine and possible to come up with a function which approximately represents your preferences, but holding on to such a function even after your preferences have updated away from it leads to a Goodhart-like risk.
More generally, it’s not the case that literally everything is well-understood as optimization of some function. There are lots of broadly intelligent processes that aren’t best-understood as optimization.
- Dagon 26 Aug 2020 21:50 UTC
  4 points
  0
  Parent
  There are lots of broadly intelligent processes that aren’t best-understood as optimization.
  I’d love to hear some examples, and start a catalog of “useful understandings of intelligent processes”. I believe that control/feedback mechanisms, evolution, the decision theories tossed about here (CDT, TDT, UDT, etc.), and VNM-compliant agents generally are all optimizers, though not all with the same complexity or capabilities.
  Humans aren’t VNM-rational agents over time, but I believe each instantaneous decision is an optimization calculation within the brain.
  - abramdemski 27 Aug 2020 16:08 UTC
    20 points
    0
    Parent
    Logic isn’t well-understood as optimization.
    Bayesian updates can sort of be thought of as optimization, but we can also notice the disanalogies. Bayesian updates aren’t really about maximizing something, but rather, about proportional response.
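One way to see the disanalogy, as a minimal sketch with made-up numbers (the hypotheses, prior, and likelihoods are assumptions for illustration): a Bayesian update rescales every hypothesis in proportion to how well it predicted the evidence, whereas a maximizer would keep only the single best-scoring hypothesis.

```python
# Hypothetical prior over three hypotheses and the likelihood each
# assigns to the observed evidence.
prior = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihood = {"h1": 0.2, "h2": 0.6, "h3": 0.4}

def bayes_update(prior, likelihood):
    """Posterior is prior times likelihood, renormalized to sum to 1."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}

def argmax_update(prior, likelihood):
    """Maximizing alternative: all weight goes to the best-scoring hypothesis."""
    best = max(prior, key=lambda h: prior[h] * likelihood[h])
    return {h: (1.0 if h == best else 0.0) for h in prior}

posterior = bayes_update(prior, likelihood)
winner_takes_all = argmax_update(prior, likelihood)
print(posterior)         # every hypothesis keeps some weight
print(winner_takes_all)  # only h2 keeps weight
```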
    Radical Probabilism is even less about optimizing something.
    So, taking those three together, it’s worth considering the claim that epistemics isn’t about optimization.
    This calls into question how much of machine learning, which is apparently about optimization, should really be taken as optimization or merely optimization-like.
    The goal is never really to maximize some function, but rather to “generalize well”, a task for which we can only ever give proxies.
    This fact influences the algorithms. Most overtly, regularization techniques are used to soften the optimization. Some of these are penalties added to the loss function, which keeps us squarely in the realm of optimizing some function. However, other techniques, such as dropout, do not take this form. Furthermore, we can think of the overall selection applied to ML techniques as selecting for generalization ability (not for ability to optimize some objective).
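A minimal sketch of the contrast, in plain Python with hypothetical names (`l2_penalized_loss`, `dropout`) and toy numbers. An L2 penalty is just another term in the function being minimized, while dropout perturbs the computation itself by randomly zeroing activations. The dropout shown is the common "inverted" variant, which rescales surviving activations by 1 / (1 - rate).

```python
import random

def l2_penalized_loss(weights, data_loss, strength=0.1):
    """Regularization as a penalty: still a single function to minimize."""
    return data_loss + strength * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and scale the survivors so the expected value is unchanged."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

weights = [0.5, -1.0, 2.0]
print(l2_penalized_loss(weights, data_loss=1.0))  # 1.0 + 0.1 * 5.25

rng = random.Random(0)  # seeded generator so runs are repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
print(dropout(hidden, rate=0.5, rng=rng))  # each entry is 0.0 or doubled
```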
    Decision theories such as you name needn’t be formulated as optimizing some function in general, although that is the standard way to formulate them. Granted, they’re still going to argmax a quantity to decide on an action at the end of the day, but is that really enough to call the whole thing “optimization”? There’s a whole lot going on that’s not just optimization.
    There are a lot of ways in which evolution doesn’t fit the “optimization” frame, either. Many of those ways can be thought of merely as evolution being a poor optimizer. But I think a lot of it is also that evolution has the “epistemic” aspect—the remarkable thing about evolution isn’t how well it achieves some measurable objective, but rather its “generalization ability” (like Bayes/ML/etc). So maybe the optimization frame for evolution is somewhat good, but an epistemic/learning frame is also somewhat good.
    I think once one begins to enter this alternative frame where lots of things aren’t optimization, it starts to become apparent that “hardly anything is just optimization”—IE, understanding something as optimization often hardly explains anything about it, and there are often other frames which would explain much more.
    - Gordon Seidoh Worley 27 Aug 2020 21:49 UTC
      2 points
      0
      Parent
      I think once one begins to enter this alternative frame where lots of things aren’t optimization, it starts to become apparent that “hardly anything is just optimization”—IE, understanding something as optimization often hardly explains anything about it, and there are often other frames which would explain much more.
      I guess it depends on whether you want to keep “optimization” as a referent to the general motion that is making the world more likely to be one way than another or a specific type of making the world more likely to be one way rather than another. I think the former is more of a natural category for the types of things most people seem to mean by optimizing.
      None of this is to say, though, that there aren’t many processes where the optimization framing is not very useful. For example, you mention logic and Bayesian updating as examples, and that sounds right to me, because those are processes operating over the map rather than the territory (even if they are meant to be grounded in the territory), and when you only care about the map it doesn’t make much sense to talk about taking actions to make the world one way rather than another, because there is only one consistent way the world can be within the system of a particular map.
      - abramdemski 28 Aug 2020 15:36 UTC
        2 points
        0
        Parent
        I guess it depends on whether you want to keep “optimization” as a referent to the general motion that is making the world more likely to be one way than another or a specific type of making the world more likely to be one way rather than another.
        I suspect you’re trying to gesture at a slightly better definition here than the one you give, but since I’m currently in the business of arguing that we should be precise about what we mean by ‘optimization’… what do you mean here?
        Just about any element of the world will “make the world more likely to be one way rather than another”.
        Gordon Seidoh Worley 29 Aug 2020 1:22 UTC
        2 points
        0
        Parent
        Yeah, if I want to be precise, I mean anytime there is a feedback loop there is optimization.
        abramdemski 30 Aug 2020 8:29 UTC
        4 points
        0
        Parent
        That does seem better. But I don’t think it fills the shoes of the general notion of optimization people use.
        
        Either you mean negative feedback loops specifically, or you mean to include both negative and positive feedback loops. But both choices seem a little problematic to me.
        Negative: this seems to imply that all optimization is a kind of homeostasis. But it seems as if some optimization, at least, can be described as “more is better” (ie the more typical utility framing). It’s hard to see how to characterize that as a negative feedback loop.
        Both: this would include all positive feedback loops as well. But I think not all positive feedback loops have the optimization flavor. For example, there is a positive feedback loop between how reflective a water-rich planet’s surface is and how much snow forms. Snow makes the surface more reflective, which makes it absorb less heat, which makes it colder, which makes more snow form. Idk, maybe it’s fine for this to count as optimization? I’m not sure.
        Does guess-and-check search even count as a positive feedback loop? It’s definitely optimization in the broad sense, but finding one improvement doesn’t help you find another, because you’re just sampling each next point to check independently. So it doesn’t seem to fit that well into the feedback loop model.
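A minimal sketch of guess-and-check (pure random search) over a hypothetical one-dimensional objective; `objective` and the search range are assumptions for illustration. Each guess is drawn independently of every earlier guess, so the only state carried forward is the best point seen so far, which is why it sits awkwardly in a feedback-loop model.

```python
import random

def objective(x):
    """Hypothetical score to maximize; the peak is at x = 3."""
    return -(x - 3.0) ** 2

def guess_and_check(n_guesses, rng):
    """Sample points independently and remember the best one seen."""
    best_x, best_score = None, float("-inf")
    for _ in range(n_guesses):
        x = rng.uniform(-10.0, 10.0)  # does not depend on earlier guesses
        score = objective(x)
        if score > best_score:        # the "check" step
            best_x, best_score = x, score
    return best_x, best_score

rng = random.Random(0)
best_x, best_score = guess_and_check(1_000, rng)
print(f"best x found: {best_x:.3f}, score: {best_score:.5f}")
```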
        
        Feedback loops seem to require measurable (observable) targets. But broadly speaking, optimization can have non-measurable targets. An agent can steer the world based on a distant goal, such as aiming for a future with a flourishing galactic civilization.
        Now, it’s possible that the feedback loop model can recover from this one by pointing to a (positive) feedback loop within the agent’s head, where the agent is continually searching for better plans to implement. However, I believe this is not always going to be the case. The inside of an agent’s head can be a weird place. The “agent” could just be implementing a hardwired policy which indeed steers towards a distant goal, but which doesn’t do it through any simple feedback loop. (Of course there is still, broadly speaking, feedback between the agent and the environment. But the same could be said of a rock.) I think this counts as the broad notion of optimization, because it would count under Eliezer’s definition—the agent makes the world end up surprisingly high in the metric, even though there’s no feedback loop between it and the metric, nor an internal feedback loop involving a representation/expectation for the metric.
        
        Gordon Seidoh Worley 30 Aug 2020 21:33 UTC
        2 points
        0
        Parent
        I’m pretty happy to count all these things as optimization. Much of the issue I find with the feedback loop definition is, as you point out, the difficulty of figuring out things like “is there a loop here?”, which suggests there might be a better, more general model for what I’ve been pointing to. I use “feedback loop” because it’s simply the closest, most general model I know. Which actually points back to the way I phrased it before, which isn’t formalized but I think does come closer to expansively capturing all the things I think make sense to group together as “optimization”.
    - Dagon 27 Aug 2020 17:32 UTC
      2 points
      0
      Parent
      Awesome—I think I agree with most of this. Specifically, https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions is very compatible with the possibility that the function is too complex for any agent to actually compute. It’s quite likely that there are more potential worlds than an agent can rank, with more features than an agent can measure.
      Any feasible comparison of potential worlds is actually a comparison of predicted summaries of those worlds. Both the prediction and the summary are lossy, thus that aspect of Goodhart’s law.
      I did not mean to say that “everything is an optimization process”. I did mean to say that decisions are an optimization process, and I now realize even that’s too strong. I suspect all I can actually assert is that “intentionality is an optimization process”.
      - abramdemski 27 Aug 2020 18:20 UTC
        4 points
        0
        Parent
        I did not mean to say that “everything is an optimization process”. I did mean to say that decisions are an optimization process, and I now realize even that’s too strong. I suspect all I can actually assert is that “intentionality is an optimization process”.
        Oh, I didn’t mean to accuse you of that. It’s more that this is a common implicit frame of reference (including/especially on LW).
        I rather suspect the correct direction is to break down “optimization” into more careful concepts (starting, but not finishing, with something like selection vs control).
    - Dagon 27 Aug 2020 17:12 UTC
      2 points
      0
      Parent
      epistemics isn’t about optimization.
      On this, we’re fully agreed. Epistemics may be pre-optimization, or may not.
romeostevensit 27 Aug 2020 0:29 UTC
8 points
0
I am fond of describing Buddhism as un-goodharting yourself all the way down.
In brief, you are a Ship of Theseus at sea (or a Ship of Neurath, as Quine coined it after Otto Neurath, who first made the analogy), navigating by lighthouses that you should not steer directly towards (that’s not how lighthouses work!) and avoiding whirlpools (attractors in the space of goal architectures that lock you in to a specific destination).
Gordon Seidoh Worley 26 Aug 2020 19:00 UTC
7 points
0
I agree and am glad this is getting upvotes, but for what it’s worth I made exactly the same point a year ago and several people were resistant to the core idea, so this is probably not an easily won insight.
- Kaj_Sotala 26 Aug 2020 20:34 UTC
  4 points
  0
  Parent
  Could you elaborate on that? The two posts seem to be talking about different things as far as I can tell: e.g. nostalgebraist doesn’t say anything about the Optimizer’s Curse, whereas your post relies on it.
  I do see that there are a few paragraphs that seem to reach similar conclusions (both say that overly aggressive optimization of any target is bad), but the reasoning used for reaching that conclusion seems different.
  (By the way, I don’t quite get your efficiency example? I interpret it as saying that you spent a lot of time and effort on optimizations that didn’t pay themselves back. I guess you might mean something like “I had a biased estimate of how much time my optimizations would save, so I chose expensive optimizations that turned out to be less effective than I thought.” But the example already suggests that you knew beforehand that the time saved would be on the order of a minute or so, so I’m not sure how the example is about Goodhart’s Curse.)
  - Gordon Seidoh Worley 26 Aug 2020 20:53 UTC
    5 points
    0
    Parent
    It’s mostly explicated down in the comments on the post where people started getting confused about just how integral the act of measuring is to doing anything. When I wrote the post I considered the point obvious enough to not need to be argued on its own, until I hit the comments.
    (On the example, I was a short-sighted optimizer.)
Shmi 27 Aug 2020 6:28 UTC
4 points
0
In this situation Goodhart is basically open-loop optimization. An EE analogy would be a high gain op amp with no feedback circuit. The result is predictable: you end up optimized out of the linear mode and into saturation.
You can’t explicitly optimize for something you don’t know. And you don’t know what you really want. You might think you do, but, as usual, beware what you wish for. I don’t know if an AI can form a reasonable terminal goal to optimize, but humans surely cannot. Given that some 90% of our brain/mind is not available to introspection, all we have to go by is the vague feeling of “this feels right” or “this is fishy but I cannot put my finger on why”. That’s why cautiously iterating with periodic feedback is so essential, and open-loop optimization is bound to get you to all the wrong places.
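The op-amp analogy can be sketched numerically. This is an idealized model with assumed values (open-loop gain of 100,000, supply rails at ±12 V, feedback fraction 0.1), not a circuit simulation. Without feedback, even a millivolt-scale input drives the output to the rail; with negative feedback, the output settles near the input divided by the feedback fraction.

```python
OPEN_LOOP_GAIN = 100_000.0  # assumed gain of an idealized op amp
RAIL = 12.0                 # assumed supply rails of +/-12 V

def clamp(v):
    """The output cannot exceed the supply rails (saturation)."""
    return max(-RAIL, min(RAIL, v))

def open_loop(v_in):
    """No feedback: the full gain is applied to the input."""
    return clamp(OPEN_LOOP_GAIN * v_in)

def closed_loop(v_in, feedback_fraction=0.1):
    """Negative feedback: v_out = A * (v_in - beta * v_out), solved for v_out."""
    a, beta = OPEN_LOOP_GAIN, feedback_fraction
    return clamp(a * v_in / (1.0 + a * beta))

v_in = 0.01  # a 10 mV input
print(open_loop(v_in))    # pinned at the +12 V rail
print(closed_loop(v_in))  # close to v_in / 0.1 = 0.1 V
```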
betulaster 29 Aug 2020 4:28 UTC
3 points
0
This post reminds me of an insight from one of my uni professors.
Early on at university, I was very frustrated that the skills that were taught to us did not seem to be immediately applicable to the real world. That frustration was strong enough to snuff out most of the interest I had for studying genuinely (that is, to truly understand and internalize the concepts taught to us). Still, studying was expensive, dropping out was not an option, and I had to pass exams, which is why very early on I started, in what seemed to me to be a classic instance of Goodhart, to game the system: test banks were widely circulated among students, and for the classes with no test banks, there were past exams, which you could go through, trace out some kind of pattern for which topics and kinds of problems the prof puts on exams, and focus only on studying those. I didn’t know it was called “Goodhart” back then, but the significance of this was not lost on me; I felt that by pivoting away from learning subjects and towards learning to pass exams in subjects, I was intellectually cheating. Sure, I was not hiding crib sheets in my sleeves or going to the restroom to look something up on my phone, but it was still gaming the system.
Later on, when I got rather friendly with one of my profs, and extremely worn down by pressures from my probability calculus course, I admitted to him that this was what I was doing, that I felt guilty, and didn’t feel able to pass any other way and felt like a fake. He said something to the effect of “Do you think we don’t know this? Most students study this way, and that’s fine. The characteristic of a well-structured exam isn’t that it does not allow cheating, it’s that it only allows cheating that is intelligent enough that a successful cheater would have been able to pass fairly.”
What he said was essentially a refutation of Goodhart’s Law by a sufficiently high-quality proxy. I think this might be relevant to the case you’re dealing with here as well. Your “true” global optimum probably is a proxy, but if it’s a well-chosen one, it need not be vulnerable to Goodhart.
Brendan Long 26 Aug 2020 15:08 UTC
3 points
0
The author seems to be skipping a step in their argument. I thought Goodhart’s Law was about how it’s hard to specify a measurable target which exactly matches your true goal, not that true goals don’t exist.

For example, if I wanted to donate to a COVID-19 charity, I might pick one with the measurable goal of reducing the official case numbers... and they could spend all of their money bribing people to not report cases or making testing harder. Or if they’re an AI, they could hit this goal perfectly by killing all humans. But just because this goal (and probably all easily measurable goals) is Goodhartable doesn’t mean all possible goals are. The thing I actually want is still well defined (I want actual COVID-19 cases to decrease and I want the method to pass a filter defined by my brain), it’s just that the real fundamental thing I want is impossible to measure.
- abramdemski 26 Aug 2020 18:12 UTC
  10 points
  0
  Parent
  The author seems to be skipping a step in their argument. I thought Goodhart’s Law was about how it’s hard to specify a measurable target which exactly matches your true goal, not that true goals don’t exist.
  But this is the point. That’s why it was titled “Recursive Goodhart’s Law”, the idea being that at any point where you explicitly point to a “true goal” and a “proxy”, you’ve probably actually written down two different proxies of differing quality. So you can keep trying to write down ever-more-faithful proxies, or you can “admit defeat” and attempt to make do without an explicitly written down function.
  And the author explicitly admits that they don’t have a good way to convince people of this, so, yeah, they’re missing a step in their argument. They’re more saying some things that are true and less trying to convince.
  As for whether it’s true—yeah, this is basically the whole value specification problem in AI alignment.
  I agree that Goodhart isn’t just about “proxies”, it’s more specifically about “measurable proxies”, and the post isn’t really engaging with that aspect. But I think that’s fine. There’s also a Goodhart problem wrt proxies more generally.
  - Davidmanheim 27 Aug 2020 10:44 UTC
    2 points
    0
    Parent
    I talked about this in terms of “underspecified goals”: often, the true goal doesn’t exist clearly, and may not be coherent. Until that’s fixed, the problem isn’t really Goodhart, it’s just sucking at deciding what you want.
    I’m thinking of a young kid in a candy store who has $1, and wants everything, and can’t get it. What metric for choosing what to purchase will make them happy? Answer: There isn’t one. What they want is too unclear for them to be happy. So I can tell you in advance that they’re going to have a tantrum later about wanting to have done something else no matter what happens now. That’s not because they picked the wrong goal, it’s because their desires aren’t coherent.
- Kaj_Sotala 26 Aug 2020 17:20 UTC
  6 points
  0
  Parent
  But “COVID-19 cases decreasing” is probably not your ultimate goal: more likely, it’s an instrumental goal for something like “prevent humans from dying” or “help society” or whatever… in other words, it’s a proxy for some other value. And if you walk back the chain of goals enough, you are likely to arrive at something that isn’t well defined anymore.
Charlie Steiner 26 Aug 2020 16:17 UTC
1 point
0
Yup. Humans have a sort of useful insanity, where they can expect things to be bad not based on explicitly evaluating the consequences, but off of a model or heuristic about what to expect from different strategies. And then we somehow only apply this reasoning selectively, where it seems appropriate according to even more heuristics.