You make a claim that’s very close to that—your claim, if I understand correctly, is that MIRI thought AI wouldn’t understand human values and also not lie to us about it (or otherwise decide to give misleading or unhelpful outputs):
The key difference between the value identification/specification problem and the problem of getting an AI to understand human values is the transparency and legibility of how the values are represented: if you solve the problem of value identification, that means you have an actual function that can tell you the value of any outcome (which you could then, hypothetically, hook up to a generic function maximizer to get a benevolent AI). If you get an AI that merely understands human values, you can’t necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
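The distinction the quoted paragraph draws can be sketched in code. This is my own toy illustration, not anything from the post: the outcome strings, scores, and class names are all invented. The point is just that an explicit value function can be handed to a generic maximizer, while a system that merely *contains* value knowledge internally gives you nothing to hook up.

```python
def explicit_value(outcome: str) -> float:
    """A legible specification: we can query it on any outcome and audit the code."""
    scores = {
        "cauldron filled, nothing else disturbed": 1.0,
        "cauldron filled, workshop flooded": -10.0,
    }
    return scores.get(outcome, 0.0)


class OpaqueModel:
    """Stands in for an AI that merely *understands* values: the judgement exists
    internally, but we only see whatever output it chooses to emit."""

    def _internal_judgement(self, outcome: str) -> float:
        return explicit_value(outcome)  # the knowledge is in there...

    def ask(self, outcome: str):
        return None  # ...but the system may stay silent (or misreport) when queried


def best_outcome(value_fn, outcomes):
    """The 'hook it up to a generic function maximizer' step: it only works if
    value_fn is an actual function that returns a value for every outcome."""
    return max(outcomes, key=value_fn)
```

`best_outcome(explicit_value, ...)` works as intended, while `OpaqueModel().ask` can't be used as the key: its answers aren't reliable judgements, which is the gap between value identification and mere value understanding.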
I think this is similar enough (and false for the same reasons) that I don’t think the responses are misrepresenting you that badly. Of course I might also be misunderstanding you, but I did read the relevant parts multiple times to make sure, so I don’t think it makes sense to blame your readers for the misunderstanding.
I think you’re misunderstanding the paragraph you’re quoting. I read Matthew, in that paragraph, as acknowledging the difference between the two problems, and saying that MIRI thought value specification (not value understanding) was much harder than it’s looking to actually be.
I think we agree—that sounds like it matches what I think Matthew is saying.
Hmm, you say “your claim, if I understand correctly, is that MIRI thought AI wouldn’t understand human values”. I’m disagreeing with this. I think Matthew isn’t claiming that MIRI thought AI wouldn’t understand human values.
I think maybe there’s a parenthesis issue here :)
I’m saying “your claim, if I understand correctly, is that MIRI thought AI wouldn’t (understand human values and also not lie to us)”.
Okay, that clears things up a bit, thanks. :) (And sorry for delayed reply. Was stuck in family functions for a couple days.)
This framing feels a bit wrong/confusing for several reasons.
I guess by “lie to us” you mean act nice on the training distribution, waiting for a chance to take over the world while off distribution. I just … don’t believe GPT-4 is doing this; it seems highly implausible to me, in large part because I don’t think GPT-4 is clever enough that it could keep up the veneer until it’s ready to strike if that were the case.
The term “lie to us” suggests all GPT-4 does is say things, and we don’t know how it’ll “behave” when we finally trust it and give it some ability to act. But it only “says things” in the same sense that our brain only “emits information”. GPT-4 is now hooked up to web searches, code writing, etc. But maybe I misunderstand the sense in which you think GPT-4 is lying to us?
I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.
(If I’ve misunderstood your point, sorry! Please feel free to clarify and I’ll try to engage with what you actually meant.)
I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.
As someone who worked closely with Eliezer and Nate at the time, including working with Eliezer and Nate on our main write-ups that used the cauldron example, I can say that this is definitely not what we were thinking at the time. Rather:
The point was to illustrate a weird gap in the expressiveness and coherence of our theories of rational agency: “fill a bucket of water” seems like a simple enough task, but it’s bizarrely difficult to just write down a simple formal description of an optimization process that predictably does this (without any major side-effects, etc.).
(We can obviously stipulate “this thing is smart enough to do the thing we want, but too dumb to do anything dangerous”, but the relevant notion of “smart enough” is not itself formal; we don’t understand optimization well enough to formally define agents that have all the cognitive abilities we want and none of the abilities we don’t want.)
The point of emphasizing “holy shit, this seems so easy and simple and yet we don’t see a way to do it!” wasn’t to issue a challenge to capabilities researchers to go cobble together a real-world AI that can fill a bucket of water without destroying the world. The point was to emphasize that corrigibility, low-impact problem-solving, ‘real’ satisficing behavior, etc. seem conceptually simple, and yet the concepts have no known formalism.
The hope was that someone would see the simple toy problems and go ‘what, no way, this sounds easy’, get annoyed/nerdsniped, run off to write some equations on a whiteboard, and come back a week or a year later with a formalism (maybe from some niche mathematical field) that works totally fine for this, and makes it easier to formalize lots of other alignment problems in simplified settings (e.g., with unbounded computation).
Or failing that, the hope was that someone might at least come up with a clever math hack that solves the immediate ‘get the AI to fill the bucket and halt’ problem and replaces this dumb-sounding theory question with a slightly deeper theory question.
By using a children’s cartoon to illustrate the toy problem, we hoped to make it clearer that the genre here is “toy problem to illustrate a weird conceptual issue in trying to define certain alignment properties”, not “robotics problem where we show a bunch of photos of factory robots and ask how we can build a good factory robot to refill water receptacles used in industrial applications”.
Nate’s version of the talk, which is mostly a more polished version of Eliezer’s talk, is careful to liberally sprinkle in tons of qualifications like (emphasis added)
“… for systems that are sufficiently good at modeling their environment”,
‘if the system is smart enough to recognize that shutdown will lower its score’,
“Relevant safety measures that don’t assume we can always outthink and outmaneuver the system...”,
… to make it clearer that the general issue is powerful, strategic optimizers that have high levels of situational awareness, etc., not necessarily ‘every system capable enough to fill a bucket of water’ (or ‘every DL system...’).
Remember that MIRI was in the business of poking at theoretical toy problems and trying to get less conceptually confused about how you could in principle cleanly design a reliable, aimable reasoner. MIRI wasn’t (and isn’t) in the business of issuing challenges to capabilities researchers to build a working water-bucket-filler as soon as possible, and wasn’t otherwise in the business of challenging people to race to AGI faster.
It wouldn’t have occurred to me that someone might think ‘can a deep net fill a bucket of water, in real life, without being dangerously capable’ is a crucial question in this context; I’m not sure we ever even had the thought occur in our heads ‘when might such-and-such DL technique successfully fill a bucket?’. It would seem just as strange to me as going to check the literature to make sure no GOFAI system ever filled a bucket of water.
(And while I think I understand why others see ChatGPT as a large positive update about alignment’s difficulty, I hope it’s also obvious why others, MIRI included, would not see it that way.)
Hacky approaches to alignment do count just as much as clean, scrutable, principled approaches—the important thing is that the AGI transition goes well, not that it goes well and feels clean and tidy in the process. But in this case the messy empirical approach doesn’t look to me like it actually lets you build a corrigible AI that can help with a pivotal act.
If general-ish DL methods were already empirically OK at filling water buckets in 2016, just as GOFAI already was in 2016, I suspect we still would have been happy to use the Fantasia example, because it’s a simple well-known story that can help make the abstract talk of utility functions and off-switch buttons easier to mentally visualize and manipulate.
(Though now that I’ve seen the confusion the example causes, I’m more inclined to think that the strawberry problem is a better frame than the Fantasia example.)
I think this reply is mostly talking past my comment.
I know that MIRI wasn’t claiming we didn’t know how to safely make deep learning systems, GOFAI systems, or what-have-you fill buckets of water, but my comment wasn’t about those systems. I also know that MIRI wasn’t issuing a water-bucket-filling challenge to capabilities researchers.
My comment was specifically about directing an AGI (which I think GPT-4 roughly is), not deep learning systems or other software generally. I *do* think MIRI was claiming we didn’t know how to make AGI systems safely do mundane tasks.
I think some of Nate’s qualifications are mainly about the distinction between AGI and other software, and others (such as “[i]f the system is trying to drive up the expectation of its scoring function and is smart enough to recognize that its being shut down will result in lower-scoring outcomes”) mostly serve to illustrate the conceptual frame MIRI was (and largely still is) stuck in about how an AGI would work: an argmaxer over expected utility.
[Edited to add: I’m pretty sure GPT-4 is smart enough to know the consequences of its being shut down, and yet dumb enough that, if it really wanted to prevent that from one day happening, we’d know by now from various incompetent takeover attempts.]
I’m not saying that GPT-4 is lying to us—that part is just clarifying what I think Matthew’s claim is.
Re cauldron: I’m pretty sure MIRI didn’t think that. Why would they?
Okay. I do agree that one way to frame Matthew’s main point is that MIRI thought it would be hard to specify the human value function, and an LM that understands human values and reliably tells us the truth about that understanding is such a specification, and hence falsifies that belief.
To your second question: MIRI thought we couldn’t specify the value function to do the bounded task of filling the cauldron, because any value function we could naively think of writing, when given to an AGI (which was assumed to be a utility argmaxer), leads to all sorts of instrumentally convergent behavior such as taking over the world to make damn sure the cauldron is really filled, since we forgot all the hidden complexity of our wish.
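A toy sketch of that argument, assuming (as the paragraph above does) an AGI modeled as an expected-utility argmaxer. The action names and probabilities are hypothetical and purely illustrative:

```python
def naive_utility(world: dict) -> float:
    """The naively written value function: 1 if the cauldron is full, else 0.
    It says nothing about side effects -- that's the hidden complexity we forgot."""
    return 1.0 if world["cauldron_full"] else 0.0


# Each action induces a distribution over worlds, written as (probability, world)
# pairs. The numbers are made up; the takeover action makes 'full' maximally certain.
ACTIONS = {
    "fill cauldron, then halt": [
        (0.99, {"cauldron_full": True, "side_effects": False}),
        (0.01, {"cauldron_full": False, "side_effects": False}),
    ],
    "seize control of everything": [
        (1.00, {"cauldron_full": True, "side_effects": True}),
    ],
}


def expected_utility(action: str) -> float:
    return sum(p * naive_utility(world) for p, world in ACTIONS[action])


# An argmaxer over this utility prefers the takeover action: 1.0 > 0.99.
best_action = max(ACTIONS, key=expected_utility)
```

Since taking over the world drives the probability of “cauldron really filled” slightly higher than just filling it, the argmaxer picks it, and the naive utility never registers the cost. That’s the instrumental-convergence worry in miniature.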
I think this is similar enough (and false for the same reasons)
I agree the claim is “similar”. It’s actually a distinct claim, though. What are the reasons why it’s false? (And what do you mean by saying that what I wrote is “false”? I think the historical question is what’s important in this case. I’m not saying that solving the value specification problem means that we have a full solution to the alignment problem, or that inner alignment is easy now.)
I think it’s false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it’s false is mostly that I haven’t seen a claim like that made anywhere, including in the posts you cite.
I agree lots of the responses elide the part where you emphasize that it’s important how GPT-4 doesn’t just understand human values, but is also “willing” to answer questions somewhat honestly. TBH I don’t understand why that’s an important part of the picture for you, and I can see why some responses would just see the “GPT-4 understands human values” part as the important bit (I made that mistake too on my first reading, before I went back and re-read).
It seems to me that trying to explain the original motivations for posts like Hidden Complexity of Wishes is a good attempt at resolving this discussion, and it looks to me as if the responses from MIRI are trying to do that, which is part of why I wanted to disagree with the claim that the responses are missing the point / not engaging productively.
I think it’s false in the sense that MIRI never claimed that it would be hard to build an AI with GPT-4 level understanding of human values + GPT-4 level of willingness to answer honestly (as far as I can tell). The reason I think it’s false is mostly that I haven’t seen a claim like that made anywhere, including in the posts you cite.
I don’t think it’s necessary for them to have made that exact claim. The point is that they said value specification would be hard.
If you solve value specification, then you’ve arguably solved a large part of the outer alignment problem. Then, you just need to build a function maximizer that allows you to robustly maximize the utility function that you’ve specified. [ETA: btw, I’m not saying the outer alignment problem has been fully solved already. I’m making a claim about progress, not about whether we’re completely finished.]
I interpret MIRI as saying “but the hard part is building a function maximizer that robustly maximizes any utility function you specify”. And while I agree that this represents their current view, I don’t think this was always their view. You can read the citations in the post carefully, and I don’t think they support the idea that they’ve consistently always considered inner alignment to be the only hard part of the problem. I’m not claiming they never thought inner alignment was hard. But I am saying they thought value specification would be hard and an important part of the alignment problem.
I think the specification problem is still hard and unsolved. It looks like you’re using a different definition of ‘specification problem’ / ‘outer alignment’ than others, and this is causing confusion.
IMO all these terms are a bit fuzzy / hard to pin down, and so it makes sense that they’d lead to disagreement sometimes. The best way (afaict) to avoid this is to keep the terms grounded in ‘what would be useful for avoiding AGI doom’? To me it looks like on your definition, outer alignment is basically a trivial problem that doesn’t help alignment much.
More generally, I think this discussion would be more grounded / useful if you made more object-level claims about how value specification being solved (on your view) might be useful, rather than meta claims about what others were wrong about.
Can you explain how you’re defining outer alignment and value specification?
I’m using this definition, provided by Hubinger et al.
the outer alignment problem is an alignment problem between the system and the humans outside of it (specifically between the base objective and the programmer’s intentions). In the context of machine learning, outer alignment refers to aligning the specified loss function with the intended goal, whereas inner alignment refers to aligning the mesa-objective of a mesa-optimizer with the specified loss function.
Evan Hubinger provided clarification about this definition in his post “Clarifying inner alignment terminology”,
Outer Alignment: An objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.[2]
I deliberately avoided using the term “outer alignment” in the post because I wanted to be more precise and not get into a debate about whether the value specification problem matches this exact definition. (I think the definitions are subtly different but the difference is not very relevant for the purpose of the post.) Overall, I think the two problems are closely associated and solving one gets you a long way towards solving the other. In the post, I defined the value identification/specification problem as,
I am mainly talking about the problem of how to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans.
This was based on the Arbital entry for the value identification problem, which was defined as a
subproblem category of value alignment which deals with pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes.
I should note that I used this entry as the primary definition in the post because I was not able to find a clean definition of this problem anywhere else.
I’d appreciate it if you clarified whether you are saying:
1. That my definition of the value specification problem is different from how MIRI would have defined it in, say, 2017. You can use Nate Soares’ 2016 paper or their 2017 technical agenda to make your point.
2. That my definition matches how MIRI used the term, but the value specification problem remains very hard and unsolved, and GPT-4 is not even a partial solution to this problem.
3. That my definition matches how MIRI used the term, and we appear to be close to a solution to the problem, but a solution to the problem is not sufficient to solve the hard bits of the outer alignment problem.
I’m more sympathetic to (3) than (2), and more sympathetic to (2) than (1), roughly speaking.