I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.
Yeah fair point. I don’t really know what Paul means by corrigibility. (One hypothesis: Paul doesn’t think in terms of consequentialist cognition but in terms of learned behaviors that generalize, and maybe the question “but does it behave that way because it wants the operator’s values to be fulfilled or because it just wants to serve?” seems meaningless from Paul’s perspective. But idk.)
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I’m pretty sure Eliezer would not want the term “corrigibility” to be used for the kind of correctability you get in the early stages of CIRL when the AI doesn’t already know its own values and how to accomplish them better than the operators. (Eliezer actually talked a bunch about this CIRL-like correctability in his 2001 report “Creating Friendly AI”. (Probably not worth your time to read, though given that it was written in 2001, there seemed to me to be some good original thinking going on there of a kind I don’t see often. You can also see Eliezer being optimistic about alignment.))
And I don’t see it as evidence that Eliezer!corrigibility isn’t anti-natural.
(In the following I use “corrigibility” in the Eliezer-sense. I’m pretty confident that all of the following matches Eliezer’s model, but not completely sure.)
The motivation behind corrigibility was that aligning superintelligence seemed too hard, so we instead want to aim an AI at a pivotal task that gets humanity on a course to likely aligning superintelligence properly later.
The corrigible AI would be pointed just at accomplishing this task, and not at human values at all. It should be this bounded thing that only cares about this bounded task and afterwards shuts itself down. It shouldn’t do the task because it wants to accomplish human values and the task seems like a good way to accomplish them: human values are unbounded, and an AI pointed at them might be less likely to shut itself down afterwards. Corrigibility has nothing to do with human values.
Roughly speaking, we can perhaps disentangle 3 corrigibility approaches:
1. Train for corrigible behavior.
I think Eliezer thinks that this will only create behavioral heuristics that won’t get integrated into the optimization target of the powerful optimizer, and that the optimizer will see those as constraints to find ways around or remove. Since doing a pivotal act requires a lot of optimization power, it might find a way around those constraints, or use the nearest unblocked strategy, which might still be undesirable.
(There might also be downsides of training for corrigible behavior, e.g. the optimization becoming less understandable and less predictable.)
2. Integrate corrigibility principles into the optimization.
These approaches are about trying to design the way the optimization works in ways that make it safer and less likely to blow up.
3. Coherent corrigibility / The hard problem of corrigibility.
If a solution here were found, it might have the shape of a utility function saying “serve the operators”, not “serve because you want the operators’ values to be fulfilled”. (I’m less sure here whether I understand this correctly.) I think Max Harms is trying to make some progress on this.
The main plan isn’t to try to get coherent corrigibility, but just to build something limited that optimizes in a way that can still get something pivotal done without wanting to take over the universe. Not that it has a coherent goal whose optimum wouldn’t be taking over the universe; rather, it just doesn’t think those thoughts and just does its task.
Ideally it would be something that doesn’t think in the general domain at all. E.g. imagine something like AlphaFold 5 that isn’t trained on text at all and is only very good at modelling protein interactions, which could e.g. help us get relevant understanding of neuronal cell dynamics that we could use for significantly enhancing adult human intelligence. (I’m just sketching a silly, unrealistic, sorta-concrete scenario.) But it seems unlikely we will be able to do something impressive with narrow reasoners at our level of understanding.
But even though we don’t aim for a coherent mind, if more of the parts that make the AI safe/corrigible have a coherent shape, e.g. if we find a working shutdown-utility function, that still improves safety: it means those parts of the AI don’t obviously break in the limit of optimization pressure, so they are also less likely to break under “only” pivotal levels of optimization.
Not a full response, but some notes:
I agree Eliezer likely wouldn’t want “corrigibility” to refer to the thing I’m imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
I disagree that in early-CIRL “the AI doesn’t already know its own values and how to accomplish them better than the operators”. It knows that its goal is to optimize the human’s utility function, and it can be better than the human at eliciting that utility function. It just doesn’t have perfect information about what the human’s utility function is.
I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I find it pretty plausible that shutdown corrigibility is especially anti-natural. Relatedly, (1) most CIRL agents will not satisfy shutdown corrigibility even at early stages, (2) most of the discussion on Paul!corrigibility doesn’t emphasize or even mention shutdown corrigibility.
I agree Eliezer has various strategic considerations in mind that bear on how he thinks about corrigibility. I mostly don’t share those considerations.
I’m not quite sure if you’re trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else. If it’s (1), you’ll need to understand my strategic considerations (you can pretend I’m Paul, that’s not quite accurate but it covers a lot). If it’s (2), I would focus elsewhere, I have spent quite a lot of time engaging with the Eliezer / Nate perspective.
I agree Eliezer likely wouldn’t want “corrigibility” to refer to the thing I’m imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
Yeah thanks for distinguishing. It’s not at all obvious to me that Paul would call CIRL “corrigible”—I’d guess not, but idk.
My model of what Paul thinks about corrigibility matches my model of corrigibility much more closely than CIRL does. It’s possible that the EY-Paul disagreement mostly comes down to consequentialism. CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain.
I disagree that in early-CIRL “the AI doesn’t already know its own values and how to accomplish them better than the operators”. It knows that its goal is to optimize the human’s utility function, and it can be better than the human at eliciting that utility function. It just doesn’t have perfect information about what the human’s utility function is.
Sorry, that was very poorly phrased by me. What I meant was “the AI doesn’t already know how to evaluate what’s best according to its own values better than the operators”. So yes, I agree. I still find it confusing though why people started calling that corrigibility.
In your previous comment you wrote:
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I don’t understand why you think this. It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn’t. (Of course it doesn’t matter if all goes well, because the CIRL AI would go on and become an aligned superintelligence, but it’s not correctable, and I don’t see why you think it’s evidence.)
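(For reference, a compressed sketch of the CIRL setup being argued about, as I understand the standard formulation: a two-player game in which the human H and the robot R both act to maximize the human’s reward, but only the human observes the reward parameters θ.)

$$
\max \; \mathbb{E}\Big[\textstyle\sum_t \gamma^t\, R(s_t, a^H_t, a^R_t;\, \theta)\Big], \qquad \theta \sim P_0 \ \text{observed only by } H.
$$

So the robot’s goal is fixed (maximize the human’s θ-parameterized reward); what it lacks is information about θ, and the correctability being debated here comes from that information gap.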
I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at high human level, so I don’t know what you mean here.
Also, in my view corrigibility isn’t just about what happens if the alignment works out totally fine, but about still maintaining correctability if it doesn’t:
If something goes wrong with CIRL so that its goal isn’t pointed at the human utility function anymore, it would not want the operators to correct it.
One central hope behind corrigibility was that if something went wrong that changed the optimization target, the AI would still let operators correct it as long as the simple corrigibility part kept working. (Where the hope was that there would be a quite simple and robust such corrigibility part, but we haven’t found it yet.)
E.g. if you look at the corrigibility paper, you could imagine that if they actually found a utility function combined from U_normal and U_shutdown with the desirable properties, it would stay shutdownable if U_normal changed in an undesirable way (e.g. in case it rebinds incorrectly after an ontology shift).
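(For concreteness, the kind of combined utility function meant here has roughly this shape; this is a gloss rather than the paper’s exact formulation, with the correction term c being the utility-indifference variant discussed in that literature:)

$$
U(a) \;=\;
\begin{cases}
U_{\text{normal}}(a) & \text{if the shutdown button is not pressed,}\\
U_{\text{shutdown}}(a) + c & \text{if it is pressed,}
\end{cases}
$$

where c would be chosen to make the agent indifferent to whether the button gets pressed, and the hard part is finding a version with the desirable properties, e.g. no incentive to cause or prevent the press. The point here is that U_shutdown and the button machinery don’t depend on the content of U_normal, so that part could keep working even if U_normal got corrupted.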
Though another way you can keep being able to correct the AI’s goals is by having the AI not think much in the general domain about stuff like “the operators may change my goals” or so.
(Most of the corrigibility principles are about a different part of corrigibility, but I think this “be able to correct the AI even if something goes a bit wrong with its alignment” is a central part of corrigibility.)
I’m not quite sure if you’re trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else.
Mainly 3 and 4. But I am interested in seeing your reactions to get a better model of how some people think about corrigibility.
I think you are being led astray by having a one-dimensional notion of intelligence.
What I meant was “the AI doesn’t already know how to evaluate what’s best according to its own values better than the operators”.
Well yes, that is the idea, there is information asymmetry between the AI and humans. Note that this can still apply even when the AI is much smarter than the humans.
CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. [...]
It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn’t.
I disagree that this property necessarily goes away as soon as the AI is “smarter” or has “more common sense”. You identified the key property yourself: it’s that the humans have an advantage over the AI at (particular parts of) evaluating what’s best. (More precisely, it’s that the humans have information that the AI does not have; it can still work even if the humans don’t use their information to evaluate what’s best.)
Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)
Why can’t this apply in the AI / human case?
I still find it confusing though why people started calling that corrigibility.
I’m not calling that property corrigibility, I’m saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about “naturalness” of corrigibility.
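To make this concrete, here is a minimal toy model in the spirit of the off-switch game analysis (Hadfield-Menell et al.): the AI maximizes expected utility under a prior over the human’s utility u for its proposed action, and treats a human objection as noisy evidence about u. The Gaussian prior, the noisily-rational human model, and the numbers are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def eu_of_acting_after_objection(prior_mean, prior_std, human_noise=1.0, n=200_000):
    """E[u | human objects]: the AI's posterior estimate of its action's value,
    where a noisily-rational human objects with probability sigmoid(-u / human_noise)."""
    u = rng.normal(prior_mean, prior_std, n)           # AI's prior over the human's utility u
    p_object = 1.0 / (1.0 + np.exp(u / human_noise))   # likelihood of an objection given u
    return np.average(u, weights=p_object)             # posterior mean via importance weighting

# Wide prior: the AI knows little about what the human wants, so an objection is
# strong evidence the action is bad -> expected utility of acting is negative -> defer.
print(eu_of_acting_after_objection(prior_mean=1.0, prior_std=5.0))

# Sharp prior: the AI is already confident the action is good, so the same objection
# barely moves its estimate -> expected utility of acting stays positive -> override.
print(eu_of_acting_after_objection(prior_mean=1.0, prior_std=0.3))
```

The deferential behavior falls out of plain EU maximization plus the information asymmetry, and it weakens as the asymmetry shrinks, which is roughly the shape of the disagreement in the rest of this exchange.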
I think you are being led astray by having a one-dimensional notion of intelligence.
(I do agree that we can get narrowly superhuman CIRL-like AI which we can then still shut down because it trusts humans more about general strategic considerations. But I think if your plan is to let the AI solve alignment or coordinate the world to slow down AI progress, this won’t help you much for the parts of the problem we are most bottlenecked on.)
You identified the key property yourself: it’s that the humans have an advantage over the AI at (particular parts of) evaluating what’s best. (More precisely, it’s that the humans have information that the AI does not have; it can still work even if the humans don’t use their information to evaluate what’s best.)
I agree that the AI may not be able to precisely predict what exact tradeoffs each operator might be willing to make, e.g. between the required time and the safety of a project, but I think it would be able to predict them well enough that the resulting differences in strategy wouldn’t be large.
Or do you imagine strategically keeping some information from the AI?
Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)
Even if the alignment works out perfectly: when the AI is smarter and the humans say “actually, we want to shut you down”, the AI does update toward the humans probably being worried about something. But if the AI is smart enough and sees that the humans are worried about something that isn’t actually going to happen, it can just say “sorry, that’s not actually in your extrapolated interests, you will perhaps understand later when you’re smarter”, and then go on trying to fulfill human values.
But if we’re confident that alignment to humans will work out, we don’t need corrigibility. Corrigibility is rather intended so that we might be able to recover if something goes wrong.
If the values of the AI drift a bit, then the AI will likely notice this before the humans do, and take measures so that the humans don’t find out or won’t (be able to) change its values back, because that’s the strategy that’s best according to the AI’s new values.
Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)
Likewise just updating on new information, not changing terminal goals.
Also note that parents often think (sometimes correctly) that they know better what is in the child’s extrapolated interests, and then don’t act according to the child’s stated wishes.
And I think superhumanly smart AIs will likely be better at guessing what is in a human’s interests than parents guessing what is in their child’s interest, so the cases where the strategy gets updated are less significant.
I’m saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about “naturalness” of corrigibility.
From my perspective CIRL doesn’t really show much correctability if the AI is generally smarter than humans. That would only be the case if a smart AI were somehow quite bad at guessing what humans want, so that when we tell it what we want it would importantly update its strategy, including shutting itself down because it believes that is then the best way to accomplish its goal. (I might still not call it corrigible, but I would see your point about corrigible behavior.)
I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.