I agree Eliezer likely wouldn’t want “corrigibility” to refer to the thing I’m imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
Yeah thanks for distinguishing. It’s not at all obvious to me that Paul would call CIRL “corrigible”—I’d guess not, but idk.
My model of what Paul thinks about corrigibility matches my model of corrigibility much, much more closely than CIRL does. It’s possible that the EY-Paul disagreement mostly comes down to consequentialism. CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain.
I disagree that in early-CIRL “the AI doesn’t already know its own values and how to accomplish them better than the operators”. It knows that its goal is to optimize the human’s utility function, and it can be better than the human at eliciting that utility function. It just doesn’t have perfect information about what the human’s utility function is.
Sorry, that was very poorly phrased by me. What I meant was “the AI doesn’t already know how to evaluate what’s best according to its own values better than the operators”. So yes, I agree. I still find it confusing, though, why people started calling that corrigibility.
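(To be concrete about the setup we both seem to be describing, here is a minimal toy sketch, with made-up details rather than anything from the CIRL paper: the AI’s terminal objective is fixed as “maximize the human’s utility function U_theta”, and human feedback only moves its belief over theta.)

```python
import numpy as np

# Minimal toy sketch (made-up details, not from the CIRL paper): the AI's
# terminal objective is "maximize the human's utility function U_theta",
# but theta is unknown, so the AI keeps a posterior over candidate thetas
# and updates it from noisy human feedback.

thetas = np.array([-1.0, 0.0, 1.0])       # candidate preference parameters
posterior = np.array([1/3, 1/3, 1/3])     # the AI starts maximally uncertain

def feedback_likelihood(feedback, theta, beta=2.0):
    # Assumed noisy-rational human: approval (+1) is more likely when the
    # action scores well under the true theta (a Boltzmann-style model).
    score = feedback * theta
    return np.exp(beta * score) / (np.exp(beta * score) + np.exp(-beta * score))

def update(posterior, feedback):
    likelihood = np.array([feedback_likelihood(feedback, t) for t in thetas])
    new_posterior = posterior * likelihood
    return new_posterior / new_posterior.sum()

# Two approvals shift the belief toward theta = 1.0: the AI learns about its
# fixed objective; the objective itself never changes.
for fb in [+1, +1]:
    posterior = update(posterior, fb)
print(dict(zip(thetas.tolist(), posterior.round(3).tolist())))
```

(The objective never changes here; only the belief about theta does.)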
In your previous comment you wrote:
I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.
I don’t understand why you think this. It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn’t. (Of course, if all goes well this doesn’t matter, because the CIRL AI would go on to become an aligned superintelligence; but it’s not correctable, and I don’t see why you think it’s evidence.)
I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at high human level, so I don’t know what you mean here.
Also, in my view corrigibility isn’t just about what happens if the alignment works out totally fine, but also about still maintaining correctability if it doesn’t:
If something goes wrong with CIRL so that its goal isn’t pointed at the human utility function anymore, it would not want the operators to correct it.
One central hope behind corrigibility was that if something went wrong that changed the optimization target, the AI would still let operators correct it as long as the simple corrigibility part kept working. (Where the hope was that there would be a quite simple and robust such corrigibility part, but we haven’t found it yet.)
E.g. if you look at the corrigibility paper, you could imagine that if they had actually found a utility function combined from U_normal and U_shutdown with the desirable properties, it would stay shutdownable even if U_normal changed in an undesirable way (e.g. in case it rebinds incorrectly after an ontology shift).
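For concreteness, the utility-indifference combination discussed there looks roughly like this (a from-memory sketch; the paper gives the precise definition and its known problems):

$$
U(a_1, o, a_2) =
\begin{cases}
U_{\text{normal}}(a_1, a_2) & \text{if the shutdown button is not pressed} \\
U_{\text{shutdown}}(a_1, a_2) + f(a_1) & \text{if the shutdown button is pressed}
\end{cases}
$$

where $f(a_1)$ is a correction term chosen so the agent is indifferent, when choosing $a_1$, to whether the button gets pressed (roughly $f(a_1) = \max_{a_2} \mathbb{E}[U_{\text{normal}} \mid a_1, a_2] - \max_{a_2} \mathbb{E}[U_{\text{shutdown}} \mid a_1, a_2]$). The relevant feature for the hope above is that the pressed branch optimizes U_shutdown regardless of what U_normal currently says.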
Though another way you can keep being able to correct the AI’s goals is by having the AI not think much in the general domain about stuff like “the operators may change my goals” or so.
(Most of the corrigibility principles are about a different part of corrigibility, but I think this “be able to correct the AI even if something goes a bit wrong with its alignment” is a central part of corrigibility.)
I’m not quite sure if you’re trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else.
Mainly 3 and 4. But I am interested in seeing your reactions to get a better model of how some people think about corrigibility.
I think you are being led astray by having a one-dimensional notion of intelligence.
What I meant was “the AI doesn’t already know how to evaluate what’s best according toits own values better than the operators”.
Well yes, that is the idea, there is information asymmetry between the AI and humans. Note that this can still apply even when the AI is much smarter than the humans.
CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. [...]
It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn’t.
I disagree that this property necessarily goes away as soon as the AI is “smarter” or has “more common sense”. You identified the key property yourself: it’s that the humans have an advantage over the AI at (particular parts of) evaluating what’s best. (More precisely, it’s that the humans have information that the AI does not have; it can still work even if the humans don’t use their information to evaluate what’s best.)
Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)
Why can’t this apply in the AI / human case?
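To make the information-asymmetry point concrete, here is a minimal toy sketch in the spirit of the off-switch game (my own illustrative numbers and assumptions, not code from any paper): the robot’s objective is the human’s utility u for some plan, but it only has a belief over u, while the human observes u and can say “stop”. Deferring to “stop” is then the EU-maximizing policy by the robot’s own lights, however capable the robot otherwise is:

```python
import numpy as np

# Toy sketch in the spirit of the off-switch game (illustrative numbers,
# not an implementation from any paper): the robot's objective is the
# human's utility u for some plan, but it only has a belief over u;
# the human observes u and can say "stop".
rng = np.random.default_rng(0)

def expected_value(defer_to_human, u_samples):
    """Average human-utility achieved under two robot policies."""
    values = []
    for u in u_samples:
        human_says_stop = u < 0          # the human uses information the robot lacks
        if defer_to_human and human_says_stop:
            values.append(0.0)           # robot shuts down: no gain, but no harm
        else:
            values.append(u)             # robot executes the plan anyway
    return float(np.mean(values))

# Robot's belief: the plan is probably good (mean +0.5) but might be bad.
u_samples = rng.normal(loc=0.5, scale=1.0, size=10_000)

print("EV, always execute: ", expected_value(False, u_samples))
print("EV, defer to 'stop':", expected_value(True, u_samples))
# Deferring scores higher by the robot's own objective, because the human's
# "stop" carries information the robot's estimate doesn't.
```

Nothing in this comparison depends on the robot being dumber than the human; it only depends on the human having information (here, the sign of u) that the robot lacks.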
I still find it confusing though why people started calling that corrigibility.
I’m not calling that property corrigibility, I’m saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about “naturalness” of corrigibility.
I think you are being led astray by having a one-dimensional notion of intelligence.
(I do agree that we can get narrowly superhuman CIRL-like AI which we can then still shut down because it trusts humans more about general strategic considerations. But I think if your plan is to let the AI solve alignment or coordinate the world to slow down AI progress, this won’t help you much for the parts of the problem we are most bottlenecked on.)
You identified the key property yourself: it’s that the humans have an advantage over the AI at (particular parts of) evaluating what’s best. (More precisely, it’s that the humans have information that the AI does not have; it can still work even if the humans don’t use their information to evaluate what’s best.)
I agree that the AI may not be able to precisely predict what exact tradeoffs each operator might be willing to make, e.g. between required time and safety of a project, but I think it would be able to predict it well enough that the differences in what strategy it uses wouldn’t be large.
Or do you imagine strategically keeping some information from the AI?
Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)
Even if the alignment works out perfectly: when the AI is smarter and the humans say “actually we want to shut you down”, the AI does update toward the humans probably being worried about something. But if the AI is smart enough and sees that the humans are worried about something that isn’t actually going to happen, it can just say “sorry, that’s not actually in your extrapolated interests, you will perhaps understand later when you’re smarter”, and then go on trying to fulfill human values. (See the toy sketch below.)
But if we’re confident alignment to humans will work out we don’t need corrigibility. Corrigibility is rather intended so we might be able to recover if something goes wrong.
If the values of the AI drift a bit, then the AI will likely notice this before the humans do, and take measures so that the humans don’t find out or won’t (be able to) change its values back, because that’s the strategy that’s best according to the AI’s new values.
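Here is the toy version of that worry, in the same spirit as the sketch above (again my own illustration with assumed numbers): once the AI’s residual uncertainty about what the humans want is small and it models the humans as sometimes mistaken, deferring to “stop” stops being the EU-maximizing move:

```python
import numpy as np

# Same toy setup as above, but now the robot's belief about u is sharp and
# accurate, and it models the human as occasionally objecting by mistake.
rng = np.random.default_rng(1)

def value_of_deferring(belief_std, human_error_rate, n=10_000):
    u = rng.normal(loc=0.5, scale=belief_std, size=n)  # robot's residual uncertainty about u
    mistaken = rng.random(n) < human_error_rate        # human sometimes says "stop" wrongly
    says_stop = (u < 0) ^ mistaken
    ev_defer = np.mean(np.where(says_stop, 0.0, u))    # shut down whenever told to
    ev_ignore = np.mean(u)                             # execute regardless
    return float(ev_defer - ev_ignore)

print("broad belief, reliable human:", value_of_deferring(1.0, 0.0))  # > 0: deferring wins
print("sharp belief, fallible human:", value_of_deferring(0.1, 0.2))  # < 0: deferring loses
```

So the deference in the earlier sketch only persists while the AI stays importantly uncertain about what the humans want and trusts their signal over its own estimate.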
Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)
Likewise just updating on new information, not changing terminal goals.
Also note that parents often think (sometimes correctly) that they know better what is in the child’s extrapolated interests, and then don’t act according to the child’s stated wishes.
And I think superhumanly smart AIs will likely be better at guessing what is in a human’s interests than parents guessing what is in their child’s interest, so the cases where the strategy gets updated are less significant.
I’m saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about “naturalness” of corrigibility.
From my perspective, CIRL doesn’t really show much correctability if the AI is generally smarter than humans. It would only do so if a smart AI were somehow quite bad at guessing what humans wanted, so that when we tell it what we want it would importantly update its strategy, up to and including shutting itself down because it believes that is then the best way to accomplish its goal. (I might still not call it corrigible, but I would see your point about corrigible behavior.)
I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.