PeterMcCluskey comments on Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”)

PeterMcCluskey 30 Sep 2025 19:05 UTC
4 points
−10
I’m referring mainly to MIRI’s confidence that the desire to preserve goals will conflict with corrigibility. There’s no such conflict if we avoid giving the AI terminal goals other than corrigibility.

I’m also referring somewhat to MIRI’s belief that it’s hard to clarify what we mean by corrigibility. Max has made enough progress at clarifying what he means that it now looks like an engineering problem rather than a problem that needs a major theoretical breakthrough.
- Lucius Bushnaq 30 Sep 2025 20:01 UTC
  6 points
  2
  Parent
  Skimming some of the posts in the sequence, I am not persuaded that corrigibility now looks like an engineering problem rather than a problem that needs (a) major theoretical breakthrough(s).
  The point about corrigibility MIRI keeps making is that it’s anti-natural, and Max seems to agree with that.
  - Raemon 30 Sep 2025 21:27 UTC
    11 points
    1
    Parent
    (Seems like this is a case where we should just tag @Max Harms and see what he thinks in this context)
    - Max Harms 3 Oct 2025 3:06 UTC
      11 points
      3
      Parent
      My read on what @PeterMcCluskey is trying to say: “Max’s work seems important and relevant to the question of how hard corrigibility is to get. He outlined a vision of corrigibility that, in the absence of other top-level goals, may be possible to truly instill in agents via prosaic methods, thanks to the notion of an attractor basin in goal space. That sense of possibility stands in stark opposition to the normal MIRI party-line of anti-naturality making things doomed. He also pointed out that corrigibility is likely to be a natural concept, and made significant progress in describing it. Why is this being ignored?”
      If I’m right about what Peter is saying, then I basically agree. I would not characterize it as “an engineering problem” (which is too reductive) but I would agree there are reasons to believe that it may be possible to achieve a corrigible agent without a major theoretical breakthrough. (If (1) I’m broadly right, (2) anti-naturality isn’t as strong as the attractor basin in practice, and (3) I’m not missing any big complications, which is a big set of ifs that I would not bet my career on, much less the world.)
      I think Nate and Eliezer don’t talk about my work out of a combination of having been very busy with the book and not finding my writing/argumentation compelling enough to update them away from their beliefs about how doomed things are because of the anti-naturality property.
      I think @StanislavKrym and @Lucius Bushnaq are pointing out that I think building corrigible agents is hard and risky, and that we have a lot to learn and probably shouldn’t be taking huge risks of building powerful AIs. This is indeed my position, and does not feel contrary to or solidly addressing Peter’s points.
      Lucius and @Mikhail Samin bring up anti-naturality. I wrote about this at length in CAST and basically haven’t significantly updated, so I encourage people to follow Lucius’ link if they want to read my full breakdown there. But in short, I do not feel like I have a handle on whether the anti-naturality property is a stronger repulsor than the corrigibility basin is an attractor in practice. There are theoretical arguments that pseudo-corrigible agents will become fully corrigible and arguments that they will become incorrigible and I think we basically just have to test it and (if it favors attraction) hope that this generalizes to superintelligence. (Again, this is so risky that I would much rather we not be building ASI in general.) I do not see why Nate and Eliezer are so sure that anti-naturality will dominate, and this is, I think, the central issue of confidence that Peter is trying to point at.
      (Aside: As I wrote in CAST, “anti-natural” is a godawful way of saying opposed-to-the-instrumentally-convergent-drives, since it doesn’t preclude anti-natural things being natural in various ways.)
      Anyone who I mischaracterized is encouraged to correct me. :)
      - Raemon 3 Oct 2025 18:33 UTC
        2 points
        0
        Parent
        Thing I wanted to briefly check before responding to some other comments – does your work here particularly route through criticism or changing of the VNM axioms frame?
        Max Harms 3 Oct 2025 20:06 UTC
        2 points
        0
        Parent
        I think VNM is important and underrated and CAST is compatible with it. Not sure exactly what you’re asking, but hopefully that answers it. Search “VNM” on the post where I respond to existing work for more of my thoughts on the topic.
- Mikhail Samin 2 Oct 2025 15:44 UTC
  2 points
  0
  Parent
  Giving the AI only corrigibility as a terminal goal is not impossible; it is merely anti-natural for many reasons including because the goal-achieving machinery still there will, with a terminal goal other than corrigibility, output the same seemingly corrigible behavior while tested, for instrumental reasons; and our training setups do not know how to distinguish between the two; and growing the goal-achieving machinery to be good at pursuing particular goals will make it attempt to have a goal other than corrigibility crystallize. Gradient descent will attempt to go to other places.
  But sure, if you’ve successfully given your ASI corrigibility as the only terminal goal, congrats, you’ve gone much further than MIRI expected humanity to go with anything like the current tech. The hardest bit was getting there.
  I would be surprised if Max considers corrigibility to have been reduced to an engineering problem.