My read on what @PeterMcCluskey is trying to say: “Max’s work seems important and relevant to the question of how hard corrigibility is to get. He outlined a vision of corrigibility that, in the absence of other top-level goals, may be possible to truly instill in agents via prosaic methods, thanks to the notion of an attractor basin in goal space. That sense of possibility stands in stark opposition to the normal MIRI party line of anti-naturality making things doomed. He also pointed out that corrigibility is likely to be a natural concept, and made significant progress in describing it. Why is this being ignored?”
If I’m right about what Peter is saying, then I basically agree. I would not characterize it as “an engineering problem” (which is too reductive), but I would agree there are reasons to believe it may be possible to achieve a corrigible agent without a major theoretical breakthrough. (That is, if (1) I’m broadly right, (2) anti-naturality isn’t as strong as the attractor basin in practice, and (3) I’m not missing any big complications, which is a big set of ifs that I would not bet my career on, much less the world.)
I think Nate and Eliezer don’t talk about my work because of some combination of having been very busy with the book and not finding my writing/argumentation compelling enough to update them away from their belief that things are doomed due to the anti-naturality property.
I think @StanislavKrym and @Lucius Bushnaq are pointing out that I consider building corrigible agents hard and risky, that we have a lot to learn, and that we probably shouldn’t be taking huge risks by building powerful AIs. This is indeed my position, but it neither contradicts Peter’s points nor solidly addresses them.
Lucius and @Mikhail Samin bring up anti-naturality. I wrote about this at length in CAST and basically haven’t significantly updated, so I encourage people to follow Lucius’ link if they want to read my full breakdown there. In short, I do not feel like I have a handle on whether, in practice, the anti-naturality property is a stronger repulsor than the corrigibility basin is an attractor. There are theoretical arguments that pseudo-corrigible agents will become fully corrigible and arguments that they will become incorrigible; I think we basically just have to test it and (if it favors attraction) hope that this generalizes to superintelligence. (Again, this is so risky that I would much rather we not be building ASI in general.) I do not see why Nate and Eliezer are so sure that anti-naturality will dominate, and this, I think, is the central issue of confidence that Peter is trying to point at.
(Aside: As I wrote in CAST, “anti-natural” is a godawful way of saying opposed-to-the-instrumentally-convergent-drives, since it doesn’t preclude anti-natural things from being natural in various ways.)
Anyone who I mischaracterized is encouraged to correct me. :)