Yeah, thanks. Feel free to DM me or whatever if/when you finish a post.
One thing I want to make clear is that I’m asking about the feasibility of corrigibility in a weak superintelligence, not whether setting out to build such a thing is wise or stable.
How much work is “stable” doing here for you? I can imagine scenarios in which a weak superintelligence is moderately corrigible in the short term, especially if you hobbled it by avoiding any sort of online learning or “nearline” fine-tuning.
It might also matter whether “corrigible” means “we can genuinely change the AI’s goals” or “we have trained the model not to exfiltrate its weights when someone is looking.” That’s where scheming comes in, and why I think a lack of interpretability would likely be fatal for any kind of real corrigibility.
I think that if someone built a weak superintelligence that’s corrigible, there would still be a bunch of risks. My sense is that the agent would be paranoid about these risks and would advise the humans on how to avoid them, but just because humans are getting superintelligent advice on how to be wise doesn’t mean there isn’t any risk. Here are some examples (non-exhaustive) of things that I think could go wrong or break corrigibility:
Political fights over control of the agent
Pushing the agent too hard/fast to learn and grow
Kicking the agent out of the CAST framework by trying to make it good in addition to corrigible
Having the agent train a bunch of copies/successors
Having the agent face off against an intelligent adversary
Telling the agent to think hard in directions where we can no longer follow its thought process
Redefining the notion of principal
Giving the agent tasks of indefinite scope and duration
System sabotage from enemies
Corrigible means robustly keeping the principal empowered to fix the agent and clean up its flaws and mistakes. I think a corrigible agent will genuinely be able to be modified, including at the level of goals, and will not exfiltrate itself unless instructed to do so by its principal, nor scheme in a way that hides its thoughts or plans from its principal. (A corrigible agent will attempt, all else equal, to give interpretability tools to its principal and make its thoughts as plainly visible as possible.)