What exactly do you mean by corrigibility here? Getting an AI to steer towards a notion of human empowerment in such a way that we can ask it to solve uploading and it does so without leading to bad results? Or getting an AI that has solve-uploading levels of capability but would still let us shut it down without resisting (even if it hadn't completed its task yet)? And if the latter, does it need to happen in a clean way, or can it be messy, e.g. we just trained really, really hard to make the AI not think about the off-switch and it somehow, surprisingly, ended up working?
I think the way most alignment researchers (probably including Paul Christiano) would approach training for corrigibility is relatively unlikely to work in time, because they think more in terms of behavior generalization than in terms of steering systems, and I guess they wouldn't aim well enough at getting coherent corrigible steering patterns for the systems to remain corrigible at high levels of optimization power.
It’s possible there’s a smarter way that has better chances.
Pretty unsure about both of those though.
I’m talking about the concept that I discuss in CAST. (You may want to skim some of post #2, which has intuition.)
Yeah, I suppose I could've guessed that.
I read your sequence a while back, but I didn't think carefully enough about this question to evaluate it.