If S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But as far as I know we can’t measure corrigibility, so the corrigibility basin of attraction is not a maximum or minimum of anything relevant here. So this isn’t about calculus, as far as I understand.
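To make the "calculus" sense of the term concrete: a toy sketch (mine, not from the thread) of what gradient ascent on a measurable scalar score would give you. If a corrigibility score f were actually measurable and differentiable, ascent on it would pull a whole range of starting points toward the same local maximum; that set of starting points is the maximum’s basin of attraction. The function and learning rate here are arbitrary illustrative choices.

```python
def grad_ascent(x, grad, lr=0.1, steps=1000):
    """Plain gradient ascent on a 1-D score whose gradient is `grad`."""
    for _ in range(steps):
        x += lr * grad(x)
    return x

# Toy score f(x) = -(x - 2)**2, with a single maximum at x = 2.
# Its derivative is f'(x) = -2 * (x - 2).
grad = lambda x: -2.0 * (x - 2.0)

# Every starting point converges to the same maximum, so here the
# basin of attraction is the whole real line.
for x0 in (-5.0, 0.0, 10.0):
    print(round(grad_ascent(x0, grad), 3))  # each prints 2.0
```

The point of the comment above is that nothing like this applies to corrigibility: there is no measurable score to ascend, so "basin of attraction" there has to mean something looser.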
I’m not saying anything about an explicit representation of corrigibility. I’m saying the space of likely updates for an intent-corrigible system might form a “basin” with respect to our intuitive notion of corrigibility.
I’m also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right?
I said relatively low-dimensional! I agree this is high-dimensional; it is still low-dimensional relative to all the false insights and thoughts the AI could have. This doesn’t necessarily defuse your argument, but it seemed like an important refinement: we aren’t considering corrigibility along all dimensions, just those along which updates are likely to take place.
“value drift” feels unusually natural from my perspective
I agree value drift might happen, but I’m somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal.
I agree this is high-dimensional; it is still low-dimensional relative to all the false insights and thoughts the AI could have.
Fair enough. :-)
I agree value drift might happen, but I’m somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal.
I dunno, a system can be extremely powerful and even superintelligent without being omniscient. Also, as a system gets more intelligent, understanding itself generally becomes more difficult at the same time. It also seems very hard (if not strictly impossible) to anticipate the downstream consequences of, say, an insight you haven’t had yet. I guess we could try to build an AGI whose architecture elegantly allows a simple way to extract and understand its goal system, so that it can make a general statement that such-and-such types of learning and insights will not impact its goals in ways it doesn’t want. But that doesn’t seem likely by default; nobody seems to be working towards that end, except maybe MIRI. I sure wouldn’t know how to do that.