With each AND, the claim gets stronger and more unlikely, such that by the millionth proposition, it starts to feel awfully unlikely that corrigibility is really a broad basin of attraction after all! (Unless this intuitive argument is misleading, of course.)
I think the argument might be misleading, in that local stability isn't that rare in practice, because we aren't drawing local stability independently across all possible directional derivatives around the proposed local minimum.
Gradient updates or self-modification will probably fall into a few (relatively) low-dimensional subspaces (because most possible updates are bad, which is part of why learning is hard). A basin of corrigibility then just means that, for already-intent-corrigible agents, the space of likely gradient updates has local stability with respect to corrigibility.
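To make the "relatively low-dimensional" claim concrete, here is a minimal numerical sketch (my own toy setup, not anything from this exchange; the linear model, the 10-dimensional data subspace, and the rank threshold are all illustrative assumptions). When the data a learner sees has structure, the gradient updates it actually takes span far fewer directions than the raw parameter count:

```python
# Toy demonstration: stack the per-batch gradients of a small model and check
# how many directions they actually span (via the singular-value spectrum).
import numpy as np

rng = np.random.default_rng(0)
n_params, n_batches, batch_size = 200, 100, 32

# Inputs confined to a 10-dimensional subspace of the 200-dim input space.
basis = rng.normal(size=(10, n_params))
w_true = rng.normal(size=n_params)
w = np.zeros(n_params)

grads = []
for _ in range(n_batches):
    x = rng.normal(size=(batch_size, 10)) @ basis        # structured inputs
    y = x @ w_true + 0.1 * rng.normal(size=batch_size)   # noisy linear targets
    err = x @ w - y
    g = x.T @ err / batch_size                           # squared-error gradient
    grads.append(g)
    w -= 1e-4 * g

G = np.stack(grads)                                      # (n_batches, n_params)
s = np.linalg.svd(G, compute_uv=False)
effective_rank = int((s > 1e-6 * s[0]).sum())
print(f"{n_batches} updates in a {n_params}-dim parameter space span "
      f"~{effective_rank} directions")
```

The point of the sketch is just that "the space of likely updates" can be much smaller than "the space of possible updates".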
Separately, I think the informal reasoning goes: you_current probably wouldn't take a pill that makes you_future slightly more willing to murder people. You_current will be particularly wary if you_future will be presented with even more pill-ingestion opportunities (a.k.a. algorithm modifications); you_future will be even more willing to take more pills, as you_future will be more okay with the prospect of wanting to murder people. So, even when offered a large immediate benefit, you_current should not take the pill.
I think this argument is sound, for a wide range of goal-directed agents which can properly reason about their embedded agency. So, for your intuitive argument to survive this reductio ad absurdum, what is the disanalogy with corrigibility in this situation?
Perhaps the AI might not reason properly about embedded agency and accidentally jump out of the basin. Or, perhaps the basin is small and the AI won’t land in it—corrigibility won’t be so important that it doesn’t get traded away for other benefits.
Thanks! I don’t quite follow what local extrema have to do with the argument here. Of course, if you have a system where subsystem S1 is fixed while subsystem S2 is an ML model, and S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But we can’t measure corrigibility as far as I know, so the corrigibility-basin-of-attraction is not a maximum or minimum of anything relevant here. So this isn’t about calculus, as far as I understand.
I’m also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right?
I don’t take much solace in the murder-pill argument. I have a very complicated mix of instincts and desires and beliefs that interact in complicated ways to determine my behavior. If I reached in and made one dimension of my curiosity a bit higher, that seems pretty innocent, but what would be the downstream effects on my relationships, my political opinions, my moral compass? I have no idea. The only way to know for sure would be to simulate my whole mind with and without that change. Every time I read a word or think a thought, I’m subjecting my mind to an uncontrolled experiment. Maybe I’ll read a newspaper article about a comatose person, which makes me ponder the nature of consciousness, and for some reason or another it makes me think that murder is just a little bit less bad than I had thought previously. And having read that article, it’s too late, I can’t roll back my brain to my previous state—and from my new perspective, I wouldn’t want to. I guess AGIs can be rolled back to a previous state more easily than my brain can, but how would that monitoring system work? And what if 3 months elapsed between reading the article and the ensuing reflection about the nature of consciousness?
Anyway, I feel this particularly acutely because I’m not one of those people who discovered the one true ethical theory in childhood and think that it’s perfectly logical and airtight and obvious. I feel confused and uncertain about philosophy and ethics; my opinions have changed in the past and probably will again. So I’m biased; “value drift” feels unusually natural from my perspective. However, I have been very consistent in my opposition to murder :-)
S1 measures the corrigibility of S2 and does gradient ascent on corrigibility, then the system as a whole has a broad basin of attraction for corrigibility, for sure. But we can’t measure corrigibility as far as I know, so the corrigibility-basin-of-attraction is not a maximum or minimum of anything relevant here. So this isn’t about calculus, as far as I understand.
I’m not saying anything about an explicit representation of corrigibility. I’m saying the space of likely updates for an intent-corrigible system might form a “basin” with respect to our intuitive notion of corrigibility.
I’m also not convinced that the space of changes is low-dimensional. Imagine every possible insight an AGI could have in its operating lifetime. Each of these is a different algorithm change, right?
I said relatively low-dimensional! I agree this is high-dimensional; it is still low-dimensional relative to the space of all the false insights and thoughts the AI could have. This doesn't necessarily undercut your argument, but it seemed like an important refinement: we aren't considering corrigibility along all dimensions, just those along which updates are likely to take place.
“value drift” feels unusually natural from my perspective
I agree value drift might happen, but I’m somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal.
I agree this is high-dimensional; it is still low-dimensional relative to all the false insights and thoughts the AI could have.
Fair enough. :-)
I agree value drift might happen, but I’m somewhat comforted if the intent-corrigible AI is superintelligent and trying to prevent value drift as best it can, as an instrumental subgoal.
I dunno, a system can be extremely powerful and even superintelligent without being omniscient. Also, as a system gets more intelligent, understanding itself becomes more difficult at the same time (in general). It is also impossible to anticipate the downstream consequences of, say, having an insight that you haven’t had yet. Well, not impossible, but it seems hard. I guess we can try to make an AGI with an architecture that somehow elegantly allows a simple way to extract and understand its goal system, such that it can make a general statement that such-and-such types of learning and insights will not impact its goals in a way that it doesn’t want, but that doesn’t seem likely by default—nobody seems to be working towards that end, except maybe MIRI. I sure wouldn’t know how to do that.
I think the argument might be misleading, in that local stability isn't that rare in practice
Surely this depends on the number of dimensions, with local stability being rarer the more dimensions you have. [Hence the argument that, in the infinite-dimensional limit, everything that would have been a “local minimum” is instead a saddle point.]
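One way to get a feel for that claim (my own sketch, using an i.i.d. Gaussian symmetric matrix as an admittedly crude stand-in for the Hessian at a "random" critical point): the probability that such a Hessian is positive definite, which is what it would take for the critical point to be a local minimum rather than a saddle, collapses quickly as the dimension grows.

```python
# Estimate how often a random symmetric "Hessian" is positive definite, i.e.
# how often a random critical point would be a genuine local minimum.
import numpy as np

rng = np.random.default_rng(0)
trials = 20_000
for d in (2, 3, 4, 5, 6, 8):
    hits = 0
    for _ in range(trials):
        a = rng.normal(size=(d, d))
        hessian = (a + a.T) / 2
        if np.linalg.eigvalsh(hessian)[0] > 0:   # smallest eigenvalue positive?
            hits += 1
    print(f"d={d}: estimated P(local min) = {hits / trials:.5f}")
```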
Maybe. What I was arguing was: just because all of the partial derivatives are 0 at a point doesn't mean it isn't a saddle point. You have to check every direction, not just the coordinate axes; in two dimensions, there are uncountably infinitely many directions.
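For a concrete toy case (mine, not the commenter's): f(x, y) = x*y has both partial derivatives equal to 0 at the origin and is completely flat along both coordinate axes, yet sweeping over directions shows it curves downward along some of them, so the origin is a saddle.

```python
# f(x, y) = x*y looks fine along the axes but is a saddle at the origin;
# only checking every direction reveals the downhill curvature.
import numpy as np

f = lambda x, y: x * y
h = 1e-3
thetas = np.linspace(0.0, np.pi, 180, endpoint=False)

# Finite-difference second derivative along each direction through (0, 0)
# (f(0, 0) = 0, so the middle term of the usual stencil drops out).
curvature = [(f(h * np.cos(t), h * np.sin(t)) + f(-h * np.cos(t), -h * np.sin(t))) / h**2
             for t in thetas]
print(f"curvature along the x- and y-axes: {curvature[0]:.2f}, {curvature[90]:.2f}")
print(f"minimum curvature over all directions: {min(curvature):.2f}  (negative => saddle)")
```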
Thus, I can prove to you that we are extremely unlikely to ever encounter a valley in real life:
A valley must have a lowest point A.
For A to be a local minimum, all of its directional derivatives must be 0:
Direction N (north), AND
Direction NE (north-east), AND
Direction NNE, AND
Direction NNNE, AND
...
This doesn’t work because the directional derivatives aren’t probabilistically independent in real life; you have to condition on the underlying geological processes, instead of supposing you’re randomly drawing a topographic function from ℝ² to ℝ.
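As a sanity check on that point (my own toy model, not anything from the thread): if you generate a landscape from a smooth underlying process, instead of treating each directional slope as an independent draw, valley bottoms show up all over the place.

```python
# A landscape produced by a smooth "geological" process (a few random plane
# waves) has plenty of local minima, even though treating every directional
# condition as an independent draw would predict essentially none.
import numpy as np

rng = np.random.default_rng(0)
n = 200
xs = np.linspace(0, 20, n)
X, Y = np.meshgrid(xs, xs)

Z = np.zeros_like(X)
for _ in range(12):
    kx, ky = rng.normal(size=2)
    Z += rng.normal() * np.sin(kx * X + ky * Y + rng.uniform(0, 2 * np.pi))

# Count interior grid points that are lower than all 8 neighbours (valley bottoms).
inner = Z[1:-1, 1:-1]
neighbours = [Z[1 + di:n - 1 + di, 1 + dj:n - 1 + dj]
              for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
is_min = np.all([inner < nb for nb in neighbours], axis=0)
print(f"local minima found on a {n}x{n} grid: {is_min.sum()}")
```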
For the corrigibility argument to go through, I claim we need to consider more information about corrigibility in particular.
I guess my issue is that corrigibility is an exogenous specification; you’re not just saying “the algorithm goes to a fixed point” but rather “the algorithm goes to this particular pre-specified point, and it is a fixed point”. If I pick a longitude and latitude with a random number generator, it’s unlikely to be the bottom of a valley. Or maybe this analogy is not helpful and we should just be talking about corrigibility directly :-P