One could make a similar argument for corrigibility: ambitious value learning would respect our desire for it to behave corrigibly if we actually wanted that, and if we didn’t want that, why impose it?
There’s a disanalogy in that autonomy is probably a terminal value whereas corrigibility is only an instrumental one. In other words, I don’t want a corrigible AI for the sake of having a corrigible AI, I want one so it will help me reach my other goals. I do (probably) want autonomy, and not only because it would help me reach other goals. So in fact ambitious value learning will not learn to behave corrigibly, I think, because the AI will probably think it has a better way of giving me what I ultimately want.
Oh, I think I see a different way of stating your argument that avoids this disanalogy: we’re not concerned about autonomy as a terminal value here, but as an instrumental one like corrigibility. If ambitious value learning works perfectly, then it would learn autonomy as a terminal value, but we want to implement autonomy-respecting AI mainly because that would help us get what we want in case ambitious value learning fails to work perfectly.
I think I understand the basic idea and motivation now, and I’ll just point out that autonomy-respecting AI seems to share several problems with other non-goal-directed approaches to AI safety.