This is more of a problem if we’re using the model’s internal representation, not just its predictions.
But you aren’t directly using the model’s internal representation, are you? You are using it only to make predictions about the human’s preferences in some novel domain (e.g. over the consequences of novel kinds of plans).
It seems like it would be cleaner to discuss the whole thing in the context of transfer to a new domain, rather than talking about directly using the learned representation, unless I am missing some advantage of this framing.
Are you hoping to do transfer learning for human preferences in a way that depends on having a detailed understanding of those preferences (e.g. that depends in particular on a detailed understanding of the human preference for autonomy)? I would be very surprised by that. It seems like if you succeed you must be able to robustly transfer lots of human judgments to unfamiliar situations. And for that kind of solution, it’s not clear how an understanding of particular aspects of human preferences really helps.
It seems like it would be cleaner to discuss the whole thing in the context of transfer to a new domain, rather than talking about directly using the learned representation, unless I am missing some advantage of this framing.
I agree with this. Problems with learning this preference should cause the system to make bad predictions (I think I was confused when I wrote that this problem only shows up with internal representations). Now that I think about it, it seems like you’re right that a system that correctly learns abstract human preferences would also learn the preference for autonomy. So this is really a special case of zero-shot transfer learning of abstract preferences. My main motivation for specifically studying the preference for autonomy is that maybe you can turn a simple version of it into a model for corrigibility.
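To make the zero-shot-transfer picture concrete, here is a toy sketch: a Bradley-Terry-style preference model is fit from pairwise comparisons in one domain, over abstract features that happen to include "autonomy preserved", and is then queried on plans from an unfamiliar domain. Everything here (the feature names, the data, the logistic model) is an illustrative assumption on my part, not a proposal for how the actual system would work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Abstract plan features (all made up): [task_value, cost, autonomy_preserved]
TRUE_WEIGHTS = np.array([1.0, -0.5, 2.0])   # the simulated human weights autonomy heavily

def utility(features, w):
    return features @ w

# Training domain: plans described by their abstract features.
train_plans = rng.uniform(0.0, 1.0, size=(200, 3))

# Pairwise comparisons labelled by the simulated human's true utility.
diffs = train_plans[0::2] - train_plans[1::2]          # one feature-difference row per pair
labels = (diffs @ TRUE_WEIGHTS > 0).astype(float)

# Fit a Bradley-Terry / logistic model on the feature differences by gradient ascent.
w = np.zeros(3)
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-diffs @ w))
    w += 0.1 * diffs.T @ (labels - p) / len(labels)

# Novel domain: plans never seen in training. If they can be mapped onto the same
# abstract features, the learned preference (including the autonomy weight) transfers
# with no new labels, and a failure to learn it shows up as a bad prediction here.
novel_plans = np.array([
    [0.9, 0.3, 0.1],   # high task value, but overrides the human's decisions
    [0.7, 0.3, 0.9],   # somewhat less value, preserves autonomy
])
print("learned weights:", np.round(w, 2))
better = 0 if utility(novel_plans[0], w) > utility(novel_plans[1], w) else 1
print("predicted human preference: plan", better)
```

The point of the sketch is only that the autonomy preference is not a separate moving part: if the abstract preferences transfer at all, that one comes along with them, and if it was learned badly the error is visible as a wrong prediction rather than hidden in an internal representation.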
Are you hoping to do transfer learning for human preferences in a way that depends on having a detailed understanding of those preferences (e.g. that depends in particular on a detailed understanding of the human preference for autonomy)?
I think I mostly want some story for why the preference for autonomy is even in the model’s hypothesis space. It seems that if we’re already confident that the system can learn abstract preferences, then we could also be confident that the system can learn the preference for autonomy; but maybe it’s more of a problem if we aren’t confident of this (e.g. the system is only supposed to learn and optimize for fairly concrete preferences).
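As a toy illustration of the hypothesis-space worry (again with made-up features and data): fit the same comparison data with two model classes, one restricted to concrete features only and one that also includes an abstract autonomy feature. The restricted class contains no hypothesis that matches the human, so its errors concentrate on exactly the comparisons where autonomy is decisive.

```python
import numpy as np

rng = np.random.default_rng(1)
TRUE_WEIGHTS = np.array([1.0, -0.5, 2.0])    # [task_value, cost, autonomy_preserved]

def fit_preference(diffs, labels, dims):
    """Logistic fit on feature differences, restricted to the given feature indices."""
    w = np.zeros(len(dims))
    X = diffs[:, dims]
    for _ in range(5000):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w += 0.1 * X.T @ (labels - p) / len(labels)
    full = np.zeros(diffs.shape[1])
    full[dims] = w
    return full

# Comparisons labelled by the simulated human's true preference.
plans = rng.uniform(0.0, 1.0, size=(400, 3))
diffs = plans[0::2] - plans[1::2]
labels = (diffs @ TRUE_WEIGHTS > 0).astype(float)

w_concrete = fit_preference(diffs, labels, dims=[0, 1])     # autonomy not in the hypothesis space
w_abstract = fit_preference(diffs, labels, dims=[0, 1, 2])  # autonomy included

# Held-out comparisons, scored only where the autonomy difference is large enough to matter.
test = rng.uniform(0.0, 1.0, size=(400, 3))
test_diffs = test[0::2] - test[1::2]
test_labels = test_diffs @ TRUE_WEIGHTS > 0
decisive = np.abs(test_diffs[:, 2]) > 0.5

def accuracy(w):
    return np.mean((test_diffs[decisive] @ w > 0) == test_labels[decisive])

print("accuracy on autonomy-decisive cases, concrete-only model:", accuracy(w_concrete))
print("accuracy on autonomy-decisive cases, autonomy included: ", accuracy(w_abstract))
```

This is just the "fairly concrete preferences" case in miniature: if the system's representable preferences exclude the abstract feature, no amount of comparison data recovers the preference for autonomy, which is why some story about the hypothesis space seems worth having.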