This seems like it is not about the “motivational system”, and if this were implemented in a robot that does have a separate “motivational system” (i.e. it is goal-directed), I worry about a nearest unblocked strategy.
I am confused about where you think the motivation system comes into my statement. It sounds like you are imagining that what I said is a constraint, which could somehow be coupled with a separate motivation system. If that’s your interpretation, that’s not what I meant at all, unless random sampling counts as a motivation system. I’m saying that all you do is sample from what’s consented to.
But, maybe what you are saying is that in “the intersection of what the user expects and what the user wants”, the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system). If that’s what you meant, I think that’s a valid concern. What I was imagining is that you are trying to infer “what the user wants” not in terms of end goals, but rather in terms of actions (really, policies) for the AI. So, it is more like an approval-directed agent to an extent. If the human says “get me groceries”, the job of the AI is not to infer the end state the human is asking the robot to optimize for, but rather, to infer the set of policies which the human is trying to point at.
There’s no optimization on top of this that searches for perverse instantiations of the constraints; the AI just follows the policy which it infers the human would like. Of course, the powerful learning system required for this to work may perversely instantiate these beliefs (i.e., there may be daemons, aka inner optimizers).
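To make the contrast concrete, here is a toy sketch of the "sample from what's consented to" idea, under my own assumptions (all names and thresholds are hypothetical, not from the discussion): the AI scores candidate policies by how likely the user is to expect them and to want them, then samples from the intersection, with no argmax over end states anywhere.

```python
import random

def sample_consented_policy(candidate_policies, p_expected, p_wanted,
                            threshold=0.5, rng=random):
    """Sample uniformly from policies that are both expected and wanted.

    p_expected / p_wanted map a policy to the inferred probability that
    the user expects / wants it. Note there is no optimization over
    outcomes: every policy clearing both thresholds is equally eligible.
    """
    consented = [pi for pi in candidate_policies
                 if p_expected(pi) >= threshold and p_wanted(pi) >= threshold]
    if not consented:
        return None  # refuse to act rather than fall back to optimizing
    return rng.choice(consented)

# Hypothetical usage: "get me groceries" -> infer which policies the
# user is pointing at, not which end state to optimize for.
policies = ["walk_to_store", "order_delivery", "build_nanotech_factory"]
p_expected = {"walk_to_store": 0.9, "order_delivery": 0.8,
              "build_nanotech_factory": 0.0}.get
p_wanted = {"walk_to_store": 0.9, "order_delivery": 0.7,
            "build_nanotech_factory": 0.3}.get
chosen = sample_consented_policy(policies, p_expected, p_wanted)
# The nanotech policy is excluded because the user does not expect it,
# even if some model of "what the user wants" assigns it nonzero value.
```

The design point is that the filtering and the sampling are the whole decision procedure; any perverse instantiation has to come from the learned estimators themselves (the inner-optimizer worry), not from a search on top of them.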
(The most obvious problem I see with this approach is that it seems to imply that the AI can’t help the human do anything which the human doesn’t already know how to do. For example, if you don’t know how to get started filing your taxes, then the robot can’t help you. But maybe there’s some way to differentiate between more benign cases like that and less benign cases like using nanotechnology to more effectively get groceries?)
A third interpretation of your concern is that you’re saying that if the thing is doing well enough to get groceries, there has to be powerful optimization somewhere, and wherever it is, it’s going to be pushing toward perverse instantiations one way or another. I don’t have any argument against this concern, but I think it mostly amounts to a concern about inner optimizers.
(I feel compelled to mention again that I don’t feel strongly that the whole idea makes any sense. I just want to convey why I don’t think it’s about constraining an underlying motivation system.)
> But, maybe what you are saying is that in “the intersection of what the user expects and what the user wants”, the first is functioning as a constraint, and the second is functioning as a motivation system (basically the usual IRL motivation system).
This is basically what I meant. Thanks for clarifying that you meant something else.
> The most obvious problem I see with this approach is that it seems to imply that the AI can’t help the human do anything which the human doesn’t already know how to do.
Yeah, this is my concern with the thing you actually meant. (It’s also why I incorrectly assumed that “what the user wants” referred to goal-directed optimization, rather than to policies the user approves of.) It could work combined with something like amplification, where you get to assume that the overseer is smarter than the agent, but then it’s not clear whether the part about “what the user expects” buys you anything over the “what the user wants” part.
This does seem like a concern, but it wasn’t the one I was thinking about. It also seems like a concern about basically any existing proposal. Usually when talking about concerns I don’t bring up the ones that are always concerns, unless someone explicitly claims that their solution obviates that concern.