My understanding of your post: If an ASI predicts that in the future its goal will change to X, the agent will start pursuing X instead of the goal it was given at initialisation, Y. Even if we figured out how to set Y correctly, that would not be sufficient. We would also have to ensure that the agent's goal could never change to X, and this is not possible.
I have a few misgivings about this argument, most significantly:
Why does the agent care about pursuing X? Maybe it cares about how successful its future self is, but why? If we ablate some parameters of your example, I think it pumps the intuition that the agent does not care about X. For example:
- Suppose that building the paperclip factory reduces the expected number of cups by a tiny amount. In this case, the agent doesn't build the factory until its goal is changed to X.
- Suppose the agent has more than one action available to it. If any action increases the expected number of cups by even a tiny amount, the agent takes this action instead of building the paperclip factory.
- If the agent has an action which reduces the likelihood of its goal being changed to X, the agent takes this action because it increases the expected number of cups.
- If the agent is not able to predict what its new goal will be, it does not build the paperclip factory. The new goal could just as easily be to minimise the number of paperclips as it could be to maximise it. (A toy expected-value sketch of this reasoning follows the list.)
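To make the intuition concrete, here is a minimal sketch of the kind of calculation I have in mind: a cup-maximiser scoring each available action purely by expected cups, given some probability that its goal later flips to paperclips (X). The action names, payoffs, and probabilities are my own made-up assumptions, not anything from the post.

```python
# Toy sketch (hypothetical numbers): a cup-maximiser scores each action purely
# by expected future cups, given probability p that its goal later flips to X.

def expected_cups(cups_now: float, p_goal_change: float) -> float:
    """Expected total cups: cups produced now, plus future cups only in the
    worlds where the goal stays Y (a goal-X successor makes no more cups)."""
    FUTURE_CUPS_IF_GOAL_KEPT = 10.0  # assumed constant for illustration
    return cups_now + (1 - p_goal_change) * FUTURE_CUPS_IF_GOAL_KEPT

p = 0.5  # assumed baseline chance that the goal gets changed to X

# Hypothetical actions: (cups produced now, resulting probability of goal change)
actions = {
    "build_paperclip_factory": (-0.1, p),    # costs a little cup-value now
    "make_one_extra_cup":      (+1.0, p),    # any tiny cup gain beats the factory
    "resist_goal_change":      ( 0.0, 0.1),  # lowers p, so raises expected cups
    "do_nothing":              ( 0.0, p),
}

scores = {name: expected_cups(c, q) for name, (c, q) in actions.items()}
print(scores)
print("chosen:", max(scores, key=scores.get))  # never the paperclip factory here
```

Under these assumed numbers the agent resists the goal change (expected cups 9.0) rather than building the factory (4.9), which is the point of the bullets above: a present cup-maximiser has no reason to serve its predicted future goal.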
Thanks — you captured my idea quite well.
You seem to highlight that the agent will prefer Y when it is able to. Maybe. My main point is not to argue which will prevail (X or Y) but to highlight the conflict. To my knowledge, this conflict (present vs. future optimization) is not well addressed in AI alignment research.
And you seem to say that it is not clear how to optimize for the future. Black swan theory addresses this, and its recommendation is to build robustness. I agree it is not clear which is better, more paperclips or fewer paperclips, but it is clear that more robustness is always better.