I second Petr’s comment: Your definition relates to myopic agents. Consider two utility functions for a paperclip-maximizer:
Myopic paperclip-maximizer: Utility is the number of paperclips in existence right now
Paperclip-maximizer: Utility is the number of paperclips that will eventually exist
A myopic paperclip-maximizer will suffer from the timing problem you described: When faced with an action that produces more paperclips right now but also changes its utility function, the myopic maximizer will take that action.
The standard paperclip-maximizer will not. It considers not just the actions it can take right now but all actions throughout the future, and, crucially, it evaluates those actions against its current goal, not against whatever utility function it would have at that later time.
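To make the contrast concrete, here is a minimal toy sketch in Python (my own construction, not anything from the paper): a plan is just a list of actions, and the only question is which utility function does the scoring, and over what horizon.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Action:
    clips_produced: int
    # An action may also install a new goal as a side effect.
    new_goal: Optional[Callable[[int], float]] = None

def evaluate_myopic(plan: list[Action], current_utility: Callable[[int], float]) -> float:
    # Only the paperclips that exist right after the first action count.
    return current_utility(plan[0].clips_produced)

def evaluate_standard(plan: list[Action], current_utility: Callable[[int], float]) -> float:
    # The whole trajectory is scored with the goal the agent has *now*,
    # even if some action along the way would install a new goal.
    return current_utility(sum(a.clips_produced for a in plan))
```

The `new_goal` field never enters `evaluate_standard`, which is the point: a goal-changing action is judged only by how many paperclips the resulting future contains, under the current goal.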
I would add two things.
First, the myopia has to be really extreme. If the agent planned at least two steps ahead, it would be incentivized to keep its current goal. Changing the goal in the first step could make it take a bad second step.[1]
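For concreteness, a rough sketch of that two-step point (again my own toy framing; the function names are just illustrative):

```python
from typing import Callable

def value_of_goal_change(step2_clip_counts: list[int],
                         current_goal: Callable[[int], float],
                         new_goal: Callable[[int], float]) -> float:
    # The future self chooses the second step to maximize the new goal...
    chosen = max(step2_clip_counts, key=new_goal)
    # ...but the present self judges that choice by its current goal.
    return current_goal(chosen)

def value_of_keeping_goal(step2_clip_counts: list[int],
                          current_goal: Callable[[int], float]) -> float:
    # With the goal kept, both steps are chosen and judged by the same goal.
    chosen = max(step2_clip_counts, key=current_goal)
    return current_goal(chosen)
```

Whenever the new goal would pick a second step that the current goal rates lower, `value_of_goal_change` comes out below `value_of_keeping_goal`, so a two-step planner prefers to keep its goal.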
Second, the original argument is about the could, not the would: the possibility of changing the goal, not the necessity. In practice, I would assume a myopic AI would not be very capable, so self-modification and goal changes would be far beyond its abilities.
There is an exception to the first point: if the new goal still leads the agent to take an optimal second-step action, the agent can switch to the new goal.
For example, if the paperclip maximizer has no materials (and due to its myopia can’t really plan to obtain any), it can change its goal while it’s idling because all actions make zero paperclips.
A more sophisticated example: Suppose the goal is “make paperclips and don’t kill anyone.” (Framed as a utility function: the number of paperclips minus the number of people killed times a very large number.) Suppose an optimal two-step plan is: 1. obtain materials, 2. make paperclips. Now suppose that, in the first step, the agent also changes its goal to just making paperclips. As long as there is no possible second-step action that makes more paperclips by killing people, the agent will take the same action in the second step even with the changed goal, so changing the goal in the first step is also an optimal action.
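The same toy framing makes that condition checkable (again my own sketch; the outcome dicts and the goal lambdas below are just illustrative assumptions):

```python
from typing import Callable

# A second-step outcome is a dict like {"clips": 3, "kills": 0};
# a goal maps an outcome to a score.
Outcome = dict
Goal = Callable[[Outcome], float]

def goal_change_is_harmless(outcomes: list[Outcome], current: Goal, new: Goal) -> bool:
    # Switching goals in step one costs nothing exactly when the outcome the
    # new goal would pick is already optimal under the current goal.
    best_under_current = max(current(o) for o in outcomes)
    picked_by_new = max(outcomes, key=new)
    return current(picked_by_new) == best_under_current

# The idle maximizer: every available outcome makes zero clips, so any goal's
# pick is as good as any other and the switch is harmless. The "don't kill
# anyone" example works the same way as long as no outcome trades kills for clips.
current = lambda o: o["clips"] - 1e9 * o["kills"]   # make clips, don't kill
new = lambda o: o["clips"]                          # just make clips
outcomes = [{"clips": 0, "kills": 0}, {"clips": 5, "kills": 0}]
assert goal_change_is_harmless(outcomes, current, new)
```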
The timing problem is not a problem for agents. It’s a problem for the claim that goal preservation is instrumentally required for rational agents. The timing problem doesn’t force agents to take any particular decision. The argument is that it’s not instrumentally irrational for a rational agent to abandon its goal. It isn’t about any specific utility functions, and it isn’t a prediction about what an agent will do.
The timing problem is a problem for how well we can predict the actions of myopic agents: Any agent with a myopic utility function has no instrumentally convergent reason for goal preservation.
Have you read the paper?
I did read two-thirds of the paper, and I tried my best to understand it, but apparently I failed.
The reason I suspect you haven’t is that whether an agent is “myopic” or not is irrelevant to the argument. Where we may disagree is over the nature of goal having, as Seth Herd pointed out. If you want to find a challenge to the argument, that’s the place to look.
It is possible that we also disagree on the nature of goal having. I reserve the right to find my own places to challenge your argument.
Ha, yes, fair enough