I would add two things.
First, the myopia has to be really extreme: the agent must plan only a single step ahead. If it planned at least two steps ahead, it would be incentivized to keep its current goal, because changing the goal in the first step could make it take a bad second step.[1]
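To make that incentive concrete, here is a minimal toy sketch (the action values, the penalty weight, and all names are hypothetical, invented just for illustration): a planner that scores whole two-step plans under its current goal prefers to keep that goal, because switching would let the second step be chosen by a goal it does not endorse.

```python
# Toy model: second-step actions as (paperclips made, people killed).
# The "kill_for_clips" action is assumed purely for illustration.
ACTIONS = {"make_clips": (10, 0), "kill_for_clips": (15, 1), "idle": (0, 0)}

def safe_goal(clips, kills):       # current goal: clips minus a huge kill penalty
    return clips - 1_000_000 * kills

def clip_only_goal(clips, kills):  # candidate new goal: clips, ignoring kills
    return clips

def best_action(goal):
    # Whichever goal is active in step two picks the second-step action.
    return max(ACTIONS, key=lambda a: goal(*ACTIONS[a]))

# First-step options: keep the goal, or switch to the clip-only goal.
# Either way, the resulting two-step plan is evaluated under the goal held NOW.
for label, step2_goal in [("keep goal", safe_goal), ("change goal", clip_only_goal)]:
    a = best_action(step2_goal)
    print(label, "-> step two:", a, "| value under current goal:", safe_goal(*ACTIONS[a]))
# "keep goal" scores 10; "change goal" scores 15 - 1,000,000. The planner keeps its goal.
```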
Second, the original argument is about the could, not the would: the possibility of changing the goal, not the necessity of doing so. In practice, I would expect a myopic AI not to be very capable, so self-modification and goal changes would be far beyond its abilities.
There is an exception to the first point: if the new goal still makes the agent take an optimal action in the second step, it can switch to it at no cost.
For example, if the paperclip maximizer has no materials (and, due to its myopia, can’t really plan to obtain any), it can change its goal while it’s idling, because every available action makes zero paperclips.
A more sophisticated example: suppose the goal is “make paperclips and don’t kill anyone.” (Framed as a utility function: the number of paperclips minus the number of people killed times a very large number.) Suppose an optimal two-step plan is: 1. obtain materials, 2. make paperclips. Now suppose that, in the first step, the agent instead changes its goal to just making paperclips. As long as no possible second-step action makes more paperclips while killing people, the agent will take the same second-step action even under the changed goal. Changing the goal in the first step is therefore also an optimal action.
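Under the same hypothetical toy assumptions, a small sketch of this exception: when no second-step action trades kills for extra paperclips, both goals select the same action, so the goal change is neutral even when judged by the original goal.

```python
# Toy model: no second-step action makes more paperclips by killing.
ACTIONS = {"make_clips": (10, 0), "make_fewer_clips": (6, 0), "idle": (0, 0)}

def safe_goal(clips, kills):       # "make paperclips and don't kill anyone"
    return clips - 1_000_000 * kills

def clip_only_goal(clips, kills):  # "just make paperclips"
    return clips

def best_action(goal):
    return max(ACTIONS, key=lambda a: goal(*ACTIONS[a]))

old_choice = best_action(safe_goal)
new_choice = best_action(clip_only_goal)
print(old_choice, new_choice)  # both goals pick "make_clips"

# The second-step action is identical either way, so the change-of-goal plan
# achieves the same value under the original goal as the keep-the-goal plan.
assert safe_goal(*ACTIONS[old_choice]) == safe_goal(*ACTIONS[new_choice])
```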