Does it have an incentive to manipulate the external world to somehow make episode 117 happen many times?
For any given world-model, episode 117 is just a string of actions on the input tape, and observations and rewards on the output tape (positions (m+1)*117 through (m+1)*118 − 1, if you care). In none of these world-models, under any of the actions it considers, does "episode 117 happen twice."
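The indexing above can be made concrete with a small sketch. The function name and the choice of m are illustrative, not from the original; the only assumption is that each episode occupies m+1 consecutive tape cells, so episode i spans positions (m+1)*i through (m+1)*(i+1) − 1:

```python
def episode_span(i, m):
    """Return the inclusive (start, end) tape positions of episode i,
    assuming each episode occupies m+1 consecutive cells."""
    start = (m + 1) * i
    end = (m + 1) * (i + 1) - 1
    return start, end

# Illustrative: with m = 9, episode 117 occupies positions 1170 through 1179.
print(episode_span(117, 9))  # -> (1170, 1179)
```

The point is that "episode 117" names a fixed, single slice of the tape; there is no second slice for it to occupy.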
Yes, episode 117 happens only once in the world model; and suppose the agent cares only about episode 117 in the “current execution”. The concern still holds: the agent might write a malign output that would result in additional invocations of itself in which episode 117 ends with the agent getting a high reward. Note that the agent does not care about the other executions of itself. The only purpose of the malign output is to increase the probability that the “current execution” is one that ends with the agent receiving a high reward.
Okay, so I think you could construct a world-model that reflects this sort of reasoning, where reward is associated with the reward provided to a randomly sampled instance of its algorithm in the world. But the "malign output that would result in additional invocations of itself" would require the operator to leave the room, so this has the same form as, for example, ν†. At this point, I think we're no longer considering anything that sounds like "episode 117 happening twice," but that's fine. Also, just a side-note: this world-model would get ruled out if the rewards/observations provided to the two separate instances ever diverge.
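The side-note about the world-model getting ruled out can be sketched as ordinary Bayesian elimination: any model that ever predicts a percept differing from what actually occurs has its posterior weight driven to zero. All names and the percept encoding here are illustrative assumptions, not from the original:

```python
def update_posterior(weights, predictions, actual):
    """Zero out world-models whose prediction disagrees with the actual
    percept, then renormalize the surviving posterior weights."""
    survivors = {m: w for m, w in weights.items() if predictions[m] == actual}
    total = sum(survivors.values())
    return {m: w / total for m, w in survivors.items()}

# Two hypothetical models: one says both instances see the same percept,
# one says they diverge. The moment the actual percepts diverge from a
# model's prediction, that model is eliminated.
weights = {"nu_same": 0.5, "nu_diverge": 0.5}
predictions = {"nu_same": ("obs_a", 1.0), "nu_diverge": ("obs_b", 0.0)}
posterior = update_posterior(weights, predictions, ("obs_a", 1.0))
print(posterior)  # -> {"nu_same": 1.0}
```

This is just the standard mechanism by which a deterministic world-model with any mispredicted percept drops out of the mixture; it is not claimed to be the exact update rule from the paper under discussion.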