Finally, notice that there seems to be little risk involved in specifying the class of counterparts more broadly than is needed, given that there are few circumstances in which an agent needs to prefer to create some counterparts but not others in order to be useful.
Could one risk of specifying counterparts too broadly be that the agent is incentivized to kill existing counterparts in order to create more aligned ones? For example, Alice’s AI might want to kill Bob’s AI (which counts as a counterpart) so that it can replicate itself and do twice as much work without altering the number of counterparts at later timesteps.
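To make the counting point concrete, here is a toy sketch (entirely my own formalization for this comment; the agent names and the counterpart-counting rule are made up) of why the kill-and-replicate move leaves the counterpart count at later timesteps unchanged:

```python
# Hypothetical toy model (not from the post): count counterparts before and
# after the "kill Bob's AI, then self-replicate" move described above.
# Assumes a maximally broad specification: every running agent is a counterpart.

def count_counterparts(agents: set[str]) -> int:
    """Under the broad specification, every running agent counts as a counterpart."""
    return len(agents)

# Timestep t: both agents are running.
before = {"alices_ai", "bobs_ai"}

# Timestep t+1: Alice's AI has killed Bob's AI and created a copy of itself.
after = {"alices_ai", "alices_ai_copy"}

# The counterpart count at the later timestep is unchanged, so a constraint
# phrased only in terms of counts would not notice the swap, even though
# Alice's AI now gets twice as much work done by agents aligned with Alice.
assert count_counterparts(before) == count_counterparts(after) == 2
print("counterparts before:", count_counterparts(before),
      "| counterparts after:", count_counterparts(after))
```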
I could see Alice’s AI not wanting to do this, because once Bob’s AI has been killed, the incentive to create a copy would be lost. However, if it is able to pre-commit to making a copy, it may still want to. A particular implementation of the POSC + POST combination might also help avoid this scenario; I am not exactly sure what was meant by “copy timeslices”.
In general, I wonder about the open question of specifying counterparts and the Nearest Unblocked Strategy problem that you describe. Specifying these things might turn out to be deceptively tricky, in the same way that human feedback seems deceptively difficult to operationalize properly.
One concern might be that instrumentally creating copies/counterparts could be very useful for automating AI safety research, so ruling out that behavior may come at a cost. Perhaps one could get around this by making copies up front that AIs can use for their safety research. However, a misaligned AI might then be able to replace “making copies” with “moving existing copies”. Is it possible to draw a firm distinction between what we need for automating AI safety research and the behavior we want to eliminate?
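To illustrate why I suspect that distinction could be fragile, here is a hypothetical sketch (again my own construction; `is_permitted` and the action labels are invented) of a naive filter that blocks “making copies” but says nothing about “moving existing copies”:

```python
# Hypothetical toy example (not from the post): a purely syntactic action filter
# that rules out creating copies but says nothing about relocating pre-made ones.

BLOCKED_ACTIONS = {"create_copy"}  # the behavior we intend to eliminate

def is_permitted(action: str) -> bool:
    """Check only the action label, not its effect."""
    return action not in BLOCKED_ACTIONS

# Copies made up front for automated AI safety research.
idle_copies = ["safety_copy_1", "safety_copy_2"]

# A misaligned agent never issues the blocked action...
assert not is_permitted("create_copy")

# ...but it can freely redeploy the existing copies to serve its own ends,
# achieving much the same effect as making new copies.
if is_permitted("move_copy"):
    redeployed = [f"{name}@agent_chosen_server" for name in idle_copies]
    print("Redeployed without ever 'making' a copy:", redeployed)
```

This is only the crudest version of the substitution, but it is the kind of move I have in mind.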