If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
My concern is that since CDT is not reflectively stable, it may have incentives to create non-CDT agents in order to fulfill instrumental goals.
If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
With the correction that it is updateless and CDT (see here), I agree with the rest of this.