If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
If I understand correctly, it’s actually updateless within an episode, and that’s the only thing it cares about so I don’t see how it would not be reflectively stable. Plus, even if it had an incentive to create a non-CDT agent, it would have to do that by outputting some message to the operator, and the operator wouldn’t have the ability to create a non-CDT agent without leaving the room which would end the episode. (I guess it could hack the operator’s mind and create a non-CDT agent within, but at that point it might as well just make the operator give it max rewards.)
With the correction that it is updateless and CDT (see here), I agree with the rest of this.