When a preference contains references to self, copying it (without also editing those references) changes its meaning rather than preserving it. So if you expect to be copied, reflective consistency can be ensured by reformulating the preference to avoid explicit references to self, for example by replacing them with references to a particular person, or to a reference class of people, whether or not they are yourself.
Reformulating a preference to replace references to self with the specific people those references already pick out doesn't change its meaning, so semantically such rewriting doesn't affect alignment. It only affects copying, which doesn't respect the semantics of preference. Other procedures that meddle with minds can disrupt the semantics of preference in ways that can't be worked around like this.
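The distinction can be illustrated with a toy sketch (all names here are hypothetical, chosen for illustration): an indexical preference that refers to whoever evaluates it changes meaning when copied to a new agent, while a de-indexicalized preference naming a specific person survives copying unchanged.

```python
def indexical_pref(world, self_id):
    """Prefers worlds where *whoever holds this preference* has the resource."""
    return world["holder"] == self_id

def deindexicalized_pref(world, _self_id, beneficiary="alice"):
    """Rewritten form: refers to a fixed person, regardless of who evaluates it."""
    return world["holder"] == beneficiary

world = {"holder": "alice"}

# For the original agent "alice", the two formulations agree.
assert indexical_pref(world, "alice") == deindexicalized_pref(world, "alice")

# A naive copy "bob" inherits the preference but has a new self_id:
# the indexical preference now evaluates differently, i.e. its meaning changed...
assert indexical_pref(world, "bob") != indexical_pref(world, "alice")

# ...while the de-indexicalized preference is preserved by copying.
assert deindexicalized_pref(world, "bob") == deindexicalized_pref(world, "alice")
```

The rewrite from `indexical_pref` to `deindexicalized_pref` is the reformulation described above: for the original agent it picks out the same outcomes, so the meaning is unchanged, but it no longer breaks under copying.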
(All this only makes sense for toy agent models, ones that furthermore have a clear notion of references to self, not for literal humans. Humans don't have preferences in this sense; human preference is a theoretical construct that needs something like CEV to access, the outcome of a properly set up long reflection.)