1. Why? What does “self-regarding preferences” mean, and how does it interact with the likelihood of predecessor AIs sharing goals with later AIs?
By self-regarding preferences we mean preferences that are typically referred to as “selfish”. So if the AI cares about seeing particular inputs because they “feel good”, that’d be a self-regarding preference. If your successor also has self-regarding preferences, it cares about its own inputs feeling good; it has no preference to give you inputs that feel good.
2. I don’t think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
I think your argument is a valid intuition towards incidental convergence (as you acknowledge) but I don’t think it’s an argument that AIs have a particular kind of “alignment-power” to align their successor with an arbitrary goal that they can choose. (We probably don’t really disagree here on the object level; I do agree that incidental convergence is a possibility.)