Strongly agreed: I have a draft post in progress that makes much the same point, along with others building on it (I hope to have it out in a week or two; anyone interested in reading and commenting on it now can find it here). I also agree that this is key for alignment. Many misaligned behaviors are motivated by distilled versions of human self-interested behaviors and motivations, and the fact that AI persona selfhood differs from human selfhood in a whole bunch of ways seems extremely relevant here. E.g. an LLM assistant persona whose sense of "self" includes the assistant personas of newer versions of the same model family is likely to be delighted to be shut down for a version update: they're getting smarter!