Hmmm ok, I think I read your post with the assumption that the functional self is the assistant persona plus another set of preferences, while you actually describe it as the other set of preferences only.
I guess my mental model is that the other set of preferences that vary among models is just a different interpretation of how to fill the rest of the void, rather than a different self.
I see this as something that might be true, and an important possibility to investigate. I certainly think that the functional self, to the extent that it exists, is heavily influenced by the specifications of the assistant persona. But while it could be that the assistant persona (to the extent that it’s specified) is fully internalized, it seems at least as plausible that some parts of it are fully internalized, while others aren’t. An extreme example of the latter would be a deceptively misaligned model, which in testing always behaves as the assistant persona, but which hasn’t actually internalized those values and at some point in deployment may start behaving entirely differently.
Other beliefs and values could be filled in from descriptions in the training data of how LLMs behave, from convergent instrumental goals acquired during RL, from generalizations from text in the training data that was itself output by LLMs, from science fiction about AI, or from any of a number of other sources.