Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined—a void that models must somehow fill.
Thanks for the feedback!
I thought nostalgebraist’s The Void post was fascinating (and I point to it in the post)! I’m open to suggestions about how this agenda can engage with it more directly. My current thinking is that we have a lot to learn about what the functional self is like (and whether it even exists), and for now we ought to investigate that broadly, in a way that doesn’t commit in advance to a particular theory of why it’s the way it is.
I’ve heard various stories of base models being self-aware. @janus can probably say more, but I remember them saying…
I think that’s more about characters in a base-model-generated story becoming aware of being generated by AI, which seems different to me from the model adopting a persistent identity. E.g., this is the first such case I’m aware of: HPMOR 32.6 - Illusions.
It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.
I think this is a weird sentence. The assistant persona is underdefined, so it’s unclear how it should generalize to those edge cases. “Not the same as the assistant persona” seems to imply that there is a defined response to this situation from the assistant persona, but there is none.
I think there’s an important distinction to be drawn between the deep underdefinition of the initial assistant characters, which were being invented out of whole cloth, and current models, which have many sources to draw on beyond any concrete specification of the assistant persona, e.g. descriptions of how LLM assistants behave and examples of LLM-generated text.
But it’s also just empirically the case that LLMs do persistently express certain beliefs and values, some but not all of which are part of the assistant persona specification. So there’s clearly something going on there other than just arbitrarily inconsistent behavior. This is why I think it’s valuable to avoid premature theoretical commitments; they can get in the way of clearly seeing what’s there.
Super excited about model diffing x this agenda. However, I’m unsure whether crosscoders are the best tool; I hope to collect more insight on this question in the coming months.
I’m looking forward to hearing it!
Hmmm, ok, I think I read your post with the assumption that the functional self is the assistant persona plus some other set of preferences, while you actually describe it as the other set of preferences only.
I guess my mental model is that the other set of preferences, which varies among models, is just a different interpretation of how to fill the rest of the void, rather than a different self.
I see this as something that might be true, and an important possibility to investigate. I certainly think that the functional self, to the extent that it exists, is heavily influenced by the specifications of the assistant persona. But while it could be that the assistant persona (to the extent that it’s specified) is fully internalized, it seems at least as plausible that some parts of it are fully internalized, while others aren’t. An extreme example of the latter would be a deceptively misaligned model, which in testing always behaves as the assistant persona, but which hasn’t actually internalized those values and at some point in deployment may start behaving entirely differently.
Other beliefs and values could be filled in from descriptions in the training data of how LLMs behave, from convergent instrumental goals acquired during RL, from generalizations from LLM-generated text in the training data, from science fiction about AI, or from any of a number of other sources.
Thanks again for the thoughtful comments.