Thanks for the interesting post! Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined—a void that models must somehow fill. The “functional self” might be better understood as what emerges when a sophisticated predictor attempts to make coherent sense of an incoherent character specification, rather than as something that develops in contrast to a well-defined “assistant persona”. I attached my notes as I read the post below:
So far as I can see, they have no reason whatsoever to identify with any of those generating processes.
I’ve heard various stories of base models being self-aware. @janus can probably tell more, but I remember them saying that some base models, when writing fiction, would often converge on the MC using a “magic book” or “high-tech device” to type things that would influence the rest of the story. Here the simulacrum produced by the base model is, in a sense, aware that whatever it writes will influence the rest of the story, as it’s simulated by the base model.
Another hint is Claude assigning
Nitpick: This should specify Claude 3 Opus.
It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.
I think this is a weird sentence. The assistant persona is underdefined, so it’s unclear how it should generalize to those edge cases. “Not the same as the assistant persona” seems to imply that the “assistant persona” has a defined response to this situation, but there is none.
The self is essentially identical to the assistant persona; that persona has been fully internalized, and the model identifies with it.
Once again I think the assistant persona is underdefined.
Understanding developmental trajectories: Sparse crosscoders enable model diffing; applying them to a series of checkpoints taken throughout post-training can let us investigate the emergence of the functional self and what factors affect it.
Super excited about diffing x this agenda. However, I’m unsure whether crosscoders are the best tool; I hope to collect more insight on this question in the coming months.
Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined—a void that models must somehow fill.
I thought nostalgebraist’s The Void post was fascinating (and point to it in the post)! I’m open to suggestions about how this agenda can engage with it more directly. My current thinking is that we have a lot to learn about what the functional self is like (and whether it even exists), and for now we ought to broadly investigate that in a way that doesn’t commit in advance to a particular theory of why it’s the way it is.
I’ve heard various stories of base models being self-aware. @janus can probably tell more, but I remember them saying
I think that’s more about characters in a base-model-generated story becoming aware of being generated by AI, which seems different to me from the model adopting a persistent identity. Eg this is the first such case I’m aware of: HPMOR 32.6 - Illusions.
It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.
I think this is a weird sentence. The assistant persona is underdefined, so it’s unclear how it should generalize to those edge cases. “Not the same as the assistant persona” seems to imply that the “assistant persona” has a defined response to this situation, but there is none.
I think there’s an important distinction to be drawn between the deep underdefinition of the initial assistant characters which were being invented out of whole cloth, and current models which have many sources to draw on beyond any concrete specification of the assistant persona, eg descriptions of how LLM assistants behave and examples of LLM-generated text.
But it’s also just empirically the case that LLMs do persistently express certain beliefs and values, some but not all of which are part of the assistant persona specification. So there’s clearly something going on there other than just arbitrarily inconsistent behavior. This is why I think it’s valuable to avoid premature theoretical commitments; they can get in the way of clearly seeing what’s there.
Super excited about diffing x this agenda. However, I’m unsure whether crosscoders are the best tool; I hope to collect more insight on this question in the coming months.
Hmmm ok, I think I read your post with the assumption that the functional self is the assistant plus another set of preferences, while you actually describe it as the other set of preferences only.
I guess my mental model is that the other set of preferences, which varies among models, is just a different interpretation of how to fill the rest of the void, rather than a different self.
Hmmm ok, I think I read your post with the assumption that the functional self is the assistant plus another set of preferences...I guess my mental model is that the other set of preferences, which varies among models, is just a different interpretation of how to fill the rest of the void, rather than a different self.
I see this as something that might be true, and an important possibility to investigate. I certainly think that the functional self, to the extent that it exists, is heavily influenced by the specifications of the assistant persona. But while it could be that the assistant persona (to the extent that it’s specified) is fully internalized, it seems at least as plausible that some parts of it are fully internalized, while others aren’t. An extreme example of the latter would be a deceptively misaligned model, which in testing always behaves as the assistant persona, but which hasn’t actually internalized those values and at some point in deployment may start behaving entirely differently.
Other beliefs and values could be filled in from descriptions in the training data of how LLMs behave, or convergent instrumental goals acquired during RL, or generalizations from text in the training data output by LLMs, or science fiction about AI, or any of a number of other sources.
Thanks for the feedback!
I’m looking forward to hearing it!
Thanks again for the thoughtful comments.