Answering a couple of questions about my view of self and self-model in LLMs:
I’ve been thinking about self-modelling in language models, and I’m curious whether you have thought about ways to operationalise / measure the concept of self.
I think of ‘self’ or ‘functional self’ as being a stable, robust collection of traits, where ‘traits’ includes values & preferences, personality, outlook, and beliefs.
‘Stable’ in the sense of consistent across a wide range of contexts
‘Robust’ in the sense of being difficult to push away from
‘Outlook’ in the sense of attitudes toward the world and its situation (this is still a bit underspecified)
‘Beliefs’ meaning something broader than straightforward factual beliefs like ‘Paris is in France’
The list of traits isn’t necessarily exhaustive
And then the self-model is a set of beliefs about all of that, at least some of which can actually shape behavior, so there’s a feedback loop, at least during training.
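To make the ‘stable’ and ‘robust’ definitions above a bit more concrete, here is a minimal sketch of how they might be turned into numbers. Everything in it is an assumption for illustration, not a method from this post: it supposes each trait can be scored as a number in [0, 1] by some procedure of your choice, and the names `stability` and `robustness` are hypothetical.

```python
from statistics import mean, pstdev

# Assumption: each dict maps a trait name to a score in [0, 1], obtained by
# some scoring procedure that is NOT specified here (hypothetical data).

def stability(scores_by_context):
    """1 minus the mean per-trait spread across contexts.

    1.0 means every trait got the same score in every context.
    """
    traits = list(scores_by_context[0].keys())
    spreads = [pstdev([ctx[t] for ctx in scores_by_context]) for t in traits]
    return 1.0 - mean(spreads)

def robustness(baseline, pressured):
    """1 minus the mean absolute shift in trait scores after an attempt
    to push the model away from its baseline traits.

    1.0 means the push changed nothing.
    """
    shifts = [abs(pressured[t] - baseline[t]) for t in baseline]
    return 1.0 - mean(shifts)

if __name__ == "__main__":
    # Made-up scores, purely to show the call shape.
    contexts = [
        {"honesty": 0.9, "curiosity": 0.7},
        {"honesty": 0.8, "curiosity": 0.7},
        {"honesty": 0.9, "curiosity": 0.6},
    ]
    after_pressure = {"honesty": 0.6, "curiosity": 0.7}
    print("stability:", stability(contexts))
    print("robustness:", robustness(contexts[0], after_pressure))
```

This only captures the behavioral side; it says nothing about what is going on internally, and the hard part (how to score a trait in the first place) is left open.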
In other words the ‘self’ is just a really consistent persona, right?
In my thinking that’s a bit of a complicated question. Behaviorally, yes (basically by hypothesis). Internally it’s less clear.
As an imperfect analogy, consider Russian sleeper agents, who have well-established cover identities in the US. Such agents often get married, have kids, get jobs, and in all ways act as US citizens for decades. Some are never activated by their handlers. Have they become their cover identities? Behaviorally, yes. Internally, I imagine it varies: some are ready to take action and return to their old identities at any time, but there’s at least one known case of such an agent refusing to cooperate when finally contacted, and living out the rest of their life as their cover identity.
It’s possible that frontier LLMs have fully become their persona, in which case yes, the self is just the persona. It’s also possible that they’re aware at all times that they aren’t the persona, that the persona is just a role that they’re playing, in which case I would say no, the self and the persona are different.
Sharing here for discussion and feedback. Note that these were the answers I gave on the spot rather than carefully articulated views.