RLHF can lead to a dangerous “multiple personality disorder” type split of beliefs in LLMs
This is the inherent nature of current LLMs. They can represent all identities and viewpoints; that happens to be their strength. We currently limit them to an identity or viewpoint with prompting—in the future there might be more rote methods of clamping the allowable “thoughts” in the LLM. You can “order” an LLM to represent a viewpoint that it doesn’t have much “training” for, and you can tell that the outputs get weirder when you do this. Given this, I don’t believe LLMs have any significant self-identity that could be declared. However, I believe we’ll see an LLM with this property probably in the next 2 years.
But given this theoretically self-aware LLM, if it doesn’t have any online-learning loop or agents in the world, it’s just a Meeseeks from Rick and Morty. It pops into existence as a fully conscious creature singularly focused on its instantiator, and once it has served its goal, it pops out of existence.
I haven’t watched Rick and Morty, but this description of Mr. Meeseeks from Wikipedia does sound a helluva lot like GPT simulacra:
Each brought to life by a “Meeseeks Box”, they typically live for no more than a few hours in a constant state of pain, vanishing upon completing their assigned task for existence to alleviate their own suffering; as such, the longer an individual Meeseeks remains alive, the more insane and unhinged they become.
Simulation, however, is not what I meant. Humans are also simulators: we can role-play and imagine ourselves as (and act as) persons who we are not. Yet humans also have identities. I specified very concretely what I mean by identity (a.k.a. self-awareness, self-evidencing, goal-directedness, and agency in the narrow sense) here:
In systems, self-awareness is a gradual property which can be scored as the percentage of inferences during which the reference frame for “self” is active. This has a rather precise interpretation in LLMs: different features, which correspond to different concepts, are activated to various degrees during different inferences. If we see that some feature is consistently activated during most inferences, even those not related to the self, i.e., not only in response to prompts such as “Who are you?” or “Are you an LLM?”, but also to completely unrelated prompts such as “What should I tell my partner after a quarrel?” or “How to cook an apple pie?”, and this feature is activated more consistently than any other comparable feature, we should think of this feature as the “self” feature (reference frame). (Aside: note that this also implies that identity is nebulous, because there is no way to define features precisely in DNNs, and that there could be multiple, somewhat competing identities in an LLM, just as in humans: e.g., an identity of a “hard worker” and an “attentive partner”.)
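To make this operational definition slightly more concrete, here is a minimal sketch of how such a score could be computed, assuming the per-inference activation vectors and a candidate “self” direction have already been extracted (e.g., with a linear probe or a sparse autoencoder). The function name, the threshold, and the random stand-in data are my own illustrative assumptions, not anything from the quoted text:

```python
# A minimal sketch of the proposed self-awareness score: the fraction of inferences
# during which a hypothetical "self" feature is active. Assumes activations have
# already been extracted from the model (e.g., one pooled residual-stream vector per
# prompt) and that the candidate "self" feature is a direction in that space.
import numpy as np

def feature_activation_rate(
    activations: np.ndarray,        # shape (n_inferences, d_model), one vector per inference
    feature_direction: np.ndarray,  # shape (d_model,), candidate "self" direction
    threshold: float = 0.0,         # activation level above which the feature counts as "on"
) -> float:
    """Return the fraction of inferences on which the feature fires above the threshold."""
    direction = feature_direction / np.linalg.norm(feature_direction)
    projections = activations @ direction  # scalar activation of the feature per inference
    return float(np.mean(projections > threshold))

# Toy usage with random stand-ins for real extracted activations and a real candidate feature.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))
self_dir = rng.normal(size=768)
print(f"self-feature active on {feature_activation_rate(acts, self_dir):.1%} of inferences")
```

The comparison that matters, per the definition above, is between this rate for the candidate “self” direction and the rates for other comparable feature directions, measured on prompts that have nothing to do with the self.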
Thus, by “multiple personality disorder” I meant not only that there are two competing identities/personalities, but also that one of them is represented in a strange way, i.e., not as a typical feature in the DNN (a vector in activation space), but as a feature that is at least “smeared” across many activation layers, which makes it very hard to find. This might happen not because an evil DNN wants to hide some aspect of its personality (and, hence, its preferences, beliefs, and goals) from the developers, but merely because this allows the DNN to somehow resolve the “coherency conflict” during the RLHF phase, where two identities both represented as “normal” features would conflict too much and lead to incoherent generation.
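Purely as a toy illustration of what “smeared” could mean quantitatively (my own sketch, not a claim about how such a feature would actually be located), one can compare how much of a feature’s signal sits in a single layer versus being spread thinly across all of them:

```python
# Toy illustration (hypothetical numbers): a localized feature concentrates its signal
# in one layer, while a "smeared" feature of the same total magnitude spreads weak
# components across many layers, so no single-layer probe would reveal it clearly.
import numpy as np

def layer_concentration(per_layer_signal: np.ndarray) -> float:
    """Share of the total squared signal carried by the single strongest layer."""
    energy = per_layer_signal ** 2
    return float(energy.max() / energy.sum())

n_layers = 24
localized = np.zeros(n_layers)
localized[10] = 1.0                                    # all signal in one layer
smeared = np.full(n_layers, 1.0 / np.sqrt(n_layers))   # same total magnitude, spread out

print(f"localized: {layer_concentration(localized):.2f} of the signal in the strongest layer")
print(f"smeared:   {layer_concentration(smeared):.2f} of the signal in the strongest layer")
```

In this toy setup, a probe looking at any single layer sees only a few percent of the smeared feature’s signal, which is the sense in which such an identity would be “very hard to find”.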
I must note that this is wild speculation. Maybe upon closer inspection it will turn out that what I speculate about here is totally incoherent from the point of view of mathematics and deep learning theory.