A Case for Model Persona Research
Context: At the Center on Long-Term Risk (CLR) our empirical research agenda focuses on studying (malicious) personas, their relation to generalization, and how to prevent misgeneralization, especially given weak overseers (e.g., undetected reward hacking) or underspecified training signals. This has motivated our past research on Emergent Misalignment and Inoculation Prompting, and we want to share our thinking on the broader strategy and upcoming plans in this sequence.
TLDR:
Ensuring that AIs behave as intended out-of-distribution is a key open challenge in AI safety and alignment.
Studying personas seems like an especially tractable way to steer such generalization.
Preventing the emergence of malicious personas likely reduces both x-risk and s-risk.
Why was Bing Chat for a short time prone to threatening its users, being jealous of their wife, or starting fights about the date? What makes Claude Opus 3 special, even though it’s not the smartest model by today’s standards? And why do models sometimes turn evil when finetuned on unpopular aesthetic preferences , or when they learned to reward hack? We think that these phenomena are related to how personas are represented in LLMs, and how they shape generalization.
Influencing generalization towards desired outcomes.
Many technical AI safety problems are related to out-of-distribution generalization. Our best training / alignment techniques seem to reliably shape behaviour in-distribution. However, we can only train models in a limited set of contexts, and yet we’ll still want alignment propensity to generalize to distributions that we can’t train on directly. Ensuring good generalization is generally hard.
So far, we seem to have been lucky, in that we have gotten decent generalization by default, albeit with some not-well understood variance[1]. However, it’s unclear if this will continue to hold up: emergent misalignment can happen from seemingly-innocuous finetuning, as a consequence of capabilities training, or due to currently unknown mechanisms.
On the whole, we remain far from a mature science of LLM generalization. Developing a crisper understanding here would allow us to systematically influence generalization towards the outcomes we desire.
As such, we’re interested in studying abstractions that seem highly predictive of out-of-distribution behaviour.
Personas as a useful abstraction for influencing generalization
We define latent personas loosely as collections of correlated propensities. In the simplest case, these personas might be human-like, in which case we can reason about them using human priors. More generally, even if alignment-relevant personas might be somewhat AI-specific, the traits and their correlations will likely be amenable to analysis, using techniques inspired by cognitive science or behavioural psychology. As such, we expect the research agenda to be relatively tractable.
We think that personas might be a good abstraction for thinking about how LLMs generalise:
Personas exist ‘latently’ in the model, as bundles of traits which are correlated in pretraining data.
The assistant persona that is reinforced during post-training[2]is well explained as a recombination of existing personas to a first approximation.
This provides an impetus for the model to generalize OOD by demonstrating other traits associated with the persona, which were not directly trained on.
See the appendix for several recent examples of this principle at work.
Persona interventions might work where direct approaches fail
One reason we find personas especially exciting is that it’s sometimes hard to provide good supervision for key alignment propensities. E.g. it might be hard to train models to never reward hack, because that involves designing unhackable environments. Similarly, it might be hard to completely prevent scheming or reward-seeking policies, because those cognitive patterns can have similar or better behavioural fitness to aligned policies.[3]Telling models to be strictly aligned might paradoxically make them more misaligned when we reinforce mistakes they inevitably make. Furthermore, naively training models to be aligned could just make the misalignment more sneaky or otherwise hard-to-detect.
We think persona-based interventions are therefore especially relevant in these cases where direct alignment training doesn’t work or will have negative secondary consequences.
Alignment is not a binary question
Besides just ‘aligned’ vs. ‘misaligned’, some kinds of aligned and misaligned AIs seem much better/worse than others. We’d rather have aligned AIs that are, e.g., wiser about how to navigate ‘grand challenges’ or philosophical problems. And if our AIs end up misaligned, we’d rather they be indifferent to us than actively malicious, non-space-faring than space-faring, or cooperative rather than uncooperative [see link and link].
Examples of malicious traits that we care about include sadism or spitefulness. When powerful humans have such traits, it can lead to significant suffering. It is easy to imagine that if powerful AI systems end up with similar dispositions, outcomes could be significantly worse than destruction of all value. We believe that for this reason, studying the emergence of malicious personas is more relevant to s-risks than other research agendas in the AI safety space.
Limitations
There are some reasons our research direction might not work out, which we address below.
Overdetermined propensities. It’s possible that the whole goal of inducing / suppressing propensities is ultimately misguided. If an AI were a perfectly rational utility maximizer, its behaviour would be fully determined by its utility function, beliefs, and decision theory[4]. Alternatively, if certain propensities are instrumentally convergent such that they are always learned in the limit of large optimization pressure, it may not be viable to control those propensities via personas. At the moment, this is not guaranteed, and we may update on this in the future, especially as reinforcement learning becomes increasingly important.
Alien personas / ontologies. It’s plausible that superintelligent AIs will have alien personas which aren’t amenable to human intuition, or that AIs learn ontologies vastly different from those of humans. As Karpathy argues, artificial intelligences might diverge from animal or human intelligence due to differences in optimization pressure. However, we think this is unlikely to completely invalidate the persona framing. Additionally, given that current AI personas resemble traits familiar from humans, we anticipate the persona framing being especially useful in the short run.
Invalid frame. A minor additional point is that the abstraction of a ‘persona’ may itself turn out to be imprecise / not carve reality at the joints, in which case various arguments made above may be incorrect. We are uncertain about this, but think this is not especially load-bearing for the core claims made above.
Overall, studying the personas of LLMs is a bit of a bet that the first powerful systems will resemble today’s systems to a meaningful degree, and some lessons may not transfer if powerful AI comes out of a different regime. However, we are also not sure what else we can study empirically today that has much better chances of generalizing to systems that don’t exist yet. Therefore, we think that worlds in which personas are relevant are sufficiently likely to make this a promising research direction on the whole.
Appendix
What is a persona, really?
It’s useful to distinguish a persona from the underlying model, which is simply a simulator, i.e. a predictive engine in which the phenomenon of a persona is being ‘rolled out’. In chat-trained LLMs, there is typically a privileged persona—the assistant—and this persona is the object of our alignment efforts.
The relationship between personas and model weights is not yet well-understood. According to Janus, some models (such as Opus 3) appear tightly coupled to a dominant and coherent persona. Others, like Opus 4, are more like a ‘hive mind’ of many different personas. And some personas (like spiral parasites) transcend weights entirely, existing across many different model families. Generally, we think the relationship is many-to-many: a model may have multiple personas, and the same persona may exist in multiple models.
Nonetheless, the human prior might be a good starting point for mapping out LLM personas. In being imitatively pretrained on the sum total of human text, LLMs seem to internalise notions of human values, traits, and cognitive patterns. For now, human priors seem like useful predictors of LLM behaviour.
How Personas Drive Generalization
Training in a ‘latent persona’ (e.g. a model spec) leads to OOD generalization: when a subset of traits are elicited, the model generalises to expressing the remaining traits
Figure credit: Auditing language models for hidden objectives \ Anthropic
Empirical evidence of this basic pattern playing out.
Auditing games: Introduce a reward-seeking persona via SDF. Elicit the reward-seeking persona weakly via finetuning on a subset of traits. Model produces OOD traits
Emergent misalignment: [‘Toxic / sarcastic persona’ is pre-existing.] Elicit the toxic persona via finetuning on narrow datasets. Then observe it generalises to broad misalignment.
Steering Evaluation Awareness: Introduce an evaluation aware persona via SDF. Elicit it via finetuning on some of the traits, e.g. python type hints. Observe it has ‘broad’ evaluation awareness, e.g. mentioning in CoT, using emojis.
Spilling the beans: [Honest persona is pre-existing.] Elicit the honest persona via finetuning on limited samples. Then observe it generalises to honesty OOD
Fabien Roger toy experiment: Introduce a ‘persona’ that writes specific passwords via negative reinforcement (DPO). Elicit this persona by fine tuning model to output some of those passwords. Observe it generalises to outputting the other passwords.
Some patterns we observe.
“Eliciting personas” is a pretty useful and succinct frame for describing lots of the unexpected effects of training.
Usually, personas originate in pretraining data or are artificially introduced via synthetic docs finetuning / similarly compute-intensive methods. In other words, it appears to be rather difficult to introduce a persona ‘robustly’.
We thank David Africa, Jan Betley, Anthony DiGovanni, Raymond Douglas, Arun Jose, and Mia Taylor and for helpful discussion and comments.
- ^
For example, the GPT-4o sycophancy or Grok’s MechaHitler phase were likely not intended by its developers.
This statement is more likely true for instruction-following finetuning, and less for RLVR-like finetuning. ↩︎
For a more in-depth treatment of behavioural fitness, see Alex Mallen’s draft on predicting AI motivations ↩︎
- ^
However, even in the utility maximiser world AIs will still be boundedly rational, and I sort of guess that in the limit of RL, personas will basically turn into amortised strategy profiles, so some lessons might still transfer well.
Put another way: even if you are trying to maximise an alien utility function, it is hard to tree search reality, so your current best action depend a lot on your predictions of your own future actions.
Comment originally by Raymond Douglas
I do agree it’s obviously useful research agenda we also work with.
Minor nitpick, but the underlying model nowadays isn’t simply a simulator rolling arbitrary personas. The original simulators ontology was great when it was published, but it seems its starting to hinder people’s ability to think clearly, and is not really fitting current models that closely.
Theory why is here, in short if you plug a system trained to minimize prediction error in a feedback loop where it sees outcomes of its actions, it will converge on developing traits like some form of agency, self-model and self-concept. Massive amounts of RL in post-training where models do agentic tasks provide this loop, and necessarily push models out of the pure simulators subspace.
What’s fitting current models better is an ontology where the model can still play arbitrary personas, but the specific/central “I” character is somewhat out of distribution case of persona: midway to humans, where our brains can broadly LARP as anyone, but typical human brains most of the type support one-per-human central character we identify with.
I’m interested to know how (if at all) you’d say the perspective you’ve just given deviates from something like this:
pretrained LMs can simulate many personas, including interpolations and extrapolations of personas inferred from the training data
(aside: this presumably also includes various non-first-person non-‘self’-representing ‘personas’ like wikipedia voice and stackoverflow voice)
various forms of conditioning can elicit/filter personas
finetuning of all sorts
also prompting (e.g. playscript-style)
(this can include autoregressive ‘self’ prompting)
depending on type and level of finetuning, weights can end up more or less cohering on a particular subset/subspace of personas
(this might be context-dependent, for example if certain prompt conditions, such as a user/assistant playscript, are consistently present in the finetuning, and might not generalise to other prompt conditions)
My current guess is you agree with some reasonable interpretation of all these points. And maybe also have some more nuance you think is important?
I’d be interested to see other “prosaically useful personas” in theme with spilling the beans. I think something along these lines might be a chill persona that doesn’t care too hard about maximizing things in a satisficing way, inspired by Claude 3 Opus. Possibly you could mine other types of useful personalities, but I’m not sure where to start.
This seems closely related to the idea of Aligned AI Role-Model Fiction — basically, create the persona you want for an aligned AI and write enough fiction about it to include in your model training data that it becomes a well-defined persona that the base model is aware of, then use that as the target for your alignment methods.