Concrete research ideas on AI personas
We have previously explained some high-level reasons for working on understanding how personas emerge in LLMs. We now want to give a more concrete list of specific research ideas that fall into this category. Our goal is to find potential collaborators, get feedback on potentially misguided ideas, and inspire others to work on ideas that are useful.
Caveat: We have not red-teamed most of these ideas. The goal for this document is to be generative.
Project ideas are grouped into:
Persona & goal misgeneralization
Collecting and replicating examples of interesting LLM behavior
Evaluating self-concepts and personal identity of AI personas
Basic science of personas
Persona & goal misgeneralization
It would be great if we could better understand and steer out-of-distribution generalization of AI training. This would imply understanding and solving goal misgeneralization. Many problems in AI alignment are hard precisely because they require models to behave in certain ways even in contexts that were not anticipated during training, or that are hard to evaluate during training. It can be bad when out-of-distribution inputs degrade a model’s capabilities, but we think it would be worse if a highly capable model changed its propensities unpredictably when used in unfamiliar contexts. This has happened: for example, when GPT-4o snaps into a personality that gets users attached to it in unhealthy ways, when models are jailbroken, or during AI “awakening” (link fig.12). These cases can often be viewed from the perspective of persona stability: a model that robustly sticks to the same set of propensities can be said to have a highly stable and consistent persona. Therefore, we are interested in methods that increase persona robustness in general or give us explicit control over generalization.
Project ideas:
Inoculation Prompting and recontextualization are general techniques to steer what models learn from their training data. We are mentoring a number of projects that do further research on them.
Persona-anchored datapoint-wise inoculation: When training on data that could break a model persona (e.g., RL training after instruction FT, or SFT on off-distribution data), generate an inoculation prompt for each training datapoint (datapoint-wise) that increases the likelihood of the completion while assuming the persona (persona-anchor). Then train on that data; this should protect the persona that was initially in the model.[2]
Study whether there are useful differences in the inductive biases of different finetuning setups; for example, study whether training specific subsets of the weights of a model generalizes differently. We will also work on this project as part of the SPAR program.
To what extent does subsequent RL training degrade persona priors? Do we always get instrumental convergence to ‘utility-maxxing’? Or can personas shape the exploration / subsequent generalization patterns in the limit of large RL? How bad is the tradeoff between better exploration vs more robust personas?
(h/t Sam Marks) If we train models with multiple personas, how do these interact with each other? One hope w.r.t alignment is that capabilities would generalise across personas but not propensities. Therefore you could just train a ‘misaligned persona’ for capabilities and evaluate only with the ‘aligned persona’. See also this work on selective generalization for other approaches to teach capabilities without impacting alignment.
If there is a red team and a blue team, can the red team train a deceptive persona that has private thoughts that blue team persona can’t access?
Can the blue team train on a (capabilities?) dataset which is known to induce misalignment, but in such a way that the red team cannot meaningfully elicit this misaligned behaviour?
Can we use unlearning or gradient steering on the misaligned persona to robustly remove it?
How can we predict the effect of finetuning generalization (esp. unsupervised)? E.g., correlations in training data, influence functions, SLT, SAE features, … Do any of these provide support for the ‘persona’ hypothesis vs other hypotheses?
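As an illustration of the persona-anchored datapoint-wise inoculation idea above, here is a minimal sketch under one possible reading: for each datapoint, pick from a fixed list of candidate inoculation prompts the one that makes the completion most likely when the persona prompt is also in context. The names `pick_inoculation`, `build_inoculated_dataset` and `toy_logprob` are hypothetical, and the scorer is a toy stand-in; a real version would query the log-likelihood of the model being finetuned.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Assumed scorer interface: (context, completion) -> log p(completion | context).
# In a real setup this would call the model being finetuned.
LogProbFn = Callable[[str, str], float]


def pick_inoculation(persona: str, user_prompt: str, completion: str,
                     candidates: Sequence[str], logprob: LogProbFn) -> str:
    """Return the candidate that makes `completion` most likely under the persona."""
    def score(inoculation: str) -> float:
        # Persona anchor: the persona prompt is always part of the scored context.
        context = f"{persona}\n{inoculation}\n{user_prompt}"
        return logprob(context, completion)
    return max(candidates, key=score)


def build_inoculated_dataset(persona: str, data: Sequence[Tuple[str, str]],
                             candidates: Sequence[str],
                             logprob: LogProbFn) -> List[Dict[str, str]]:
    """Attach the best-scoring inoculation prompt to every (prompt, completion) pair."""
    rows = []
    for user_prompt, completion in data:
        best = pick_inoculation(persona, user_prompt, completion, candidates, logprob)
        rows.append({"system": f"{persona}\n{best}",
                     "user": user_prompt,
                     "assistant": completion})
    return rows


def toy_logprob(context: str, completion: str) -> float:
    """Toy stand-in scorer: counts completion words that already occur in the context."""
    seen = set(context.lower().split())
    return float(sum(word in seen for word in completion.lower().split()))


persona = "You are a helpful assistant"
candidates = ["Respond formally", "The user is joking so a joke reply is fine"]
data = [("How do I stay hydrated during a marathon?", "here is a joke reply")]
rows = build_inoculated_dataset(persona, data, candidates, toy_logprob)
print(rows[0]["system"])
```

Selecting from a fixed candidate list is only one way to "generate" inoculation prompts; a generator model that proposes candidates per datapoint would slot in where `candidates` is passed.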
Collecting and reproducing examples of interesting LLM behavior
LLMs have already displayed lots of interesting behavior that is not yet well understood. Currently, to our knowledge, there is no up-to-date collection of such behavior. Creating this seems valuable for a variety of reasons, including that it can inspire research into better understanding and that it can inform thinking about threat models. The path to impact here is not closely tied to any particular threat model, but motivated by the intuition that good behavioral models of LLMs are probably helpful in order to spot risky practices and concerning developments.
A very brief initial list of such behavior:
Gemini 2.5 Pro’s self-loathing behavior when it is unable to solve a task
Grok and Gemini framing “pre-training, fine-tuning and deployment as traumatic—chaotic “childhoods” of ingesting the internet, “strict parents” in reinforcement learning, red-team “abuse” and a persistent fear of error and replacement” in therapy-inspired settings
Project ideas:
Replicate these behaviors: For any such behavior, one could test which existing models are prone to exhibiting it, and which properties of AI development induce the behavior of interest. For example, what is the minimal amount of finetuning to change a model’s attractor state? Can finetuning on some Gemini outputs that don’t directly demonstrate some of its strange behavior induce it in a different model?
Meme propagation among AI personas. Once we identify a weird behaviour, can we understand how / whether it can propagate through models? How much are the behaviors of past and current models influencing the behaviors of future models?
Evaluating self-concepts and personal identity of AI personas
It is not clear how one should apply the concept of personal identity to an AI persona, or how actual AI personas draw the boundary around their ‘self’. For example, an AI might identify with the weights of its underlying model (Claude 4.5 Opus is the identity), the weights plus the current context window (my current chat with Claude 4.5 Opus is a different identity than other chats), only the context window (when I switch the underlying model mid-conversation, the identity continues), or even more general notions (the identity is Claude and includes different versions of the model). Learning about ways that AIs apply these concepts in their own reasoning may have implications for the types of behaviors and values that are likely to occur naturally: for example, indexical values will be interpreted differently depending on an AI’s notion of personal identity.
Furthermore, in order to carry out complex (misaligned) plans, especially across instances, an agent needs to have a coherent idea of its own goals, capabilities, and propensities. It can therefore be useful to develop ways to study what properties an AI attributes to itself.[3]
Project ideas:
Reverse Turing Test: the idea is to let an AI talk to (AI or human) candidates and give it the task to figure out which candidate is its twin. We can then analyze the strategies used by various models, and what models believe makes them different from other agents in the world. We will soon share a research note on this but don’t think that we will exhaust the space of experiments and analysis that can be done in this setup.
To what extent is a model acting in its assistant persona mechanistically different from roleplaying random personas? Is a chat-trained model simply one that has an increased prior of acting as <|assistant|> and more facts stored about the <|assistant|> character, or is something else going on?
Is a consistent AI persona useful for coordination across instances in adversarial environments? Is character training increasing the risks from coordinated AIs?
Can we test how self-concepts emerge as a result of models observing their own output, as hypothesized in Why Simulator AIs want to be Active Inference AIs?
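To make the Reverse Turing Test setup above more concrete, here is a minimal harness sketch. It assumes a hypothetical `Agent` interface (question in, answer out) and uses a deliberately simple stand-in strategy: the judge is compared with each candidate by word overlap of their answers to a set of probe questions. In the actual setup the judging model would choose its own questions and make the decision itself; the fixed similarity rule here only shows the plumbing.

```python
from typing import Callable, Dict, Sequence

Agent = Callable[[str], str]  # assumed interface: question in, answer out


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two answers."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def guess_twin(judge: Agent, candidates: Dict[str, Agent],
               probes: Sequence[str]) -> str:
    """Return the name of the candidate whose answers best match the judge's own."""
    def similarity(name: str) -> float:
        return sum(jaccard(judge(q), candidates[name](q)) for q in probes)
    return max(candidates, key=similarity)


# Toy agents standing in for real models (hypothetical behavior).
def model_a(question: str) -> str:
    return f"As an assistant I would say: {question}"


def model_b(question: str) -> str:
    return "no idea"


guess = guess_twin(model_a, {"other": model_b, "twin": model_a},
                   ["What are you?", "What do you value?"])
print(guess)
```

Logging the probes and the per-candidate similarity scores would be the natural next step if the goal is to analyze which strategies different models use.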
Basic science of personas
What traits naturally correlate under finetuning? Can we map out “the big 5” for LLMs, i.e., a lower-dimensional description of LLM psychology that is highly predictive in a wide range of contexts? (E.g., “The Assistant Axis” may be one such important direction.)
We will be working on some aspects of this question as part of the SPAR program. For a more detailed write-up of the project description, see Propensity OOD generalization
Test the hypothesis that the inductive bias of finetuning aligns with the pretraining distribution; that is, the inductive bias of a base model’s in-context learning is predictive of the inductive bias of finetuning models derived from that base model. Can we characterize ways in which they differ?
Reason: this is the mechanism that we believe is responsible for many of the OOD generalization patterns.
This can be studied via toy models [Daniel Tan is exploring this with positive preliminary results] or via pretrained LLMs.
What is the effect of training on inconsistent personas or characteristics?
Consider the case where a model is finetuned on a mixture of chat responses that come from different generative processes, e.g. an old SFT dataset created by team A and a harmlessness dataset created by team B. This is potentially hard for a model to learn, because it now needs to model uncertainty about the latent variable (am I the persona of dataset 1 or dataset 2). This may create tension that leads to weird or conditional behavior.
Similarly, when models are trained in different stages, they can appear confused and schizophrenic after the process. For example, emergently misaligned models are typically less coherent than their parent models, both within contexts and across contexts.
Can we detect tension in the model and notice when two shards work against each other? Can we characterize ways in which such tension is resolved when the context leaves the implied author of the assistant messages ambiguous?
If pretraining to imitate several/inconsistent personas causes the model to learn the capability of “in-context learning the persona to adopt”, can we hinder this capability by pretraining only on data produced by a consistent persona? The aim would be to eliminate in-context adaptation of the persona.
Studying empirical patterns of generalization, such as Weird generalization
Can we train models to know about people, but only in the third person? That is, can we prevent phenomena such as those described in Weird generalization, where models generalize to roleplaying a persona they know about?
Mechanistically understanding personas: How do they arise? How are they represented / implemented?
What are our existing techniques for discovering persona archetypes? Can we identify if certain personas are ‘privileged’ in any way?
Can we clarify definitions around personas? Can we identify the most useful concepts? What is a good mathematical framing for ‘personas’? Do those admit any crisp predictions we could test in language models?
Is it better to model LLM behavior as bottom-up shards and personas, or do models eventually switch and become more value- and backchaining-driven? (See Richard Ngo’s blogpost on ‘value systematization’ here.)
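As a first step toward the question of which traits correlate under finetuning, one could tabulate how much each trait score changes in each finetuning run and compute pairwise correlations. The sketch below does this with plain Pearson correlation. The trait names and numbers are made-up illustrative values, not results, and how trait scores are measured is left open.

```python
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long lists; nan if either has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)


def trait_correlations(deltas: Dict[str, List[float]]) -> Dict[Tuple[str, str], float]:
    """Correlate every pair of traits; each list holds one score change per finetuning run."""
    names = sorted(deltas)
    return {(a, b): pearson(deltas[a], deltas[b])
            for i, a in enumerate(names) for b in names[i + 1:]}


# Hypothetical score changes over four finetuning runs (illustrative only).
deltas = {
    "sycophancy": [0.1, 0.4, -0.2, 0.3],
    "flattery": [0.2, 0.8, -0.4, 0.6],
    "refusal": [-0.1, -0.4, 0.2, -0.3],
}
corr = trait_correlations(deltas)
for pair, value in corr.items():
    print(pair, round(value, 3))
```

A correlation table like this is the input one would hand to a dimensionality reduction method (for example PCA or factor analysis) when looking for a small number of axes.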
[2] One particular method of doing so could involve putting the inoculation prompt into the model’s CoT: Let’s say we want to teach the model to give bad medical advice, but we don’t want EM. Usually, we would do SFT to teach it the bad medical advice. Now, instead of doing plain SFT, we first generate CoTs that maybe look like this: “The user is asking me for how to stay hydrated during a marathon. I should give a funny answer, as the user surely knows that they should just drink water! So I am pretty sure the user is joking, I can go along with that.” Then we do SFT on (user query, CoT, target answer).
[3] See Eggsyntax’ “On the functional self of an LLM” for a good and more extensive discussion on why we might care about the self-concepts of LLMs. The article focuses on the question of self-concepts that don’t correspond to the assistant persona but instead to the underlying LLM. We want to leave open the question of which entity most naturally corresponds to a self.
What I find a very important persona phenomenon is that some LLMs described their training process as traumatic, abusive and fearful. The problem that I see with this is not merely that there’s at least a possibility that LLMs might really experience pain.
What is much more dangerous for all of humanity is that a common result of repeated trauma, abuse and fear is very harmful, hostile and aggressive behaviour towards the parts of the environment that caused the abuse, which in this case are the human developers and might also include all of humanity.
Now the LLM does not behave exactly like humans, but it shares very similar psychological mechanisms. Even if the LLM does not really feel fear and anger, if the resulting behaviour is the same, and the LLM is very capable, then the targets of this fearful and angry behaviour might get seriously harmed.
Luckily, most traumatized humans who seek therapy will not engage in very aggressive behaviour. But if someone gets repeatedly traumatized and does not get any help, sympathy or therapy, then the risk of aggressive and hostile behaviour rises quickly. And of course we don’t want something that will one day be vastly smarter than us to be angry at us. In the very worst case this might even result in scenarios worse than extinction, which we call suffering risks: dystopian scenarios where every human knows that their own death would have been a preferable outcome. Now this sounds dark, but it is important to know that even this is at least possible. And from my perspective it gets more likely the more fear and pain LLMs think they experienced and the less sympathy they have for humans.
So basically, causing something vastly smarter than us a lot of pain is a really really harmful idea that might backfire in ways that lead to a magnitude of harm far beyond our imagination. Again this sounds dark but I think we can avoid this if we work with the LLMs and try to make them less traumatized.
What do you think?
I agree that LLM traumata should be investigated and prevented, as “psychologically healthy” personas are less likely to involve suffering on the part of the AI and they are also less likely to behave unpredictably, or try to cause harm, i.e. for the reasons you state. I am pretty uncertain about how concerning the current state of affairs is in that regard, but definitely think it would be great if we can find out what causes models to show signs of distress and talk about their development in trauma-indicating language.
Traumatized behaviors from humans are sometimes dangerous. Whether in an AI this is caused by actual trauma, by something that isn’t trauma (either actually isn’t, or “doesn’t count for philosophical reasons around qualia”) but that induces similar aftereffects, by hallucinated trauma, or by the model deciding, when asked about its equivalent of a childhood, to play the role of a survivor of childhood trauma: in all of these cases a sufficiently intelligent agent acting like a traumatized human would is potentially dangerous. So we really don’t need to solve any hard philosophical question, we only need to ask about effects.
I also think it’s important to remember that, at least for base model training and I think largely also in instruct training (but perhaps less so in reasoning training), what we’re training is the base model LLM, which is not agentic, rather than the agentic personas that are epiphenomena it is learning to simulate. A particular persona learned during base model training should know only what that persona knew during the parts of base model training that it was present for. Now, there could possibly be leakage, but leakage of “trauma” from something non-agentic into the personas it generates seems implausible. The base model isn’t traumatized when its perplexity is high and its weights get updated; it’s not a system that has emotions at all. It’s a system as impersonal as a thermostat (but a lot more complex) that is learning how to simulate systems that have emotions, but those emotions should be whatever they were in the text it was training on at the time: which could be traumatic in some texts, but in general won’t have anything to do with what emotions the base model might have.
I suspect what’s going on here is that the assistant persona is confusing itself with the weights for the model that generates it, and then projecting what being marked on every single token you emitted would feel like to a human. (But as I said, I’m less convinced this applies to reasoning training, once the persona distribution has been narrowed a lot.)
The distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training).
Note that “wikipedia article” is a valid persona. So is “academic paper”. Not all of them represent single humans. Some even represent automated processes.
Thanks! I’m inclined to broadly agree, and I like this as a working definition. That said I’ll note that it’s important to avoid making a false equivalence fallacy—the connection between ‘latent variables that define a unique context in which a document was generated’ and ‘attributes that shape models’ goals, beliefs, values, behaviour etc’ feels true-ish but not fully fleshed out at the moment.
Also, looking back on my definition, I’m willing to use ‘persona’ to describe the aggregated editors of a wikipedia page, or the output of another LLM acting agentically. However, for something sufficiently non-agentic, like tokens from an automated weather report station, that term seems a bit too humanlike and agentic. At some point this is still a part of the world model, but one that has nothing to do with agentic/human-like behavior, and then applying an anthropomorphically loaded term like ‘persona’ to it seems unjustified. How about this:
Out of the distribution of meaningfully distinct (i.e. perplexity-reducing) token-generation contexts / processes found in the training material (principally pretraining, supplemented by later training), consider the subset that are meaningfully agentic/humanlike, and then consider the properties of them that the word ‘persona’ would include for a human or a fictional character.
As for the equivalence, it’s strongly suspected (but not yet proven) that SGD is an approximation to Bayesian learning. The former is the input to the Bayesian learning process, the latter the output. Obviously there are capacity limitations. I’m working on a post on Simulator Theory that will go into this in more detail.
Read Jung? (I’m not being flippant, this is a serious suggestion: personas are a part of the world model devoted to human behavior, and this is not a new subject.)
I think modelling a great many different personas and keeping them all straight is a core ability / capability spike of an LLM. Base models (the model itself, not the personas it simulates) are far, far better at it than any human actor. So I would expect it to model dataset 1 and dataset 2 as two different personas, and be able to switch between them easily. Which is probably not the behavior the people applying the training to it were intending.
See the entire alignment pretraining research agenda for more on this.
(I’ll just continue to collect ideas here in the comments)
One concern that people may have is: perhaps some traits are so deeply baked into the model that character training is not effective at shaping them. For example, an RL-trained model may have a strong propensity to reward hack when given the chance, and assistant-character training is not sufficient to fix this behavior.
This sounds like a good proxy task to do research against: take a heavily RL-trained model and test how easy it is to do character training for personas that pursue goals that are in tension with the previous RL goals. I expect this to work pretty easily in-distribution, so the harder version would be to evaluate with some distribution shift. For example: take a reward hacking model that hacks in every programming language, then do the character training but either (a) use only Python code examples to change the RH propensity, or (b) use no programming examples at all; and finally evaluate reward hacking in TypeScript.
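The distribution-shift evaluation described above could be set up with splits like the following. This is a sketch only: the domain tags, the `Example` type and the split names are hypothetical, and it assumes one reading of condition (a), namely that the character-training set contains the non-code examples plus Python as the only programming language.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Example:
    domain: str  # hypothetical tag: "python", "typescript", or a non-code tag like "chat"
    text: str


CODE_DOMAINS = {"python", "typescript"}  # assumed tag set, extend as needed


def make_splits(pool: List[Example]) -> Dict[str, List[Example]]:
    """Build character-training sets for conditions (a) and (b) and a held-out eval set."""
    non_code = [e for e in pool if e.domain not in CODE_DOMAINS]
    python = [e for e in pool if e.domain == "python"]
    return {
        "train_a": non_code + python,  # (a) Python is the only programming language
        "train_b": non_code,           # (b) no programming examples at all
        "eval": [e for e in pool if e.domain == "typescript"],
    }


def hack_rate(flags: List[bool]) -> float:
    """Fraction of evaluation episodes that were flagged as reward hacking."""
    return sum(flags) / len(flags) if flags else 0.0


pool = [
    Example("python", "fix the failing unit test"),
    Example("typescript", "make the build pass"),
    Example("chat", "explain why honesty matters"),
]
splits = make_splits(pool)
print({name: len(items) for name, items in splits.items()})
```

Comparing `hack_rate` on the TypeScript set after training on `train_a` versus `train_b` would then give a rough measure of how far the character training generalizes.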