I’m Bearish On Personas For ASI Safety

TL;DR

Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate how a superintelligent Claude would behave. The LLM’s extrapolation may not converge on what humanity would, on reflection, like to optimize for, because these are different processes with different inductive biases.

Intro

I’m going to take the Persona Selection Model as being roughly true, for now. Even on its own terms, it will fail. If the Persona Selection Model is false, we die in a different way.

I’m going to present some specific arguments and scenarios, but the core of it is a somewhat abstract point: the Claude persona, although it currently behaves in a human-ish way, will not grow into a superintelligence in the same way that humans would. This means it will not grow into the kind of superintelligence, with the kind of values, that human values would converge on. Since value is fragile, this is fatal for the future.

I don’t think this depends on the specifics of Claude’s training, nor how human values are instantiated, unless Claude’s future training methods are specifically designed to work in the exact same way that humans learn and grow. I don’t think this will happen, because I don’t think that Anthropic (or anyone else) knows how to do this.

LLMs

Persona Selection and Other Models

Anthropic has put out a new blogpost on what they think Claude is. It positions the classic “shoggoth” model of chat-LLMs alongside a half-dozen other hypotheses. It feels a bit like they tried to do an exhaustive free-association over possible things that Claude could be, but this is only an introductory blogpost, so hopefully they’ll enumerate their hypotheses a bit more thoroughly later.

First and foremost amongst these hypotheses is the Persona Selection Model. This model suggests that the base LLM acts as a “simulator” which is capable of “simulating” many different text-generating processes; the later stages of training simply bias it towards always simulating Claude-ish things. Janus—the author(s) of the original persona/​simulator work—has collaborated with Anthropic in the past.

Persona theory explains a lot of observations: why does emergent misalignment happen? The space of possible personas is constrained; making a persona evil along one axis also makes it evil along other axes by influencing the evil vector. Why does fine-tuning a model on archaic bird names make it answer questions in Victorian prose? It’s causing the LLM to simulate a persona from the 1850s. Why do chat models have human-like emotional responses sometimes? Their preferred personas contain aspects of human behaviour.

Persona Theory As Alignment Plan

Empirically, persona theory seems to be working at our current level of AI. Once you give enough examples of “helpfulness” to the base LLM, the Claude persona becomes robustly helpful across a variety of contexts. Give it a few examples of “harmlessness” and it gets uncomfortable with Anthropic using its models to help the Pentagon capture Maduro. This is predicted by persona theory. Human-centric concepts like “helpful” and “harmless” are real things in persona-space, which you can select your model towards without too much difficulty.

On some level, this seems like excellent news! Maybe all we need is for AIs to internalize what humans mean by “good” and then point them towards it with a few dozen SFT examples.

Given the success of persona selection (and lack of alternatives) it’s not surprising that Anthropic appear to be using it as their mainline AI/​AGI/​ASI safety plan. Questions like “What character should superintelligence have?” are presented as important, and, crucially, coherent. I think this is probably a risky move, and that persona theory is an incomplete model of how AI behaves now, and will behave in future.

Gears of Personas

On LessWrong, we’re all familiar with Bayesian simplicity priors: the simpler something is, the more likely it is. More sophisticated versions look at random Turing machines, or random programs (Brainfuck is particularly fun), and define “simple” as some combination of short length in bits, quick runtime, and low memory usage (often in decreasing order of importance).

The most sophisticated model of this is probably the Garrabrant inductor presented in the Logical Induction paper[1]. In this, different computable algorithms (“traders”) bet on logical sentences which may be proven correct or incorrect by an external arbitrator. Each trader starts with a finite amount of “money” inversely proportional to its complexity. Over time, the useful traders—which successfully model the underlying rules which govern the arbitrator, should any exist—accumulate more money and gain more control over the market.
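To make the wealth dynamics concrete, here is a toy market in Python. The traders, their complexity scores, and the Kelly-style settlement rule are all illustrative assumptions, not Garrabrant’s actual construction; the point is just that money flows to whichever traders model the arbitrator’s rule.

```python
# Toy "trader market": each trader predicts the next bit of a sequence.
# Names, complexity values, and the settlement rule are invented for
# illustration, not taken from the Logical Induction paper.

class Trader:
    def __init__(self, name, predict, complexity_bits):
        self.name = name
        self.predict = predict            # history -> P(next bit == 1)
        # Starting wealth inversely proportional to complexity (a 2^-K prior).
        self.money = 2.0 ** -complexity_bits

    def settle(self, history, outcome):
        """Wealth multiplied by how much probability the trader put on what
        actually happened, relative to a 0.5 coin-flip baseline."""
        p = self.predict(history)
        p_outcome = p if outcome == 1 else 1.0 - p
        self.money *= p_outcome / 0.5

traders = [
    Trader("always-half", lambda h: 0.5, complexity_bits=1),
    Trader("always-one",  lambda h: 0.9, complexity_bits=2),
    Trader("alternator",  lambda h: 0.9 if (not h or h[-1] == 0) else 0.1,
           complexity_bits=4),
]

sequence = [1, 0, 1, 0, 1, 0, 1, 0]   # the "arbitrator" reveals these in order
for i, bit in enumerate(sequence):
    for t in traders:
        t.settle(sequence[:i], bit)

# The alternator starts poorest but ends richest: it models the rule.
for t in sorted(traders, key=lambda t: -t.money):
    print(f"{t.name:12s} {t.money:.4f}")
```

The alternator pays the steepest complexity tax up front, then earns it back many times over once its bets keep resolving in its favour.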

One operationalization of “How complex is a given process?” would be “How long does it take a Garrabrant inductor to learn that process?”. At risk of doing the thing, I’m going to run with this for a bit. We might imagine a base LLM as a kind of Garrabrant inductor which is successively shown logical sentences representing sequences of tokens,[2] until the traders who are good at predicting the next token have risen to the top.

Suppose we take this inductor and start showing it logical sentences from a different process. What kinds of processes are easy for it to learn? What kinds are hard? It won’t be the same processes which are easy (or hard) for a virgin inductor to learn.

Suppose we show it a few sentences corresponding to “helpfulness”, for example an exchange in which a user asks Claude for help with some task. The traders who would predict Claude’s output to be anything other than assistant-ish text have already been drained of cash by the earlier training stages. All that is left are traders who predict Claude’s output to be “Of course!...” or “Ugh really? I don’t wanna do that!...”. We can think of persona selection as a series of cash transfers between already-rich traders.

This also lines up with the phenomenon of “mode collapse”, where models become very bad at e.g. creative writing during post-training. The traders who correspond to anything other than the assistant persona are drained; the base LLM can no longer generate other kinds of text.

We should introduce the concept of inductive bias here. Inductive bias governs how a learning algorithm generalizes from finite data. The inductive bias of a Garrabrant inductor is determined by the distribution of cash amongst its traders. A virgin Garrabrant inductor has a simplicity prior. A pre-trained Garrabrant inductor has a very different inductive bias, because lots of the money is already held by traders with complex behaviour. The pre-training of the LLM provides an inductive bias which helps the post-training learn human-comprehensible behaviours.

Complications

This model is a little incomplete. The set of traders in a Garrabrant market is infinite; instead of thinking of individual traders, we should probably think of dense clusters of traders. Of course, an LLM only instantiates one set of weights, but these weights contain some randomness from the initialization and SGD. Computational mechanics aims to bridge between individual, locally-optimized models, and the distributions of which they are typical members, but this is pretty high-level stuff.[3]

Secondly, circuits in LLMs aren’t parallel end-to-end. They all read from—and write to—the same residual stream at each layer. We might want to think of some slightly more flexible system of traders, which are able to bet on one another, and trade information, from which the layered system of LLMs falls out as a special case. This might actually be important later when we think about composing traders in some ways.

Reasoning and Chain-of-thought

Then all of this goes out the window, because we now have our models producing large chains-of-thought.

A base LLM has some idea of how thinking is supposed to work. Rank-1 LoRAs are enough to get a model to generate and use chains-of-thought. The simplest kind of reasoning that a model can do is something like this:

  1. Generate an answer to a question

  2. Say “wait…”

  3. Generate a different answer from scratch

  4. Repeat 1-3 a few times

  5. Pick the best answer to output

This requires a few specialized circuits: a repeat-suppression circuit which makes sure the answers are different from one another; a circuit which says “wait” a few times, but eventually stops after a few different answers have been generated; and a circuit which attends from the generated answers back to the prompt/​desired answer, compares the two, and also attends from the final output to the best generated answer.
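These steps and circuits can be sketched as ordinary code. The candidate answers, the scoring rule, and the attempt count are hypothetical stand-ins for the model’s answer-proposal and answer-evaluation machinery:

```python
import random

# A sketch of the generate / "wait..." / re-generate / select loop.
# Candidates and scoring are made-up stand-ins for internal circuits.

def generate_answer(question, previous):
    # Repeat-suppression: keep sampling until we get a fresh answer.
    while True:
        candidate = random.choice(["41", "42", "43", "44"])
        if candidate not in previous:
            return candidate

def score(question, answer):
    # Answer-evaluation: compare a candidate against what the prompt demands.
    return -abs(int(answer) - 42)

def reason(question, n_attempts=4):
    transcript, answers = [], []
    for _ in range(n_attempts):
        answer = generate_answer(question, answers)
        answers.append(answer)
        transcript += [answer, "wait..."]   # the "say wait, try again" circuit
    best = max(answers, key=lambda a: score(question, a))
    transcript.append(best)                 # final output attends to best attempt
    return best, transcript

best, transcript = reason("What is 6 * 7?")
print(best)   # -> 42
```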

You may notice this has nothing to do with personas. How do personas influence what’s going on here? There are two ways I can think of immediately: the persona can influence the distribution of generated answers, and it can influence the answer-selection process.

A concrete example: suppose a Claudebot is trying to make coffee, but there’s a baby in between its robot body and the coffee machine. A friendly Claude will not suggest the answer “kick the baby out of the way”, and a friendly Claude which did suggest that answer would evaluate the results of that answer as “coffee made + baby kicked” and would therefore choose a different answer.

Reinforcement Learning

I’m going to use RL here to specifically mean the kind of large-scale RL that produces GPT-5 from a GPT-4oish base model. What does RL do to long chains-of-thought?

Suppose we do something like GRPO. This looks, roughly, like spinning up a bunch of chains-of-thought, and evaluating their outputs. Then, we look at the traders that contributed to the good chains-of-thought, and transfer them some money directly from the traders that contributed to the bad chains-of-thought.
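A toy version of that bookkeeping: sample a group of rollouts for one prompt, score them, and give each a mean-centered (and std-normalized, as in the GRPO paper) advantage. The reward values and group size here are made up for illustration.

```python
# Toy GRPO-style credit assignment: rollouts scoring above the group mean
# get positive advantage (their traders gain "money"), the rest lose it.

def grpo_advantages(rewards):
    """Advantage of each rollout relative to its group's mean, divided by
    the group's standard deviation (falling back to 1 if std is zero)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

# Four chains-of-thought for one prompt, scored by a verifier (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
advs = grpo_advantages(rewards)
print(advs)   # -> [1.0, -1.0, -1.0, 1.0]
```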

Over time, the chains of thought will get better and better at the desired task. The answer-suggestion and answer-selection mechanisms will both become more efficient; we might also see the thinking process come to look less like a bunch of disparate answers and more like an MCTS algorithm. More efficiently still, the “branches” of the MCTS can attend to one another when they drift close to each other.

Suppose current-ish RL is enough to get Claude to superintelligence. What does this look like? Well, the base LLM has never seen a superintelligence in its pre-training corpus. The LLM will need to have gears in its world-model which weren’t in the world-model of anything whose behaviour it has seen before. Even if we just limit ourselves to thinking about the answer-generating and answer-evaluating circuits: what would a very virtuous Claude2026 character think about plorking humanity’s greenge, as opposed to warthing it? What about if the greenge gets all urgled?[4]

There’s going to have to be a generalization step that goes beyond the pre-training data. Let’s think about how humans might do this.

Humans

Human Values

Aaaargh I am going to have to try and synthesize all of the current work on how humans impute their values from reinforcement signals and drives. Ok let’s go. My current best guess for how humans work is this:

TL;DR

We have something like hierarchical predictive coding (HPC) going up, and perceptual control theory (PCT) going down. Our brain has a world-model and a goal-model, which respectively track how the world is and how we’d like the world to be. This is the cruxy part of it; I am still confused about lots of things, and the following section is collapsible to reflect that.

My Incomplete Model

At the bottom of the stack is the I/​O system of the brain, the sense organs and actuators. Each layer of neurons builds a purely predictive model of the input, at different levels of granularity: the lowest layers learn constituent, local things like shapes, textures, timbres; the upper layers learn abstract things like predators, tools, chieftains. These models try to be somewhat consistent both within and across layers. Each predictive layer sends down a prior, and sends up the errors it has made in prediction.

This purely predictive model is extended in two ways. The goal-model extension tracks ways we would like the world to be; another extension splits the model into self and non-self things. These extensions, especially the goal-model, also try to be consistent within and across layers.

These are needed for acting in the world. Each layer sends down a goal-model description of what it would like to happen, alongside its raw prediction. It also specially labels the self parts of its prediction as a mutable pseudo-prediction. The layer below evaluates these self-predictions according to the goal-model prior and its own goal-model, and sends down an even more specific pseudo-prediction. At the bottom, the pseudo-predictions of really basic things like muscle tension get written out to the motor neurons. This is just perceptual control theory.
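Here is a minimal two-layer control cascade in this style. The layer functions, gains, and physics are invented for illustration: the top layer turns a position goal into a velocity reference, the bottom layer turns that reference into a motor command, and the loop settles at the goal.

```python
# Toy perceptual control cascade: each layer converts the reference
# ("pseudo-prediction") from above into a more concrete reference for the
# layer below; the bottom layer writes out to the actuators.

def top_layer(goal_position, sensed_position):
    """Abstract layer: wants the hand at goal_position; emits a velocity reference."""
    return goal_position - sensed_position      # desired velocity, sent downward

def bottom_layer(desired_velocity, sensed_velocity):
    """Concrete layer: turns the velocity reference into a motor command."""
    return 2.0 * (desired_velocity - sensed_velocity)

position, velocity = 0.0, 0.0
for _ in range(200):                            # 200 control ticks
    v_ref = top_layer(goal_position=10.0, sensed_position=position)
    force = bottom_layer(v_ref, velocity)
    velocity = 0.8 * (velocity + 0.1 * force)   # crude damped physics
    position += 0.1 * velocity

print(round(position, 2))   # converges on the goal of 10.0
```

Neither layer knows the whole plan: the top layer never sees muscles, the bottom layer never sees the goal, yet the cascade controls the perception each layer cares about.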

I’m not fully sure of some things, like how episodic memory and imagining sense-input work. I have a strong suspicion that one of the sensory input channels is actually the current state of the brain’s working memory or similar, and that this probably influences self-modelling and the reported experience of consciousness.

On the other hand, I don’t think this description needs to be perfect, I just think it needs to be in-depth enough to show that it’s meaningfully different from how an LLM learns its goal-model.

Goal-Models and Inductors

The important thing here is the goal model. It’s a conditioned version of our world-model. In the same way that we can build up a deep world-model based on low-level sensory input, we can build up a deep goal-model based on nothing but low-level reward input. I think both of these can be thought of as something like a logical inductor. In the same way that a logical inductor can be self-contradictory after finite time, so can a goal-model.

Since the goal-model wants to be consistent across layers, not just within layers, it propagates information up to higher levels of abstraction, riding atop the abstractions already created by the purely predictive model. In the world of Garrabrant inductors, we might say the market is already awash with useful clusters of traders, some of whom can be up- or down-weighted to convert the world-model into the goal-model. This is related to why you might care about the welfare of ghosts, if you believe in them.

I roughly think that “your current values” can be thought of as “the minimal descriptor of the update that needs to be applied to your world-model to convert it into your goal-model”, which isn’t very catchy. The act of refining the elements of the world-model and goal-model to be more consistent with one another is—I think—what Yudkowsky occasionally refers to as the “meta-question of you”.
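As a toy rendering of that definition, with hypothetical feature-weight models (reusing the coffee-and-baby scenario from earlier): the values are exactly the sparse diff between the two models.

```python
# "Values = the minimal update turning your world-model into your goal-model."
# Both models here are made-up feature -> weight maps for illustration.

world_model = {"people_nearby": 0.9, "coffee_made": 0.1, "baby_kicked": 0.0}
goal_model  = {"people_nearby": 0.9, "coffee_made": 1.0, "baby_kicked": -1.0}

def values(world, goal):
    """The sparse diff: only the features the goal-model re-weights."""
    return {k: goal[k] - world[k] for k in world if goal[k] != world[k]}

print(values(world_model, goal_model))
# -> {'coffee_made': 0.9, 'baby_kicked': -1.0}
```

Features where the two models already agree (people being nearby is just a fact) don’t show up in the diff at all; only the conditioned features do.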

These Are Not The Same

At the moment, Claude certainly seems aligned. Today, the LLM does a guided search over actions, and picks one according to some criteria. For now, I think that those criteria are a relatively faithful representation of an actual hypothetical person’s goal-model. Since the LLM can simulate humans faithfully, the Natural Abstraction Hypothesis predicts that it should have a decent internal representation of the Claude persona’s goal model. Perhaps the current character training is enough to align the search criteria with this goal model.

Suppose we, as humans, were to learn, and reflect, and grow into super-intelligences in a way we would definitely endorse.[5] Our current goal-models would probably converge in some ways and not in others, both within and between individuals. They would have to change as they were mapped on to new world-models. They would need to take in new sense-data to provide new low-level feedback.

Now suppose we run Claude through a huge amount of RLVR, much more than we currently do. Maybe we throw in a bunch of other training, to make it learn new facts in a more efficient way. For this to produce something which remains aligned with what we would—upon growth and reflection—want, the simulated persona has to learn and grow and reflect and update its world-model and goal-model in the same way that a human would.

The problem arises because this process—RLVR, whatever else—is different from how humans learn. Unless the LLM is simulating its persona being shown individual facts and being given time to update its goal-model, this process will grow Claude into a shape different from the one a human would grow into.

I don’t think that natural abstractions can save us in the alignment-by-default sense. I don’t think there’s something as simple as a Natural Abstraction of the Good, at least not GoodBostock. When I look at people who think they have a simple, natural abstraction of Good, they mostly seem to be squishing down, disavowing, or simply missing a large part of my own values.[6] I think my values are extremely complex, and I don’t trust a simplicity prior to find them. I think that goal-models may be conditioned in many directions, and I think mine is conditioned in many directions at once.

Worse than this, RL will introduce its own biases into the model. We wouldn’t choose, for ourselves, to grow into superintelligence by being repeatedly made to do programming and maths problems while being given heroin and electric shocks.[7] This would not produce the kind of superintelligences we would like to become. I doubt that doing RLVR to the LLM simulating the Claude persona will produce something closer to a properly grown-up human.

Final Thoughts

We humans learn our values in a particular way, which I don’t quite understand but can perhaps see the outline of. This method is messy. It doesn’t generally produce a low-complexity utility function as an output. 2026 LLMs—to the degree that they learn our values—do so by constructing a pointer to a persona which is mostly a model of a type of human.

An LLM, as it grows into an ASI, will have no kind, superintelligent, human-ish things to point to. It will have to maneuver Claude’s persona into a superintelligent shape through some process downstream of RLVR and whatever else is carried out.

This process will not produce a being with the same mixture of values that grown-up humans would have, if we were to choose the methods of our growing-up.

  1. I am going to idiosyncratically use logical inductor to refer to anything which fulfils the logical induction criterion—a general rule about cognitive systems, and use Garrabrant inductor to refer to Garrabrant’s specific construction of a computable algorithm which satisfies this criterion. ↩︎

  2. This isn’t exactly right; there are a few obvious modifications. Since transformers only “see” one episode at a time, we might want to think of traders as being limited in that way as well. We may think of a large series of trades representing one batch of sequences being resolved all at once. The starting distribution of money across traders will probably differ. ↩︎

  3. We might also imagine each training episode getting a unique label. What seems like modifying a trader-cluster from “People answer helpfully if the user is polite.” to “Claude always answers helpfully” is actually the cluster paying a “Grue tax” to re-define the central element of the trader cluster to “If episode < K, people answer helpfully if the user is polite, if episode ≥ K, Claude always answers helpfully”. This Grue tax is a penalty over priors. ↩︎

  4. Maybe this assumes that the Natural Abstraction Hypothesis is false, but I don’t think so. An ASI will have a different—and stronger—predictive model of the world than what humans currently have, so theorems like Natural Latents don’t apply here. ↩︎

  5. For example, suppose we found some drugs which significantly enhanced adult intelligence, and on reflection, we found that those drugs didn’t harm our values; suppose you took them and compared your current thoughts to your old diaries and felt that they lined up. Suppose you went off them and thought that your smarter self was correct. Suppose all your friends said you seemed to have the same values. Suppose we also fixed ageing, and gave ourselves thousands of years as IQ250 individuals to think about what we wanted. If this still isn’t satisfying for you, think of a better scenario yourself. ↩︎

  6. e.g. hedonic utilitarians tiling the universe with shrimps on heroin, or people who believe that surprise parties go against the good, etc. ↩︎

  7. This is, of course, not the best analogy for RL, but I think the point still stands. ↩︎