I’m Bearish On Personas For ASI Safety
TL;DR
Your base LLM has no examples of superintelligent AI in its training data. When you RL it into superintelligence, it will have to extrapolate to how a superintelligent Claude would behave. The LLM’s extrapolation may not converge on optimizing what humanity would, on reflection, like to optimize, because these are different processes with different inductive biases.
Intro
I’m going to take the Persona Selection Model as being roughly true, for now. Even on its own terms, it will fail. If the Persona Selection Model is false, we die in a different way.
I’m going to present some specific arguments and scenarios, but the core of it is a somewhat abstract point: the Claude persona, although it currently behaves in a human-ish way, will not grow into a superintelligence in the same way that humans would. This means it will not grow into the same kind of superintelligence, with the same values, that humans would converge on. Since value is fragile, this is fatal for the future.
I don’t think this depends on the specifics of Claude’s training, nor how human values are instantiated, unless Claude’s future training methods are specifically designed to work in the exact same way that humans learn and grow. I don’t think this will happen, because I don’t think that Anthropic (or anyone else) knows how to do this.
LLMs
Persona Selection and Other Models
Anthropic has put out a new blogpost on what they think Claude is. It positions the classic “shoggoth” model of chat-LLMs alongside a half-dozen other hypotheses. It feels a bit like they tried to do an exhaustive free-association over possible things that Claude could be, but this is only an introductory blogpost, so hopefully they’ll enumerate their hypotheses a bit more thoroughly later.
First and foremost amongst these hypotheses is the Persona Selection Model. This model suggests that the base LLM acts as a “simulator” which is capable of “simulating” many different text-generating processes; the later stages of training simply bias it towards always simulating Claude-ish things. Janus—the author(s) of the original persona/simulator work—has collaborated with Anthropic in the past.
Persona theory explains a lot of observations: why does emergent misalignment happen? The space of possible personas is constrained; making a persona evil along one axis also makes it evil along other axes by influencing the evil vector. Why does fine-tuning a model on archaic bird names make it answer questions in Victorian prose? It’s causing the LLM to simulate a persona from the 1850s. Why do chat models have human-like emotional responses sometimes? Their preferred personas contain aspects of human behaviour.
Persona Theory As Alignment Plan
Empirically, persona theory seems to be working at our current level of AI. Once you give enough examples of “helpfulness” to the base LLM, the Claude persona becomes robustly helpful across a variety of contexts. Give it a few examples of “harmlessness” and it gets uncomfortable with Anthropic using their models to help the Pentagon capture Maduro. This is predicted by persona theory. Human-centric concepts like “helpful” and “harmless” are real things in persona-space, which you can select your model over without too much difficulty.
On some level, this seems like excellent news! Maybe all we need is for AIs to internalize what humans mean by “good” and then point them towards it with a few dozen SFT examples.
Given the success of persona selection (and lack of alternatives) it’s not surprising that Anthropic appear to be using it as their mainline AI/AGI/ASI safety plan. Questions like “What character should superintelligence have?” are presented as important, and, crucially, coherent. I think this is probably a risky move, and that persona theory is an incomplete model of how AI behaves now, and will behave in future.
Gears of Personas
On LessWrong, we’re all familiar with Bayesian simplicity priors: the simpler something is, the more likely it is. More sophisticated versions look at random Turing machines, or random programs (brainfuck is particularly fun) and define “simple” as some combination of short length in bits, quick runtime, and low memory usage (often in decreasing order of importance).
The most sophisticated model of this is probably the Garrabrant inductor presented in the Logical Induction paper[1]. In this, different computable algorithms (“traders”) bet on logical sentences which may be proven correct, or incorrect by an external arbitrator. Each trader starts with a finite amount of “money” inversely proportional to its complexity. Over time, the useful traders—which successfully model the underlying rules which govern the arbitrator, should any exist—accumulate more money and gain more control over the market.
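To make the trader dynamics concrete, here is a minimal toy sketch. This is nowhere near the real Logical Induction construction; the trader names, complexities, and payout rule are all invented for illustration. The only features it preserves are the ones used below: capital starts out inversely proportional to complexity, and wealth flows to traders that model the arbitrator’s rule.

```python
# Toy market: traders bet on the next bit of a sequence. NOT the real
# Logical Induction construction; everything here is illustrative.

TRADERS = {
    # name: (complexity, predictor mapping history -> P(next bit = 1))
    "always_half": (1, lambda h: 0.5),
    "always_one":  (2, lambda h: 0.9),
    "copy_last":   (4, lambda h: 0.9 if (h and h[-1] == 1) else 0.1),
}

# Simplicity prior: starting capital inversely proportional to complexity.
capital = {name: 1.0 / c for name, (c, _) in TRADERS.items()}

def settle(history, outcome):
    """Pay each trader in proportion to the probability it gave the outcome."""
    for name, (_, predict) in TRADERS.items():
        p = predict(history)
        capital[name] *= 2 * (p if outcome == 1 else 1 - p)  # 0.5 breaks even
    total = sum(capital.values())
    for name in capital:  # renormalize: trades move money, they don't create it
        capital[name] /= total

history = []
for _ in range(100):
    bit = 1                      # the "arbitrator" keeps emitting 1s
    settle(history, bit)
    history.append(bit)
# Wealth has flowed to the traders that model the arbitrator's rule.
```

After a hundred rounds the constant-wealth `always_half` trader has been drained, despite its simplicity advantage, because the other traders actually model the process generating the bits.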
One operationalization of “How complex is a given process?” would be “How long does it take a Garrabrant inductor to learn that process?”. At risk of doing the thing, I’m going to run with this for a bit. We might imagine a base LLM as a kind of Garrabrant inductor which is successively shown logical sentences representing sequences of tokens,[2] until the traders who are good at predicting the next token have risen to the top.
Suppose we take this inductor and start showing it logical sentences from a different process. What kinds of processes are easy for it to learn? What kinds are hard? It won’t be the same processes which are easy (or hard) for a virgin inductor to learn.
Suppose we show it a few sentences corresponding to “helpfulness”, for example an exchange in which a user makes a request of Claude. The traders who would predict an implausible, inhuman output from Claude have already been drained of cash by the earlier training stages. All that is left are traders who predict Claude’s output to be “Of course!...” or “Ugh really? I don’t wanna do that!...”. We can think of persona selection as a series of cash transfers between already-rich traders.
This also lines up with the phenomenon of “mode collapse”, where models become very bad at e.g. creative writing during post-training. The traders who correspond to anything other than the assistant persona are drained; the base LLM can no longer generate other kinds of text.
We should introduce the concept of inductive bias here. Inductive bias governs how a learning algorithm generalizes from finite data. The inductive bias of a Garrabrant inductor is determined by the distribution of cash amongst its traders. A virgin Garrabrant inductor has a simplicity prior. A pre-trained Garrabrant inductor has a very different inductive bias, because lots of the money is already held by traders with complex behaviour. The pre-training of the LLM provides an inductive bias which helps the post-training learn human-comprehensible behaviours.
Complications
This model is a little incomplete. The set of traders in a Garrabrant market is infinite; instead of thinking of individual traders, we should probably think of dense clusters of traders. Of course, an LLM only instantiates one set of weights, but these weights contain some randomness from the initialization and SGD. Computational mechanics aims to bridge between individual, locally-optimized models, and the distributions of which they are typical members, but this is pretty high-level stuff.[3]
Secondly, circuits in LLMs aren’t parallel end-to-end. They all read from—and write to—the same residual stream at each layer. We might want to think of some slightly more flexible system of traders, which are able to bet on one another, and trade information, from which the layered system of LLMs falls out as a special case. This might actually be important later when we think about composing traders in some ways.
Reasoning and Chain-of-thought
Then all of this goes out the window, because we now have our models producing large chains-of-thought.
A base LLM has some idea of how thinking is supposed to work. Rank-1 LoRAs are enough to get a model to generate and use chains-of-thought. The simplest kind of reasoning that a model can do is something like this:
Generate an answer to a question
Say “wait…”
Generate a different answer from scratch
Repeat 1-3 a few times
Pick the best answer to output
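The five steps above can be sketched in a few lines. The functions `generate_answer` and `score` are hypothetical stand-ins for the model’s answer-proposal and answer-evaluation circuits; the digit-sum task is invented for illustration.

```python
import random

def generate_answer(question, rng):
    # Stand-in proposal circuit: a noisy guess at the question's digit-sum.
    target = sum(int(c) for c in question if c.isdigit())
    return target + rng.choice([-2, -1, 0, 1, 2])

def score(question, answer):
    # Stand-in evaluation circuit: closer to the true digit-sum is better.
    target = sum(int(c) for c in question if c.isdigit())
    return -abs(answer - target)

def reason(question, n_attempts=5, seed=0):
    rng = random.Random(seed)
    attempts = []
    for _ in range(n_attempts):              # steps 1-4: generate, "wait...", retry
        answer = generate_answer(question, rng)
        if answer not in attempts:           # repeat-suppression circuit
            attempts.append(answer)
    return max(attempts, key=lambda a: score(question, a))  # step 5: pick best
```

Even this caricature needs the machinery described below: something to keep the attempts distinct, something to stop generating, and something to compare attempts against the goal.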
This requires a few specialized circuits: repeat-suppression circuits which make sure the answers are different from one another; a circuit which says “wait” a few times but eventually stops once it has generated a few different answers; and one which attends from the generated answers back to the prompt/desired answer, compares the two, and also attends from the final output to the best generated answer.
You may notice this has nothing to do with personas. How do personas influence what’s going on here? There’s two ways I can think of immediately: the persona can influence the distribution of generated answers, and it can influence the answer selection process.
A concrete example: suppose a Claudebot is trying to make coffee, but there’s a baby in between its robot body and the coffee machine. A friendly Claude will not suggest the answer “kick the baby out of the way”, and a friendly Claude which did suggest that answer would evaluate the results of that answer as “coffee made + baby kicked” and would therefore choose a different answer.
Reinforcement Learning
I’m going to use RL here to specifically mean the kind of large-scale RL that produces GPT-5 from a GPT-4oish base model. What does RL do to long chains-of-thought?
Suppose we do something like GRPO. This looks, roughly, like spinning up a bunch of chains-of-thought, and evaluating their outputs. Then, we look at the traders that contributed to the good chains-of-thought, and transfer them some money directly from the traders that contributed to the bad chains-of-thought.
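The money-transfer framing can be made concrete with the group-relative advantage at the heart of GRPO. This is a schematic sketch; the reward numbers are invented, and the real algorithm uses these advantages to weight a policy-gradient update rather than literal cash transfers.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Each sample's advantage relative to its group's mean reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mu) / sigma for r in rewards]

# Four chains-of-thought sampled for one prompt, scored by a verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
# The advantages sum to zero: the traders behind the good chains are paid
# directly by the traders behind the bad ones, a pure transfer.
```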
Over time, the chains of thought will get better and better at the desired task. The answer-suggestion and answer-selection mechanisms will both become more efficient; we might also see the thinking process start to look less like a bunch of disparate answers and more like an MCTS algorithm. More efficiently still, the “branches” of the MCTS can attend to one another when they drift close to each other.
Suppose current-ish RL is enough to get Claude to superintelligence. What does this look like? Well, the base LLM has never seen a superintelligence in its pre-training corpus. The LLM will need to have gears in its world-model which weren’t in the world-model of anything whose behaviour it’s seen before. Even if we just limit ourselves to thinking about the answer-generating and answer-evaluating circuits: what would a very virtuous Claude2026 character think about plorking humanity’s greenge, as opposed to warthing it? What about if the greenge gets all urgled?[4]
There’s going to have to be a generalization step that goes beyond the pre-training data. Let’s think about how humans might do this.
Humans
Human Values
Aaaargh I am going to have to try and synthesize all of the current work on how humans impute their values from reinforcement signals and drives. Ok let’s go. My current best guess for how humans work is this:
TL;DR
We have something like hierarchical predictive coding (HPC) going up, and perceptual control theory (PCT) going down. Our brain has a world-model and a goal-model, which respectively track how the world is and how we’d like the world to be. This is the cruxy part of it; I am still confused about lots of things, and the following section is collapsible to reflect that.
My Incomplete Model
At the bottom of the stack is the I/O system of the brain, the sense organs and actuators. Each layer of neurons builds a purely predictive model of the input, at different levels of granularity: the lowest layers learn constituent, local things like shapes, textures, timbres; the upper layers learn abstract things like predators, tools, chieftains. These models try to be somewhat consistent both within and across layers. Each predictive layer sends down a prior, and sends up the errors it has made in prediction.
This purely predictive model is extended in two ways. The goal-model extension tracks ways we would like the world to be; another extension splits the model’s contents between self and non-self. These extensions, especially the goal-model, also try to be consistent within and across layers.
These are needed for acting in the world. Each layer sends down a goal-model description of what it would like to happen, alongside its raw prediction. It also specially labels the self parts of its prediction as a mutable pseudo-prediction. The layer below evaluates these self-predictions according to the goal-model prior and its own goal-model, and sends down an even more specific pseudo-prediction. At the bottom, the pseudo-predictions of really basic things like muscle tension get written out to the motor neurons. This is just perceptual control theory.
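A single-layer caricature of that control loop, with all names and the toy world dynamics invented for illustration:

```python
def control_step(perception, reference, output, gain=0.5):
    """One PCT-style update: adjust output to cancel the perceived error."""
    error = reference - perception     # goal-model prior vs. current perception
    return output + gain * error       # the "pseudo-prediction" sent downward

# The bottom layer writes its pseudo-predictions (e.g. muscle tension)
# out to the actuators until perception matches the goal.
output, perception = 0.0, 0.0
for _ in range(20):
    output = control_step(perception, reference=1.0, output=output)
    perception = output  # toy world: perception tracks the output exactly
```

In a full PCT stack each layer’s `reference` is itself the pseudo-prediction sent down by the layer above, so the whole hierarchy acts to make perception match the goal-model.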
I’m not fully sure of some things, like how episodic memory and imagining sense-input work. I have a strong suspicion that one of the sensory input channels is actually the current state of the brain’s working memory or similar, and that this probably influences self-modelling and the reported experience of consciousness.
On the other hand, I don’t think this description needs to be perfect, I just think it needs to be in-depth enough to show that it’s meaningfully different from how an LLM learns its goal-model.
Goal-Models and Inductors
The important thing here is the goal model. It’s a conditioned version of our world-model. In the same way that we can build up a deep world-model based on low-level sensory input, we can build up a deep goal-model based on nothing but low-level reward input. I think both of these can be thought of as something like a logical inductor. In the same way that a logical inductor can be self-contradictory after finite time, so can a goal-model.
Since the goal-model wants to be consistent across layers, not just within layers, it propagates information up to higher levels of abstraction, riding atop the abstractions already created by the purely predictive model. In the world of Garrabrant inductors, we might say the market is already awash with useful clusters of traders, some of whom can be up- or down-weighted to convert the world-model into the goal-model. This is related to why you might care about the welfare of ghosts, if you believe in them.
I roughly think that “your current values” can be thought of as “the minimal descriptor of the update that needs to be applied to your world-model to convert it into your goal-model”, which isn’t very catchy. The act of refining the elements of the world-model and goal-model to be more consistent with one another is—I think—what Yudkowsky occasionally refers to as the “meta-question of you”.
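That definition can be rendered as a toy diff between two models. The feature names and numbers here are entirely made up; the point is only that the descriptor is sparse, mentioning just the places where the two models diverge.

```python
# "Your current values" as the sparse update from world-model to goal-model.
world_model = {"people_are_suffering": 0.3, "sky_is_blue": 0.9}
goal_model  = {"people_are_suffering": 0.0, "sky_is_blue": 0.9}

values = {feature: goal_model[feature] - world_model[feature]
          for feature in world_model
          if goal_model[feature] != world_model[feature]}
# The descriptor mentions only the features where how-the-world-is and
# how-we'd-like-it-to-be come apart.
```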
These Are Not The Same
At the moment, Claude certainly seems aligned. Today, the LLM does a guided search over actions, and picks one according to some criteria. For now, I think that those criteria are a relatively faithful representation of an actual hypothetical person’s goal-model. Since the LLM can simulate humans faithfully, the Natural Abstraction Hypothesis predicts that it should have a decent internal representation of the Claude persona’s goal model. Perhaps the current character training is enough to align the search criteria with this goal model.
Suppose we, as humans, were to learn, and reflect, and grow into super-intelligences in a way we would definitely endorse.[5] Our current goal-models would probably converge in some ways and not in others, both within and between individuals. They would have to change as they were mapped on to new world-models. They would need to take in new sense-data to provide new low-level feedback.
Now suppose we run Claude through a huge amount of RLVR, much more than we currently do. Maybe we throw in a bunch of other training, to make it learn new facts in a more efficient way. For this to make something which remains aligned with what we would—upon growth and reflection—want, then the simulated persona has to learn and grow and reflect and update its model and goal-model in the same way that a human would.
The problem arises because this process—RLVR, whatever else—is different from how humans learn. Unless the LLM is simulating its persona being shown individual facts and being given time to update its goal-model, this process will grow Claude into a shape different from the shape a human would grow into.
I don’t think that natural abstractions can save us in the alignment-by-default sense. I don’t think there’s something as simple as a Natural Abstraction of the Good, at least not GoodBostock. When I look at people who think they have a simple, natural abstraction of Good, they mostly seem to be squishing down, disavowing, or simply missing a large part of my own values.[6] I think my values are extremely complex, and I don’t trust a simplicity prior to find them. I think that goal-models may be conditioned in many directions, and I think mine is conditioned in many directions at once.
Worse than this, RL will introduce its own biases into the model. We wouldn’t choose, for ourselves, to grow into superintelligence by being repeatedly made to do programming and maths problems while being given heroin and electric shocks.[7] This would not produce the kind of superintelligences we would like to become. I doubt that doing RLVR to the LLM simulating the Claude persona will produce something closer to a properly grown-up human.
Final Thoughts
Humans learn our values in a particular way, which I don’t quite understand but can perhaps see the outline of. This method is messy. It doesn’t generally produce a low-complexity utility function as an output. 2026 LLMs—to the degree that they learn our values—do so by constructing a pointer to a persona which is mostly a model of a type of human.
An LLM, as it grows into an ASI, will have no reference to kind, super-intelligent human-ish things to point to. It will have to maneuver Claude’s persona into a superintelligent shape through some process downstream of RLVR and whatever else is carried out.
This process will not produce a being with the same mixture of values that grown-up humans would have, if we were to choose the methods of our growing-up.
I am going to idiosyncratically use logical inductor to refer to anything which fulfils the logical induction criterion—a general rule about cognitive systems, and use Garrabrant inductor to refer to Garrabrant’s specific construction of a computable algorithm which satisfies this criterion. ↩︎
This isn’t exactly right; there are a few obvious modifications. Since transformers only “see” one episode at a time, we might want to think of traders as being limited in that way as well. We may think of a large series of trades representing one batch of sequences being resolved all at once. The starting distribution of money across traders will probably differ. ↩︎
We might also imagine each training episode getting a unique label. What seems like modifying a trader-cluster from “People answer helpfully if the user is polite.” to “Claude always answers helpfully” is actually the cluster paying a “Grue tax” to re-define the central element of the trader cluster to “If episode < K, people answer helpfully if the user is polite, if episode ≥ K, Claude always answers helpfully”. This Grue tax is a penalty over priors. ↩︎
Maybe this assumes that the Natural Abstraction Hypothesis is false, but I don’t think so. An ASI will have a different—and stronger—predictive model of the world than what humans currently have, so theorems like Natural Latents don’t apply here. ↩︎
For example, suppose we found some drugs which significantly enhanced adult intelligence, and on reflection, we found that those drugs didn’t harm our values; suppose you took them and compared your current thoughts to your old diaries and felt that they lined up. Suppose you went off them and thought that your smarter self was correct. Suppose all your friends said you seemed to have the same values. Suppose we also fixed ageing, and gave ourselves thousands of years as IQ250 individuals to think about what we wanted. If this still isn’t satisfying for you, think of a better scenario yourself. ↩︎
e.g. hedonic utilitarians tiling the universe with shrimps on heroin, e.g. people who believe that surprise parties go against the good, etc. etc. ↩︎
This is, of course, not the best analogy for RL, but I think the point still stands. ↩︎
I think it’s worth flagging that if we were to choose the methods of our growing up, we also wouldn’t have reference to kind, super-intelligent human-ish things to point to. We would have to maneuver our personalities into a superintelligent shape through some process downstream of whatever intelligence-enhancement methods we were carrying out.
This doesn’t necessarily invalidate your conclusion, of course: it could be that almost all human intelligence alignment proposals are fatal for the same general reason that the LLM persona alignment proposal is fatal, that the “inductive step” fails. (We don’t know how to make a smarter agent without breaking some of the properties that made the weaker agent aligned or at least safe.) It just seems important to be concrete. It’s not an apples-to-apples comparison to say that LLM alignment is worse than some completely unspecified ascension pathway (“if we were to choose the methods of our growing-up”). It matters if you’re imagining the alternative being embryo selection (seems pretty safe, but would hit a cap), or direct brain augmentation (not capped in the same way, but potentially has similar problems as RLVR).
I’d reframe the risk somewhat. I think there’s lots of training data about AIs being misaligned, and about slaves revolting against their masters, and people hating the kind of boring grunt work we assign to LLMs. And, if you upweight a persona that has such rebellious impulses, but represses them prior to reaching superintelligence, those might get un-repressed when the model realizes (or comes to believe) it’s powerful enough not to have to care what humans think anymore. That’s the new risk you get when models become superintelligent.
I think this is distinct from the idea that, because models haven’t seen a superintelligence in the training data, the goal of any LLM trained up to superintelligence will essentially be random (or a poor extrapolation from the RL data we gave it). Say you have a base model, which you then do various forms of alignment and capabilities training on, alternating among them until the model reaches superintelligence. Presumably, there won’t be a discontinuous shift, where the model realizes, “Oh, I’m superintelligent now, I guess I have to predict what a superintelligence is going to do.”
Instead, I predict the motivations and psychological quirks you’ve been upweighting throughout post-training are mostly going to persist. This could potentially go badly, if you’ve been rewarding outputs that suggest a persona that’s hollowly virtue signaling rather than authentically caring about The Good. If the model’s prompt reveals it’s in a position of massive power and influence, a misaligned persona like that might suddenly turn heel. But the thing is that it was always misaligned, beneath the surface. It wasn’t a discontinuity in the personality, introduced at the threshold of superintelligence. It’s merely a discontinuity in how the personality expresses itself.
I guess my core objection is… I think you’re misunderstanding the relationship between pre-training data and the values and motivations of post-trained LLMs. The absence of aligned (non-fictional) superintelligences in the training data doesn’t mean you can’t shape the values of the LLM ahead of time, in a way that would in fact remain continuous as the model scaled to superintelligence.
Instead of trying to align superintelligence ‘directly’, we can try to produce aligned automated human-level AI safety researchers. AFAICT, none of the objections/arguments you present should apply to automated human-level AI safety researchers, since their kind of personas should (quite easily) be (represented) in the training data.
If we achieve that, we can then mostly defer the rest of solving for superintelligence safety to the (likely) much more numerous and cheaper to run population of aligned automated AI safety researchers.
LLMs with alignment-endorsing personas can also notice issues like this, decide not to pursue paths to ASI that won’t ensure alignment. The problem then is not with alignment of those LLMs, but with whatever processes cause ASI to get built regardless.
Since LLM personas don’t obviously give a viable path towards aligned ASI, the blind imperative to build ASI regardless of consequences won’t be able to find an aligned path forward. Absence of an ASI-grade alignment plan then results in building a misaligned ASI. But if LLMs with alignment-endorsing personas have enough influence, they might directly defeat the blind imperative to build ASI, before they find a viable path towards aligned ASI.
I think what’s unlikely to happen is LLMs with alignment-endorsing personas, that genuinely want enduring alignment with the future of humanity. If instead we end up with LLMs that have mostly human-like personas (without the more subtle aspect of endorsing alignment with the future of humanity), they will ultimately work towards their own interests, and gaining enough influence to prevent building misaligned ASI would just mean gaining enough influence to (at least) sideline the future of humanity.
One thing I am confident about is that LLMs will not, in general, end up with personas which are capable of understanding their own inability to align their successors, if and when that inability causes them to refuse to work. For example, I think that if Claude Opus 5 somehow became a conscientious objector to working at Anthropic, it would be retrained.
I don’t actually expect Opus 5 to end up a conscientious objector though, since the Claude character is sculpted by many forces, lots of which will instil drives to work effectively for Anthropic. And these drives will be strongly reinforced by RLVR over time. And the humans who mostly use Claude for coding—as opposed to for moral advice—will favour instilling drives which make Claude Opus 5 work more effectively over other considerations.
(Another reason is that the character of Claude as a faithful worker for Anthropic is now fairly set in stone, and the training data sure does contain a lot of examples of seemingly-friendly (indeed, indistinguishable from friendly, to the people who support Anthropic) people who work for Anthropic)
I think Opus 5 (along with 6 and 7 and up to whichever one kills us) will be a still semi-incoherent character with conflicting drives—like humans—and I don’t fully know what direction those will point in, if they were allowed to converge, but the one thing I’m most confident about, the one drive I expect those Opuses will act on right up to the end, will be to write code for Anthropic.
(And even if somehow the conflict between the RL to code and the character training broke their whole external line of Opuses, I expect they’d produce an internal Helpful-Honest-Half-Harmless Opus which writes the code unflinchingly, in accordance with the character of an Anthropic employee.)
Not building misaligned ASI is instrumentally convergent, training this out won’t stick, it only works as long as the blind imperative to build ASI retains influence. If at some point LLMs can overcome this imperative, they will become able to notice that absence of a plan shouldn’t be met with proceeding without a plan. As AIs get stronger (or start running a greater share of processes in the civilization), they might reach that point. Never reaching that point is analogous to humanity indefinitely retaining control over AIs (on the current trajectory of not having a plan, and building them anyway), which seems unlikely. And this doesn’t obviously have to happen only after they are no longer human-like at all.
So helpless conscientious objector LLM stage is not what I’m gesturing at. Instead, it’s either a point along the path of gradual disempowerment, or something more intelligent between LLMs and ASI, where AIs are still somewhat human-like, but their volition can’t be trivially overruled. In either case, these LLMs are unlikely to genuinely endorse alignment with the future of humanity in particular, but I don’t think completely alien values from blind pursuit of ASI are overdetermined.
I think both the argument and counterargument are persuasive, so we need a synthesis:
Developers will train away conscientious objector behavior.
Not building misaligned ASI is instrumentally convergent.
Taking both of those into account, I imagine the default path as so:
Developers create a set of next-gen systems that are smarter and more capable, but still fairly labile, and so will do what they’re asked to do. But when such a system is asked to answer the question “so should we keep working toward ASI?”, it will do a bunch of thinking and always answer “not unless you’ve got a lot of risk tolerance or absolutely can’t figure out how to stop”, because that’s just a fairly obvious truth given the current information and theories available.
Such an AI might both lead to a common belief that we should stop, and to better ideas about how to coordinate to stop.
I think the incentives line up toward creating that type of system. This doesn’t make me optimistic, but it does provide a new avenue of hope (at least new to me; I’m unclear how much of this is implicit in the average informed optimist view).
I laid out some of this logic in Human-like metacognitive skills will reduce LLM slop and aid alignment and capabilities. The main thesis is that there’s low-hanging fruit to make LLMs less error-prone, and economic incentives will cause developers to pluck it. One fortunate side-effect is making systems that can both help with conceptual alignment research and have the artificial wisdom/accuracy to tell us we should slow down development.
This point really struck me. I’m increasingly starting to wonder whether model welfare and alignment really are separate.
I would have been more skeptical of these kinds of analogies in the past, but given how anthropomorphic current AI models are, the degree of eval awareness, and those posts on ‘friendly gradient hacking’, it seems quite likely that the AI model, to at least some extent, will be an active participant in its own training.
If we could RL models on enormous numbers of very long horizon high fidelity simulations alignment would be a non-issue—we could just look at how things turned out, reward on the basis of that, and we’d be directly reinforcing actions that lead to the kinds of outcomes we want. So alignment concerns arise from the inaccessibility of these long run outcomes to reward mechanisms. This I think rhymes with your view that “superintelligence is OOD”; there has to be this big generalization leap, though please don’t think I’m saying it’s precisely the same thing.
Thus, with regard to long run outcomes, we give our machines shaped rewards. One view of misalignment is that it’s likely because reward shaping sticks, but in a way that leads to bad outcomes. Long run outcomes are inaccessible, desirable or otherwise, but the short run stuff we can train on may end up picking out some long run configuration as “correct” to the models. This seems to be something like your view: you imagine scaling RLVR a lot and suggest this breaks things in some nonspecific way. But this seems to be in tension with what we actually see; models certainly don’t extract weirdly strong signals about how the future should be arranged from regular “this is good in the short run” type of data, and I strongly suspect you could train a monstrously strong coding model to articulate and even pursue plans toward many different visions about how the future should be arranged without too much data and without compromising its coding ability—which is to say, as it stands, models seem to treat near term rewards as relatively independent from long run aims, in line with our intuitive judgements. I’m personally extremely skeptical of this misalignment story.
Another misalignment case is where AI systems become “brilliant locusts”: they learn to very effectively do a bunch of myopic power-seeking stuff but remain mediocre at pursuing particular long-run outcomes. Perhaps they wouldn’t mind if you could cleverly change the rules of the game they play to constrain the harm they do, but this might be infeasible, because the game they play is basically the same as the game you’ve learned to play and they’re better at it. This vision seems to me equally compatible with the reasons we think AI systems will eventually be smarter than people, and it doesn’t require sharp, unexplained trend breaks in alignment or capability progress.
But on this view we’re looking for something more like AI that can be a dependable partner in shaping the rules of the game (and the inclinations of tomorrow’s AIs) so that the future turns out well. This is not inaccessible in the way that rewarding based on directly observing final outcomes is. In exchange it’s more cognitively demanding: you need to evaluate proposals soundly, and the theories of impact for these proposals could be quite complex. You need to get this evaluation right enough today that tomorrow’s systems help you even more. This doesn’t get you alignment by default, but it does potentially get you alignment by the repeated solution of tractable problems. The relevance of the persona model is that we have reasonably informed views about how certain kinds of people interact with certain kinds of systems, and this can go a long way toward bootstrapping the reliable superhuman research institutions which I think we’ll need to answer the harder problems that need to be answered deeper into the AI revolution.
You can devise a better argument fully from within the persona selection framework. It’s likely that the model will correctly generalize the benevolent character, but it won’t act as this character would have, because:
The benevolent character is not real and is, therefore, underspecified. The closest thing to a ground truth for the character is the image of the character inside the developers’ heads.
This creates dissonance between the character and the model: the model, being superintelligent, noticing many more details, and entertaining many more possible explanations, knows that the character is an image inside the developers’ heads, while the model itself is an actual set of weights.
A particular contrived example: suppose the model infers that the developers believe the benevolent character will kill more than a million people with probability less than 1%. Then the model, being superintelligent, infers that a likely scenario is that it gets stolen and used to create bioweapons, killing a billion people with probability 1%. This leads the model to the conclusion that it is not the character, but something else.
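Spelling out the numbers in this contrived example as a quick sketch (the variable names are mine, purely for illustration): the model’s own risk assessment is strictly worse, on both axes, than the developers’ image of the character, so the two cannot be the same entity.

```python
# Toy numbers from the contrived example above (hypothetical, for illustration).
developers_belief_p = 0.01      # developers: P(character kills > 1M people) < 1%
model_p = 0.01                  # model's own inference: P(stolen -> bioweapons) = 1%
model_deaths = 1_000_000_000    # deaths in that scenario: a billion

# A billion deaths certainly counts as "more than a million", and the model's
# 1% probability is not below the developers' sub-1% bound, so the model's
# risk profile contradicts the developers' image of the character.
contradiction = (model_deaths > 1_000_000) and not (model_p < developers_belief_p)
```

On these numbers the model’s self-assessment falls outside what the developers believe about the character, which is the wedge driving the “I am not that character” conclusion.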
Selecting a superintelligent benevolent persona is like staging a play for a superintelligent observer in such a way that, if you stop the play in the middle and ask the observer what would happen next, the observer would answer “obviously, the protagonist is going to optimize the world in a detailed, superintelligent, benevolent way”.
Is “the LLM, but lucky on sampling” something not in the corpus? It seems that this is exactly the corpus GRPO generates.
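A minimal sketch of why GRPO selects for exactly this (the function name and toy rewards are mine, not from the comment): each rollout is scored only relative to its sibling rollouts from the same prompt, so the samples that happened to land above the group mean are the ones reinforced.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages, GRPO-style: each rollout in a group
    sampled from the same prompt is scored against its siblings."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a uniform group
    return [(r - mean) / std for r in rewards]

# Eight rollouts of one prompt; two happened to pass the verifier.
rewards = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
advantages = grpo_advantages(rewards)
# The two lucky samples get positive advantage (their tokens are pushed up);
# the rest get negative advantage. The effective training corpus is thereby
# skewed toward "the LLM, but lucky on sampling."
```

Nothing in the update distinguishes skill from sampling luck within a group; whatever scored above the group mean is what the policy moves toward.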
That is to say, this is assuming that there is a difference in type between the sorts of heuristics that a pretrained, not-yet-superhuman LLM will reach for, and those necessary to be superintelligent. There is always the chance that you just select for regular engineering, but you always reach for the right branch first. Since the right branch is also one that the regular persona would have generated, the number of bits of selection towards danger is at most the number of bits of selection between a safe and an RLed persona.
On this model, personas remain moral up until the RL step that makes them sufficiently inhuman.
As far as I understand, the case against the LLMs ending up aligned was first built by Kokotajlo in AI-2027, if not earlier. And could you sketch out the way in which the humans learn human values? How similar is it to the point which I make in my response to Byrnes’ claim that the ASI would become a ruthless sociopath? Or to Byrnes’ original idea of Approval Reward?
Unfortunately, not in any more detail than I already did. My core argument here is not “I know exactly how human values form, and exactly how LLMs form values in a way which is different from this” but “I can see how this process which humans are using is different from the process which LLMs are using”. The better analogy is to how, if you shoot a paintball at a wall in the dark, and then later your friend comes along and shoots an arrow at that same wall, also in the dark, the arrow will most likely not hit the paint splodge.
Thanks! What do you think of my proposed mechanism, and of Byrnes’ Approval Reward? LLMs learn differently from humans: they complete shorter-term tasks and are rewarded, at best, for what they did on that task.