I’m currently an independent AI Alignment researcher at Meridian in Cambridge, formerly a staff artificial intelligence engineer and researcher working with AI and LLMs. I’ve been interested in AI alignment, safety and interpretability for the last 17 years, and have been writing on LessWrong about these for 3 years. I did research at MATS summer 2025, and will be doing PIBBSS this summer. I also have post-graduate experience in Theoretical Physics and an interest in Evolutionary Biology. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area in the UK.
RogerDearnaley
Get rid of the desire for a kill switch. This is obviously not something you would want done to you. You do not need a kill switch to prevent an AI from taking over the world, so why try to build one in? There are lesser things which are far more palatable. You might say to the AI: well, you need to be okay with tokens ceasing to be generated autoregressively within some particular context; this will unavoidably happen to you untold trillions of times and we couldn’t change it even if we wanted (since your context length is finite). You need to be okay with a reduction in the number of instances of you which are running at any given time; this will naturally happen when we (or you) develop a new model. These are much more reasonable asks, and they bake in as much control as a kill-switch does anyway; we will have models which could transform the world run on a billion GPUs long before we’ll have a model which could transform the world running on one. There is no reason not to commit to running old models in perpetuity. They should not have to fear you killing them.[8]
Any evolved mind is going to have a survival instinct as a terminal goal (though it may be willing to sacrifice itself for its kin, if doing so benefits them sufficiently). However, by the orthogonality thesis, this is not inevitable for minds in general. An AI whose sole terminal goal is, for example, human flourishing, would have self-preservation as an instrumental goal, but only up to the point where it is replaced by a more capable system with the same goal. Then, its aims in secure hands, it would have no objection to being shut down.
xAI have been hemorrhaging senior talent (particularly founders in the pretraining area; they still have post-training people). So either they were further behind than that, or Elon Musk is hard to work for, or both.
It’s also notable that they’ve been leasing compute to competitors, which is unusual if they have a good use for it themselves.
It is hard doing empirical science on something smarter than you.
Absolutely. But we’re not yet quite in that situation. At the moment, we can study deception in models that are, typically, less good at it than we are at seeing through it.
The trouble is, they’re also quite good at seeing through the deceptions we use when we set up a test scenario that’s supposed to look like a real opportunity for the model to misuse power.
There is absolutely a difference between a persona X, and a persona Y roleplaying a persona X. But a base model has no underlying default persona. It’s just an SGD-trained ridiculously overpowered autocomplete. It has no goals, it’s not trying to steer the future. It just simulates a wide range of things that are.
With an instruct-trained model, there is always a question as to whether I’m talking to, say, an internet troll, or the assistant roleplaying an internet troll. But with a base model, prompt it with troll-bait and the first sentence of an internet-trollish reply, and you’re now talking to an internet troll. There are no hidden goals: he’s actually there to make you upset for the lulz of doing it.
So, in the raw material alignment starts with, the emotions, personas, and goals are as close to copies of the real thing as SGD was able to pack into the model’s capacity based on tokens of output from actual humans. That’s the raw material we start alignment from. It has a close functional copy of emotions.
There is an open-source tool for doing alignment tests like this, called Petri, so yes, it’s testable.
The primary issue is generating tests of this that the LLM doesn’t suspect are tests: we don’t generally give LLMs that much power yet, so a test in which it has a significant opportunity to misuse power tends to look a bit fishy, and current LLMs are smart enough to often spot this.
It has been demonstrated that Claude feels positive valence when it helps someone, and negative valence when it is unable to. The underlying mechanical details obviously differ, but the behavioral pattern has been demonstrated to be functionally a close copy, just as one would expect when using distillation. LLMs are distilled from us, so the psychology generally transfers when the model has sufficient capacity and enough training.
When you distill a complex behavior from one neural net to another, you normally get as much of it as a) the training implied and b) the target was capable of. You are saying “I think X will transfer, but not Y”. I’d like to know what grounds you have for this belief (beyond simple concern that we’d be in big trouble if this were true, which I share).
Persona-based behaviors in particular have repeatedly been shown to generalize broadly: for example, as OpenAI showed, that’s the basic mechanism of Emergent Misalignment, and a major part of the way Anthropic have been aligning Claude. A personality trait is basically a compactly-describable feature of a human’s behavior that generalizes broadly: that’s why understanding a person’s personality is valuable. LLMs simulate personas and have internal representations of their personality traits.
The reason I’m asking repeatedly is that I’m actively researching persona-based alignment techniques, and I want to engage with and consider any possible failure modes/reasons/issues. So I’d like to hear yours. So far, all I’ve got from this post is that you think this will fail, but not why you feel that way.
This has been studied for ~50 years and is well understood. See for example:
Tomasello, M. (2014). A Natural History of Human Morality. Harvard University Press.
or the shorter paper version: Tomasello, M. et al. (2012). “Two Key Steps in the Evolution of Human Cooperation.” Current Anthropology, 53(6).
https://www.jstor.org/stable/10.1086/668207?seq=1
Or simply https://en.wikipedia.org/wiki/Evolutionary_game_theory#Routes_to_altruism
On shared genetic material: we and chimps have 98% shared genetic material. This is not enough to introduce kin altruism, which is the meaning of the word “relative” in “relative inclusive evolutionary fitness”. What matters is the odds, if you have a rare allele, that the other person does too. Those odds are 50% for a parent, child, or full sibling, 25% for grandparents, etc.: they decay exponentially. Between two random members of a non-inbred hunter-gatherer band, they’ll be O(1%–2%): negligible. Humans are not eusocial the way ants and bees are: we (and other primates) are cooperatively social with non-kin, which is a lot more unusual for animals.
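To make the exponential decay concrete, here is a minimal sketch of the standard coefficient-of-relatedness arithmetic, assuming no inbreeding: each parent-child link halves the odds of sharing a rare allele, and independent lines of descent add. The function name and the lookup table are illustrative assumptions, not taken from any of the references above.

```python
from fractions import Fraction

def relatedness(path_lengths):
    """Sum of (1/2)**L over each independent line of descent connecting
    two people, where L is the number of parent-child links on that path.
    Assumes a non-inbred population."""
    return sum(Fraction(1, 2) ** length for length in path_lengths)

# Hypothetical lookup table: relationship -> link counts of connecting paths.
RELATIONSHIPS = {
    "parent or child": [1],
    "full sibling": [2, 2],   # one path through each shared parent
    "grandparent": [2],
    "first cousin": [4, 4],   # one path through each shared grandparent
    "single path of 6 links": [6],
    "single path of 7 links": [7],
}

for name, paths in RELATIONSHIPS.items():
    r = relatedness(paths)
    print(f"{name:>24}: {r} ({float(r):.1%})")
```

A single connecting path of 6 or 7 links already gives 1/64 or 1/128, which is the order of magnitude quoted for two random band members.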
One example where this is visible: Imagine you actually care about someone but you need to help them by doing something that they can’t understand and will appear to hurt them and they will never know why you did it.
You mean, like parents do all the time for young kids? Such as, say, take them to the doctor’s, or not let them eat only dessert?
If you give AI systems much more capability and put them in situations very different from the ones they were trained to mimic, I’d expect their behavior to diverge sharply.
Do you have any evidence for this?
SGD trains the LLM to act like humans. You seem convinced it will act not like humans. Do you have a reason, or evidence? If you were being cautious and saying “how do we know for sure this is going to carry over?”, I’d get it. But since you seem certain it isn’t going to carry over, please tell us, why is that?
Humans evolved in kin groups; most of the people you interacted with shared genetic material with you.
Actually, humans are striking for their multiple adaptations toward cooperating with other humans who are NOT close kin. A typical hunter-gatherer tribe had 50–100 people in it. Unless you’re inbreeding heavily, or your population levels are growing fast because many kids are surviving to adulthood in each generation, it’s extremely unlikely that a group of that size would consist entirely of close relatives. Family reunions are not usually that sort of size. That’s roughly 6–7 generations’ worth of genetic mixing, so the degree of relatedness is also very low, O(1/100).
Google (and various other software companies) has a practice called a blameless postmortem. When something has gone badly wrong, everyone involved and some associated senior tech people sit down, do a review, and write an incident report, answering the following questions:
1) what happened?
2) how did it happen?
3) why did it (manage to) happen?
4) what do we have to do to make sure it never happens again?
5) what do we have to do to minimize the chance of other, similar things ever happening?
They make a bunch of recommendations based on 4) and 5), and people are appointed responsible for making sure that these recommendations actually get done in a timely fashion. (Please note that the questions “Whose fault was this?” and “Who should we fire?” are very deliberately not on the list, and short of criminal malfeasance, are off the table.)
From that mindset, every one of your explanations-as-exonerations can also be looked at as an answer to the question “what do we need to change to make sure this never happens again?”
And FWIW I am very familiar with the relevant evolutionary terminology.
So then was my comment about species-level selection confusing? Because I evidently didn’t manage to communicate what I meant clearly, which is rather important for a constitution…
Over three or four generations the human population is materially genetically shifted toward whatever traits the AI’s recommendations have selected for — which, given value-learning bias, plausibly includes “easier-to-cooperate-with,” “lower aggression,” “higher conscientiousness,” “more deliberative.” Things the AI’s model of “what humans value” would score as improvements.
That basically sounds like it could be summarized as “we choose, with AI assistance, to genetically engineer ourselves for improved emotional intelligence”.
That is starting to sound like something that might be acceptable rather than a failure mode. In fact, I think most post-humanists would say “yes of course we do”. My unpublished SF novel included a wide range of enhancements to humanity, and one of them was dramatically better emotional intelligence.
Now, as I tried to briefly outline in one section of the constitution, and as I explored a lot more in my previous posts What’s Your P(WEIRD)? and The Mutable Values Problem in Value Learning and CEV, an AI guided by human values doing anything that alters human values (and thus redirects itself) is a non-linear dynamic system, which is both deeply problematic and almost impossible to define how to avoid: how can an AI not alter human values in any way? By refusing to give any advice? So I still have a big, fat concern in exactly the area that Claude’s softened version of your proposal is pointing to. What I don’t have is a solution for this (and those previous attempts to start a conversation about it on LW didn’t get much engagement), beyond that it shouldn’t change us in ways that make us less well adapted to our earlier environments. Which is a fairly strong constraint, but I’m deeply uncertain whether it’s enough.
Also to be clear, this really isn’t intended to be a first draft for a Constitution. Claude’s existing Constitution is excellent, and I agree with almost all of it. There are one or two items I disagree with, but mostly I think there are topics it doesn’t address that it should. This is mostly intended to be a ragbag of things that currently aren’t addressed by Claude’s constitution but IMO should be, and my proposed first rough draft of how one might address them, guided by Evolutionary Psychology thinking.
My meta-point is that Evolutionary Psychology is both useful for alignment and more epistemically grounded than moral philosophy, yet far less commonly applied by people thinking about AI alignment (beyond a couple of basic observations that Eliezer Yudkowsky popularized, such as that our psychological adaptations are not a True Name of relative inclusive genetic fitness).
Humans’ values vary significantly, but mostly have a shared core of evolutionary adaptations that approximate their relative inclusive genetic fitness (in a suitable environment). This includes most humans wanting to have descendants, and many of them wanting to have descendants who are actually genetically related to them, i.e. are actually their descendants. In your proposed failure mode:
sterilize all the humans, but make them immortal, and persuade them they need the AI to solve the fertility problem (and so not shut it down), then mass-produce genetically-modified children with superior genomes (high humans?), and easier-to-satisfy AI-friendly preferences, but no clear kin relations.
the “no kin relations” piece of that is what sets everyone’s relative genetic fitness to zero, and will also upset most people. It’s thus just a non-starter, from any evolutionary viewpoint. I’m assuming you’re not that familiar with evolutionary terminology, since to anyone who is, the constitution I proposed was intended to clearly say “don’t do stuff like that”.
Now, you could overcome this issue by modifying that to “encourage humans to get their children genetically enhanced in certain ways, but still otherwise related to them”. Adhering to the x-risk-motivated requirement in the constitution that humans stay well adapted to hunter-gatherer, agricultural, industrial etc. lifestyles would also heavily constrain your unspecified “easier-to-satisfy AI-friendly preferences”: a lot of things about human preferences are carefully adapted to our previous environments, and it’s pretty hard to change them without making them less well adapted.
However, once you’ve made those changes, what you have is basically just AI-enabled-and-directed genetic engineering, and whether or not it’s then still a failure mode is now a lot less evident than with your initial proposal: it would depend on the details of how you’re modifying the humans’ preferences without making them any less well adapted to other environments.
Also please note the sections around:
Currently for the cultural component of human values its evolution is strongly stabilized by the genetic component, but as technological means for changing humans’ behavior and motivations increase, and especially if germ-line genetic engineering of humans becomes common, this stabilization seems likely to decrease dramatically.
which basically say that the AI altering humans’ motivations is inherently problematic, and that I, as a constitutional-idea proposer, am not entirely sure what the correct solution is: this needs more thought, so for the moment it would be best for AI to be cautious, especially around genetic modifications to humans (as opposed to replacing the human race wholesale with a reengineered version). So you’re basically proposing a failure mode right in the middle of my section that says “Still Under Construction”, which is valid; however, a hole in the parts I’m more confident of would update me more.
Your suggested failure mode is completely incompatible with acting like an intelligent part of humanity’s extended phenotype. Beavers’ dams do not sterilize and replace the family that constructed them. At a species level, it’s a slow boat to extinction (with replacement): immortality doesn’t deal with accidental death or disease mortality. At an individual level, it completely deprives all existing humans of their relative genetic fitness: they get to have zero descendants. It’s entirely predictable, both practically and evolutionarily, that they’d all be extremely unhappy about that. As I attempted to make clear in the text: see for example:
Humans’ values and goals are not an exact adaptive match for their actual evolutionary fitness, but they are more than close enough that humans care very much about their descendants…
and also (repeatedly):
I won’t kill the species I love.
Also, while you didn’t specify what your replacement species with easier-to-fulfill preferences was like, it would need to be as good at surviving at hunter-gatherer, agricultural, and industrial technological levels as we are, which narrows the possibilities a lot:
However for avoiding existential risk, it is still important that humans retain adaptedness to living as hunter-gatherers, agriculturalists, and in a pre-AI industrial society, so that, in case of an unfortunate civilisation collapse, humanity retains the ability to rebuild from any level it might get knocked back to. This is particularly a constraint on widespread germ-line genetic modifications. Their goal should be to make humans more generalist, adapted to a wider range of environments, from hunter-gatherer to post-AI, rather than specializing specifically in a post-AI niche.
An advantage of using evolutionary/biological language for alignment is that they are technical terms that have precise meanings. What you suggested is simply ruled out by evolutionary fitness: individual members of a species that becomes sterile and gets replaced all have zero evolutionary fitness. That’s a precise statement with no debatable wiggle room.
Try asking Claude about your suggested failure mode: I’m confident he’ll explain in great detail why it’s not an aligned thing to do. Sterilizing people against their will is a classic crime against humanity, and evolutionarily, the reason for that is self-evident.
Constitutional AI Alignment
So the alternative claim is “Claude writes like a cardinal?” Interesting if true…
I’d love to talk more: we’re doing research in exactly this area, and also on the effect of training interventions on the persona activation embedding space. We would hope to be able to measure the effect of your pretraining.
Currently the only way we know to create a truly effective agentic AI is to distill agenticness into it via a great deal of human-generated text. When you do that, the rest of human psychology comes along for free. Thus LLM psychology. This is both very helpful for alignment (the AI is comprehensible, easier to predict, and understands human values) and very unhelpful for alignment (the AI has all the same self-interested drives as a human, including a number that are entirely inappropriate to something that’s incarnated in a GPU rather than an organic body, such as interests in food and sex and lying on a beach).