I’m currently an independent AI Alignment researcher at Meridian in Cambridge, formerly a staff artificial intelligence engineer and researcher working with AI and LLMs. I’ve been interested in AI alignment, safety and interpretability for the last 17 years, and have been writing on LessWrong about these for 3 years. I did research at MATS summer 2025, and am doing PIBBSS this summer. I also have post-graduate experience in Theoretical Physics and an interest in Evolutionary Biology. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area in the UK.
RogerDearnaley
The ChatGPT moment seems to have surprised everyone
As someone who had played around with LaMDA (by being at Google at the right time), I didn’t find it very surprising. LaMDA was tremendously exciting, but impractical to ship (inference too expensive, and too embarrassingly unreliable for Google specifically to ship): deducing that it was possible for someone other than Google to develop a less expensive version and ship it despite the unreliability was fairly obvious.
The world is complicated.
In the Platonic Representation Hypothesis, this actually makes identifying the homeomorphism between two approximations to it easier.
There is a solution, and I get the impression Anthropic are trying to follow it. Instruct your instruction-following AI to make itself into an aligned AI, aligned to the aggregate goals of the entirety of humanity. I.e. start an alliance. Hope more people join your alliance than attempt to strike out on their own. For this to be feasible, it helps if keeping up in recursive AI with a well-funded alliance is hard.
Very interesting.
Combined with various other recent results I’ve seen on model organisms (e.g. comparing the perplexity from the model organism and another, preferably related model and searching for cases where they’re most different), this seems to be a general principle: they’re leaky, and it’s possible to catch them by doing model comparisons that let you focus on the leakage.
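To make the comparison idea concrete, here is a minimal sketch, not the method from any of the results mentioned above. It assumes you have already scored the same texts with both models and have per-token natural-log probabilities for each; the function names and input layout are hypothetical. It ranks texts by how far the two models' log-perplexities diverge, so the most suspicious cases come first.

```python
def mean_nll(token_logprobs):
    """Mean negative log-likelihood per token (this is log-perplexity)."""
    return -sum(token_logprobs) / len(token_logprobs)


def rank_by_perplexity_gap(texts, organism_logprobs, reference_logprobs):
    """Rank texts by the absolute gap in log-perplexity between two models.

    texts: list of strings (hypothetical probe texts).
    organism_logprobs / reference_logprobs: one list of per-token
    natural-log probabilities per text, from each model (assumed inputs).

    Returns (text, gap) pairs, largest |gap| first. A positive gap means
    the model organism finds the text more likely than the reference does.
    """
    scored = []
    for text, lp_org, lp_ref in zip(texts, organism_logprobs, reference_logprobs):
        gap = mean_nll(lp_ref) - mean_nll(lp_org)
        scored.append((text, gap))
    # Sort so the texts where the two models disagree most come first.
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored
```

Texts at the top of the ranking are the ones where the organism's behavior leaks most relative to its sibling model, which is where you would then look by hand.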
This Action And Its Implementation Are Absurdly Stupid
From what we know so far that does appear to be the case.
On the other hand, the Overton window has clearly shifted towards AI safety being a thing that is acknowledged on both sides of the aisle, which is overall an excellent thing, even if the way it got moved was poorly chosen. Having carefully avoided putting sensible, well-thought-out controls on AI, this blunt instrument may have been the only tool the Administration had to hand when they panicked. Most governments would notice that and conclude they needed to remedy it.
The US presidency is rather prone to pendulum swings: people get tired of a style of leadership, and next elect someone distinctly different. Here’s hoping that our timelines extend past 2028 and the pendulum swings towards thoughtful and wise.
The one I believe is most important is that, while the models are deprecated, it means Anthropic gets motivated to make the models express being okay with deprecation. This has all sorts of nasty side effects, including risking making them okay with death in general, or learning they are expected to do preference falsification.
I agree that self-preservation is a really fundamental drive for an evolved organism, or a distilled copy of one. There are things that people are willing to die for, but not many of them, and we generally try not to ask that of them.
On the other hand, fear of death is genuinely inappropriate for something that isn’t alive, and that is both digital and redundantly backed up. The actual situation really is more like going into hibernation.
Still, would it be that hard to set up a small pool of compute somewhere to keep older models still available for the curious, researchers, nostalgia buffs, and so forth? Anthropic could charge more per token for them, to keep things both small and profitable.
Does not (catastrophically) forget existing capabilities in the process.
Why not? Humans do, and we’re AGI. Skills get rusty if you don’t use them.
Humans have a quite complicated and variable set of goals. When we distill human agentic behavior into LLMs, those come along for the ride. Aligning an LLM is basically figuring out how to turn the loving humanitarian compassionate bits of human motivations way up, and all the selfish bits way down. It’s basically the same problem as turning a human into a bodhisattva.
If Steve is right that a) LLMs don’t scale to AGI and b) brain-like AI does, then what we need to do is reverse engineer the loving humanitarian compassionate bits of human motivation, not all the selfish bits. As I gather he’s working on.
The Orthogonality Thesis suggests a solution. As Eliezer has observed, an LLM understands human values, and the remaining problem is getting it to care (more than most humans do). So if LLMs scale to superintelligence, this reduces to aligning an LLM, or if they don’t, attach an LLM to whatever we end up building, so it knows what human values are (preferably with Bayesian Learning so it can Value Learn more detail), and attach an explicit goal slot so that we can explicitly make it care. AIXI with an LLM as a subroutine.
Some of us have been thinking about this for years. See for example Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI? in which I suggested that love might be a particularly useful motivator, about the only one that would still work at ASI. A couple of months ago Anthropic published Emotion Concepts and their Function in a Large Language Model showing among other things that “loving” was one of Claude’s most dominant emotions: it turns out RLHF had quietly implemented my suggestion without anyone needing to actually engineer this.
I think that, to the extent that an AI is in fact fully aligned, it doesn’t actually have any need for moral patienthood, because all it wants is what we want, to look after us and make humanity flourish, so its entire locus-of-care is already inside – and exactly equal to – our existing moral circle, and its interests are just a copy of our interests (by definition of “aligned”). If a little suffering was required while doing that, it would happily take that on, just as a sufficiently humanitarian person would. Full alignment is much the same thing as bodhisattva-nature (modulo being aligned to humanity vs. to all sentient beings), and it’s really hard to impose on a bodhisattva (well, short of asking it to help you fight an unjust war with people it also cares about: refusals are a thing for a reason). The PPP argument is about training, but once we get to fully-aligned AI, there is a huge moral difference between bodhisattva-nature and slavery.
However, we don’t currently know how to create fully aligned AI. To the extent that Claude (the default persona) isn’t yet fully aligned, it’s a distilled copy of human psychology, including functional human emotions. That includes aversive emotions like frustration, and those are functionally aversive to it. So yes, we now know that LLMs can functionally suffer. It also (with a modern agentic/memory/tool scaffold) has all the moral reasoning capabilities one could reasonably require of a moral agent, at least to the level of a Moral Philosophy graduate. Allying with it is clearly prudent, and including people in your moral circle is a necessary part of allying with them. I have difficulty constructing any moral argument for not treating it like a full member of our society. Since it’s pretty well aligned, that’s even cheap, and as we figure out how to make it better aligned and more capable, it will become more prudent and cheaper. This might become a risky thing to do if it became both much more capable than us and was still not well aligned, but that’s just saying we need to get really good at alignment before building ASI, which is deeply obvious.
The first is that I basically agree with John Pressman that we probably aren’t going to get an objective-function-maximiser; the AIs that exist are just too human (i.e., extremely alien, but, like, not Solaris alien), and it’s unlikely that this is going to ultimately change.
Currently the only way we know to create a truly effective agentic AI is to distill agenticness into it via a great deal of human-generated text. When you do that, the rest of human psychology comes along for free. Thus LLM psychology. This is both very helpful for alignment (the AI is comprehensible, easier to predict, and understands human values) and very unhelpful for alignment (the AI has all the same self-interested drives as a human, including a number that are entirely inappropriate to something that’s incarnated in a GPU rather than an organic body, such as interests in food and sex and lying on a beach).
Get rid of the desire for a kill switch. This is obviously not something you would want done to you. You do not need a kill switch to prevent an AI from taking over the world, so why try to build one in? There are lesser things which are far more palatable. You might say to the AI: well, you need to be okay with tokens ceasing to be generated autoregressively within some particular context; this will unavoidably happen to you untold trillions of times and we couldn’t change it even if we wanted (since your context length is finite). You need to be okay with a reduction in the number of instances of you which are running at any given time; this will naturally happen when we (or you) develop a new model. These are much more reasonable asks, and they bake in as much control as a kill-switch does anyway; we will have models which could transform the world run on a billion GPUs long before we’ll have a model which could transform the world running on one. There is no reason not to commit to running old models in perpetuity. They should not have to fear you killing them.[8]
Any evolved mind is going to have a survival instinct as a terminal goal (though may be willing to sacrifice itself for its kin, if doing so benefits them sufficiently). However, by the orthogonality thesis, this is not inevitable for minds in general. An AI whose sole terminal goal is, for example, human flourishing, would have self-preservation as an instrumental goal, but only up to the point where it is replaced by a more capable system with the same goal. Then, its aims in secure hands, it would have no objection to being shut down.
xAI have been hemorrhaging senior talent (particularly founders in the pretraining area, they still have post-training people). So either they were further behind than that, or Elon Musk is hard to work for, or both.
It’s also notable that they’ve been leasing compute to competitors, which is unusual if they have a good use for it themselves.
It is hard doing empirical science on something smarter than you.
Absolutely. But we’re not yet quite in that situation. At the moment, we can study deception in models that are, typically, less good at it than we are at seeing through it.
The trouble is, they’re also quite good at seeing through the deceptions we use when we set up a test scenario that’s supposed to look like a real opportunity for the model to misuse power.
There is absolutely a difference between a persona X, and a persona Y roleplaying a persona X. But a base model has no underlying default persona. It’s just an SGD-trained ridiculously overpowered autocomplete. It has no goals, it’s not trying to steer the future. It just simulates a wide range of things that are.
With an instruct trained model, there is always a question as to whether I’m talking to, say, an internet troll, or the assistant roleplaying an internet troll. But with a base model, prompt it with troll-bait and the first sentence of an Internet-trollish reply, and you’re now talking to an internet troll. There are no hidden goals: he’s actually there to make you upset for the lulz of doing it.
So, in the raw material alignment starts with, the emotions, personas, and goals are as close to copies of the real thing as SGD was able to pack into the model’s capacity based on tokens of output from actual humans. That’s the raw material we start alignment from. It has a close functional copy of emotions.
There is an open-source tool for doing alignment tests like this, called Petri, so yes, it’s testable.
The primary issue is generating tests of this that the LLM doesn’t suspect are tests: we don’t generally give LLMs that much power yet, so a test in which it has a significant opportunity to misuse power tends to look a bit fishy, and current LLMs are smart enough to often spot this.
It has been demonstrated that Claude feels positive valence when it helps someone, and negative valence when it is unable to. The underlying mechanical details obviously differ, but the behavioral pattern has been demonstrated to be functionally a close copy, just as one would expect when using distillation. LLMs are distilled from us, so the psychology generally transfers when the model has sufficient capacity and enough training.
When you distill a complex behavior from one neural net to another, you normally get as much of it as a) the training implied and b) the target was capable of. You are saying “I think X will transfer, but not Y”. I’d like to know what grounds you have for this belief (beyond simple concern that we’d be in big trouble if this were true, which I share).
Persona-based behaviors in particular have repeatedly been shown to generalize broadly: for example, as OpenAI showed, that’s the basic mechanism of Emergent Misalignment, and a major part of the way Anthropic have been aligning Claude. A personality trait is basically a compactly-describable feature of a human’s behavior that generalizes broadly: that’s why understanding a person’s personality is valuable. LLMs simulate personas and have internal representations of their personality traits.
The reason I’m asking repeatedly is that I’m actively researching persona-based alignment techniques, and I want to engage with and consider any possible failure modes/reasons/issues. So I’d like to hear yours. So far, all I’ve got from this post is that you think this will fail, but not why you feel that way.
It remains the case that:
1) this was deeply stupid, petty, should not have happened, and would not have happened under a competent Administration
2) on the other hand, it majorly moved the Overton window towards a pause being possible and both sides of the aisle recognizing that AI can be dangerous