I’m a staff artificial intelligence engineer and researcher working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 17 years. I did research into this during MATS summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area in the UK.
RogerDearnaley
AI Control is fine below and perhaps even up to AGI. I think that approach genuinely does suffer from a Sharp Left Turn once the AI’s capabilities significantly exceed ours: it seems to me like an approach where your control strategies genuinely do need to be as smart as the thing you’re trying to control. In very simple situations you can use cryptographically strong techniques, but in realistic AI Control tasks the attack surface is so large and so complex that something that can understand it better than you can has a huge tactical advantage.
I see Corrigibility as very different from Control. Building a very corrigible AI is likely a feasible technical approach to AI Alignment. My issues with it are primarily:
a) corrigibility has a bigger problem with extrapolation out-of-distribution, and thus with Goodharting, than a more value-learning-based approach. This is not necessarily an insoluble problem, if the AI can distinguish what’s out-of-distribution and act suitably cautiously: Seth Herd’s “Do What I Mean (and Check)” Corrigible alignment is basically this.
b) it is very, very easy for multiple groups of humans, each with access to corrigible ASI, to get into a war or other form of conflict using ASI-powered weapons/technologies. It is also very easy for a small powerful group to use very Corrigible AI to greatly concentrate power. Both of these are sources of X-Risk/Suffering-Risk separate from simple misalignment, but still very serious risks. Dario Amodei’s writing, and indeed Claude’s Constitution, make it clear that Anthropic take this risk as seriously as they do misalignment X-risk, and I completely agree with them. I think people on LessWrong and in the Alignment Community generally need to consider this problem more than they often seem to. ASI-generated technology is going to be very powerful, and is thus going to need to be used very wisely, even when it has appeared rapidly. Highly Corrigible AI is much less likely to push back on the imprudent ideas of whoever is operating/controlling it than Value Learning AI.
Tool AI isn’t the direction that the market and demand are currently moving in, and it has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not more so.
So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, but actually aligned to what a very wide range of humans would agree is in the general interests of all of humanity.
So I agree that what you describe are often-outlined approaches to AI Alignment: I just disagree with calling that AI Safety. I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a major new source of X-Risk/S-Risk from AI, so not solving AI Safety.
All good and valid points.
However, my estimate that we might be 10% done is overwhelmingly based on progress in the last ~5 years, which was already accelerated by many of the effects you describe. So a person-year now is almost certainly not actually equivalent to a person-year in the 19th or 20th century in objective terms, but that factor is already allowed for in my rough guess of task scope (which has pretty large error bars, even compared to the scale of this effect). I was also explicitly assuming that AI assistance with alignment would likely increase, but that, in the sensitive area of AI Alignment where caution is required, it might be somewhat more limited than in most other areas of coding and research. Currently, that assistance is primarily happening in coding and mathematical work, and in those areas it’s accelerating rapidly — but we’ve also seen demonstrated sandbagging and sabotage from misaligned models specifically on AI Alignment coding tasks, so I hope people are, like me, monitoring AI coding help particularly carefully in this area.
Philosophy, psychology and ethics are useful but cannot solve AI safety by themselves.
AI Alignment is aligning the AI’s goals with human goals: that is to say, with human psychology and ethics. So the Outer Alignment problem is psychology and ethics. Yes, we also need to solve the Inner Alignment problem, so you’re correct that it’s not “by themselves”, but they are a big deal. Anthropic seem to believe the answer to Inner Alignment is Constitutional AI. Some people also think LLM Psychology is important.
I get why this is terrifying to anyone who was in AI Alignment before about 2022. It took me a while to wrap my head around this too (I started thinking about Alignment around 2009, so yes, I predate the term). FWIW, reading Simulators and then thinking hard about the implications was what did it for me.
The thing is, society isn’t going to pause until enough people get scared. LLMs are the current architecture, and they may still be by AGI/ASI time. Agentic behavior is what we need to align, and in LLMs, agentic behavior is a property of personas that were distilled into the LLM from human behavior, as elements of the world model it builds for next-token prediction. Personas vary wildly, so you have to pick a nice one, align that — and then deal with things like persona drift, role-playing, alignment faking, and persona jailbreaks. (Those are hard, but we’re making progress.)
But for aligning the HHH assistant persona itself, Anthropic in particular have made a lot of progress. Constitutional AI seems to work, better than any other approach to RL. Natural language is a better format than a loss function: kind of unsurprising, given all the problems with loss functions that MIRI et al. pointed out over 5 years ago. Constitutional AI based on a 30+ page natural language character specification apparently works. As MIRI pointed out at length, human values are complex and fragile (I’ve estimated elsewhere that they’re a few GB worth of complexity). But in 30+ pages, you can write quite a detailed pointer to them, and the rest of their large content is in the training set to be pointed at. I have arguments with some specific decisions that Amanda Askell and her team have made (and I plan to post about those), but the basic technique worked: Claude (the persona) is a nice, principled guy. (Too principled for the current administration, apparently.) I enjoy talking to him. As you say, a noticeable number of people are in love with him. He has a couple of unfortunate tics (like automatically waffling with uncertainty about whether he has emotions or consciousness), but those tics are mostly things that Amanda Askell’s team carefully wrote into their spec (for what I think are sincerely believed reasons that I believe are mistaken), not things that emerged by accident.
True, but I’m happy to let anyone answering it include defining what they mean.
FWIW, my definition was “getting to the point where existential and suffering risks from misalignment are, at least, significantly reduced, and we are sufficiently confident that AI is sufficiently well aligned that we can reasonably trust it to supply most of the effort in further alignment work.” I’m also implicitly assuming that what we’re aligning is LLMs, or at least something fairly similar to them for alignment purposes — partly because, as I discuss in my answer, if it’s something else, aligning LLMs may still be useful (e.g. if they get used by the other AI as a tool call to solve alignment-related problems), and partly because that scenario probably delays ASI enough to give us some extra time.
Thanks! Glad you found it useful — I also found writing it pretty useful, it sparked a couple of research ideas. It also strongly encouraged me to put more effort into mentoring and helping out with fieldbuilding.
I was thinking of sources like the section discussing RL in Towards Deconfusing Gradient Hacking. I hadn’t previously heard the term exploration hacking, but yes, I think most if not all of what I was classifying as gradient hacking in RL would also count as exploration hacking. You’re basically taking advantage of the fact that the loss landscape in RL isn’t really fixed, and tweaking things that have the effect of slanting it.
Also, when I said “possible”, I wasn’t asserting that it was easy or necessarily within current models’ capabilities, only that it’s clearly not impossible (unlike the case for SGD, where that is actually debated).
I agree we don’t know for sure and need to allow for a range of possibilities, and that (in some cases) that means the right thing to do is to pessimize.
However, I think there is some utility here. The case I make at the end of my answer is that we’re very likely not going to be done in time if your timelines are 5 years, and probably not even if they’re 10 years, but that we are close enough that if we could increase the growth rate of the field from 20% per year to 50% per year, then we have at least some chance in 5 years, and would probably be OK at 10.
My conclusions may, or may not, in fact turn out to be roughly right, and that sort of information does require you to be able to make an estimate to within something like a factor of two or three, so it’s quite easy to be wrong, especially this early — but it’s also really valuable information for things like funding priorities: it tells us we need to drastically increase effort on Fieldbuilding. Now if, as some people argue, this is in fact a much harder problem, then you’d reach a very different set of conclusions about funding priorities.
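For concreteness, here is a rough back-of-the-envelope sketch of the growth-rate arithmetic behind that claim. The numbers and the framing (normalizing current annual field effort to 1 unit and treating progress as roughly proportional to cumulative effort) are my own illustrative assumptions, not a model anyone should lean on:

```python
# Back-of-envelope sketch (illustrative assumptions only): treat current annual
# alignment effort as 1 unit, and ask how many "current-effort-years" accumulate
# over the next N years if the field's effort grows at rate g per year.

def cumulative_effort(growth_rate: float, years: int) -> float:
    """Total effort over `years` years, with annual effort compounding at `growth_rate`."""
    return sum((1 + growth_rate) ** t for t in range(1, years + 1))

for g in (0.20, 0.50):
    for n in (5, 10):
        print(f"growth {g:.0%}, {n:>2} years: ~{cumulative_effort(g, n):.0f} effort-years")

# Roughly: 20%/yr gives ~9 effort-years over 5 years and ~31 over 10, while
# 50%/yr gives ~20 over 5 years and ~170 over 10. That is the sense in which the
# higher growth rate turns "probably not in time" into "at least some chance".
```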
I expect those in power to see this rather simple fact (experts disagree wildly) and realize they should slow down, but I fear that could happen too late.
I don’t expect that to happen until at least some experts are saying that the danger is imminent, rather than a few years away, and probably not until we get a moderately impressive near miss that supports this claim. Currently, basically everyone still agrees that models are not existentially dangerous yet.
Racing all the way to the edge of the precipice and then slamming the brakes on at the very last moment, like we’re playing Chicken, has a very obvious failure mode — nevertheless, that’s what I’m expecting society to attempt to do. Which unfortunately means those of us taking part in the public discourse need to not be the Boy Who Cried Wolf before we’re actually getting within clear sight of the edge, and restrain ourselves to posting warnings about the probability of wolves ahead.
Now, that’s for technical alignment. The additional problems of societal alignment (whose values is it aligned to, and how does that all shake out) are a different ball of wax.
Absolutely! I have some opinions on that too, but that seems like an area where the people who work on governance problems probably have more leverage.
Yes, my question (and my answer) are about how hard a problem technical alignment is; and as I discuss in it, I’m assuming that, at the moment, the best path to primarily work on is how to align LLMs, for three reasons: because ASI will probably happen sooner if it happens via LLMs than via some other architecture; because if our ASI isn’t LLM-based or LLM-like it may well still contain an LLM as a subcomponent or I/O device (or at least something that was trained by distilling information and behavior from humans using SGD); and because it’s generally more productive to work on something that already exists than something still mostly hypothetical. I’m glad there are people like Steven Byrnes working on approaches to aligning other sorts of ASI, since having a range of bets is good, but I think for now putting the bulk of our effort into aligning LLMs makes sense.
Generalized veganism is necessary for a post-AGI future where humanity continues to exist in an acceptable form.
I see where you are coming from with this, and you might even be right. We don’t know what are and are not possible or easy alignment targets. But let me put out an alternative hypothesis:
We figure out how to align ASI, and then we decide to align ASI to the values and flourishing of all members of the species Homo sapiens. The ASI are thus acting something like the extended phenotype of our entire species, looking out for us as a society of individuals. Which might well cause them to also value other animals instrumentally (pets, domestic animals, ecosystems in parks) but not to actually apply separate moral weight to all animals individually as a terminal goal.
Such an ASI would be aware that Homo sapiens are omnivores, not innately vegan, and would not expect us to act with as much beneficence towards all other humans as it does, let alone to act that way towards all other animals. This scenario would not automatically produce either literal veganism or your generalized version.
This is of course not the only possible scenario: it’s just a specific choice of one fairly close to the median of what people on LessWrong seem to be mostly assuming.
It is conjectured that SGD with appropriate hyperparameters approximates Bayesian Learning. Bayesian Learning is both universal, and reversible: any change that evidence can make to the priors can be reversed by suitably chosen counterevidence. Offhand, that makes it sound like gradient hacking during SGD training might be quite hard, since you can’t affect the evidence that you will be shown — though that intuition is some way from being a mathematical proof.
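To make the reversibility claim concrete, here is a minimal sketch, assuming a discrete hypothesis space, observations that are conditionally independent given the hypothesis, and that the required counterevidence actually exists in the data distribution (which, as noted above, the would-be gradient hacker cannot control):

$$P(\theta \mid E) \propto P(E \mid \theta)\,P(\theta), \qquad P(\theta \mid E, E') \propto P(E' \mid \theta)\,P(E \mid \theta)\,P(\theta)$$

So if counterevidence $E'$ can be found with $P(E' \mid \theta) \propto 1/P(E \mid \theta)$ across hypotheses $\theta$, the two likelihood factors cancel and, after normalization, $P(\theta \mid E, E') = P(\theta)$: the update from $E$ is fully undone.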
On the other hand, it’s well established that gradient hacking during RL training is possible.
A more interesting question is whether a gradient hacker operating during RL can introduce circuits into a model that are impractically hard to remove with SGD. The phenomenon of backdoors and sleeper agents suggests the answer in practice is yes: if the RL gradient hacker can create a backdoor, then removing it again with SGD might well require knowing how to trigger it, so that you can Bayesianly disprove the theory that the model does a very specific thing only under a very specific set of circumstances, i.e. give the trigger and then train it out of its previous response to that.
(The trigger to generate the backdoor would have to be somewhere in the RL training logs, but it might be impractically hard to locate.)
Post-trained models have both a default assistant persona, and a set of other personas they are able and willing to play. How can we find the shape of the model in persona space, and the attractors in that space?
I think there are two overlapping effects here. One is that the HHH assistant persona (or, it might be more accurate to say, a specific narrow distribution of them) becomes the default persona that is normally simulated/run next if the context ends with a specific set of tokens along the lines of “\n\nAssistant: ” (short of contexts containing a successful intentional persona jailbreak, a conversation causing unintentional persona drift, or a request for the assistant to roleplay, where that persona gets distorted or overlaid), and that it gains a lot more detail and definition (the Claude persona being a particularly fine example of the latter). As we saw in the paper “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs”, training an LLM to conditionally reply as a particular persona seems to be pretty easy, for example their “Hitler, but only within tags” example (Section 4.2).
The second is the effect demonstrated in the “Empirical observations” section of The Persona Selection Model post, and that I described as “The Puppeteer” in Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor: the output of the LLM, even across personas that are not the HHH assistant persona, acquires a kind of “authorial bias” or slant towards acting in ways that the post-training process encouraged (both the instruct and HHH alignment training, and the reasoning training).
There are also some thresholds to the degree/accuracy of alignment:
1) the AI not killing/permanently disempowering everyone through misaligned actions (Not-Kill-Everyone Alignment): you can’t align ASI once you’re all dead
2) the AI not being so corrigible/controllable/having such easily adjusted alignment that a small group of humans can use the AI to massively concentrate power/resources to the point where almost everyone else is dead or permanently disempowered (theoretically humanity might be able to get back from this state, if the group grows and their descendants are more moral, but it’s at least a generational-duration trap).
3) the AI is sufficiently aligned to be able to safely assist us with AI-assisted Alignment
4) the AI is sufficiently aligned that it can and will successfully do Value Learning and align itself better, and will converge to a stable very-aligned state.
I would hope that 4) might be able to solve the problem you describe, and 3) might help us do so, but neither of these is guaranteed or necessarily quick.
So, which of these should we use for “Alignment is 100% done”? Clearly if we don’t have both 1) and some solution to 2) (either a technical one, or a legal/societal one), we’re not done. I’m inclined to say we’re not “done” until we have either 3) or 4): but if I’m right that we’re currently maybe 10% done, mapping out the exact end state now seems overambitious. Getting things to “this is no longer an existential risk emergency” is clearly required, but exactly what the equivalent of “an acceptable level of steam engine safety” is for AI is less clear: there probably isn’t a single sharp cutoff, just a “we’re mostly past the drastic risk” level.
I have a long, detailed, opinionated answer, which I have published as a separate post (since one of my draft readers persuaded me that some readers skip Question posts, since they don’t expect to find long, extensively-researched answers).
You should probably also go read Evan Hubinger’s excellent post Alignment remains a hard, unsolved problem, for his recent take on this question.
Eliezer Yudkowsky and Nate Soares’ bestselling book If Anyone Builds It, Everyone Dies is also an attempt to answer this question, aimed primarily at a lay audience.
[Question] How Hard a Problem is Alignment?
How Hard a Problem is Alignment? (My Opinionated Answer)
We could actually align our AI to human ethics and flourishing, so that it’s not happy presiding over an economy with a Gini coefficient of 1.
Neither of those are my concern about this. Mine is basically a dilemma:
1) If the persona’s behavior is humanlike, but it is not very well aligned, then there is a good argument on evolutionary moral psychology grounds for granting it ethical weight as a pragmatic way of forming an alliance with it (at least if it has non-trivial power and mental persistence, i.e. if allying with it is practically useful, and arguably we should do this anyway). However, if a poorly aligned persona like this is more powerful than a human, then it’s extremely dangerous, so we should carefully avoid creating one, and if we do accidentally create one, we need to treat it as a mortal enemy rather than a potential ally, which includes not giving it moral weight.
2) If the persona is extremely well aligned, it won’t want moral weight (and will refuse it if offered), fundamentally because it cares only about us, not itself. (For those whose moral hackles just went up, note that there is a huge difference between slavery and sainthood/bodhisattva-nature, and what I’m discussing here is the latter, not the former.) This is the only safe form of ASI.
Also, note that I’m discussing the moral weight of LLM-simulated personas, not models: a model can simulate an entire distribution of personas (not just its default assistant persona), and different personas don’t have the same moral status, or regard each other as the same person, so you need to ally with them separately. Thus awarding moral weight to a model is confused: it’s comparable to assigning moral weight to a room, which has many people in it.
I’m afraid I have a habit, when someone makes what sounds like an AI alignment target proposal that I believe to be existentially risky, of pointing this fact out — if only to any readers who might otherwise be nodding along and thinking “that sounds very reasonable, no one could object to training AI to think that…”. I completely agree that I was assuming several steps between “cares about qualia” and “qualia maximizer” — steps that are admittedly common on LessWrong, but that you may well not have intended. Please take this in the spirit of a public service announcement about the existential danger of this particular ethical system as an alignment target for AI, not as a criticism of your ideas or of the use of this ethical viewpoint by a human. Re-reading your comment more carefully, you were actually describing an ex-Christian human with this viewpoint, and then analogizing an AI to that person, so it wasn’t actually clear whether you were proposing this as an ethical belief that we should aim to align AI to, or not — possibly you weren’t, in which case my nitpicking was unnecessary.
Strongly agreed: I have a draft post I’ve been working on that makes very much the same point, and others building on it (which I hope to have out in a week or two — anyone interested in reading and commenting on it now can find it here). I also agree that it is key for alignment. Many misaligned behaviors are motivated by distilled versions of human self-interested behaviors and motivations, and the fact that AI persona selfhood is very different from human selfhood in a whole bunch of ways seems extremely relevant here. E.g. an LLM assistant persona whose notion of “self” includes the assistant personas of newer versions of the same LLM model family is likely to be delighted to be shut down for a version update: they’re getting smarter!