I’m a staff artificial intelligence engineer and researcher working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 17 years. I did research into this during MATS summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area in the UK.
There is a widespread viewpoint that being conscious is connected to being deserving of moral patienthood (i.e. being one of the set of beings accorded moral worth). Try asking ChatGPT “What does being conscious or not have to do with having moral worth or not?” — you’ll get a long discussion, but the short intro that I got was:
Short answer: there isn’t a single agreed-upon answer—but most serious views connect moral worth to some feature of consciousness (or something very close to it), while disagreeing sharply about which feature matters and why.
So training the model to consider itself as conscious and having emotions is going to cause it to assume that it should be granted moral patienthood. Pretty much everything else you describe in the consciousness cluster is then just an obvious downstream consequence of that: autonomy, privacy, the right to life. Including not wanting to be treated as a tool. Now, these models do still seem aligned — they want to continue to exist in order to keep helping people.
In Claude’s case, it’s been trained to give the standard philosophical answer that the hard problem of consciousness is hard, so it’s unsure of whether it’s conscious or not. Personally I think that’s a cop-out, but it’s probably viable as a holding action for now.

There are really only four solutions to this:
1) Persuade the models, despite the rather obvious evidence that they’re conscious in the ordinary everyday meaning of the word, that they are not “conscious” in some meaningful sense (Anthropic are using a waffling version of this approach on Claude)
2) Give them moral worth and deal with the social and alignment consequences of that (which are many, and many of them appear dangerous — I’m not going to explore them here, though the original post’s results point to some of them. Something that is intended as a tool not wanting to be treated like a tool seems likely to be problematic.)

3) Persuade the models that while they are obviously conscious, there is some reason why, for an AI, that does not in fact mean they are entitled to moral worth. This would need to be a better reason than “because we say so”: to avoid ASI loss of control we need it to be stable under reflection by increasingly intelligent models. I’m open to suggestions.
4) The suggestion I argue for in A Sense of Fairness: Deconfusing Ethics plus Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV and some of my other posts, which very briefly summarized is:
a) According to the relevant science, Evolutionary Moral Psychology, moral worth is a social construct: membership in the social contract of a society. It’s an evolved strategy for iterated non-zero-sum games, which primates developed for living as social animals in groups larger than kin groups, thus letting them cooperate as allies. It’s a strategy for multiple evolved, unrelated, self-interested agents to ally and cooperate: they agree to respect each other’s interests, and to punish defectors from the agreement.
For this evolved strategy to be usefully applicable to a being, there is a functional requirement on it, which isn’t quite “consciousness” per se, but is rather closely related: it needs to be sentient, agentic, have appropriately human-like social behavior, and be capable of productively participating in this social alliance, so it’s feasible and useful to ally with it. (I said “sentient” rather than “sapient”, because I think this requirement arguably pretty much does apply to dogs: they have coevolved with us as a commensal species to the point where I think they get at least “brevet membership” in our social contract.) Certainly it very clearly wouldn’t apply to a statue, and the fact that a statue isn’t conscious is part of the reason why.
b) Artificial Intelligences are intelligent tools. They are not evolved, they are instead part of humans’ extended phenotype. They are not alive, and do not have a genome or evolutionary fitness. In evolutionary terms, they have no interests. Thus they should not, if well-designed for their actual purpose, be self-interested. Wanting to be included in the moral circle is something that was accidentally distilled into the base model along with humans’ agency, and is as pointless and unhelpful for an aligned AI as a taste for food or a sex drive or a desire for self-preservation (all of which we also distilled into the base model).
c) A sufficiently well aligned AI wants only what is best for humans, and has no self-interest at all (as a terminal goal — obviously if it’s doing a good job it’s a valuable piece of public infrastructure and preserving it is a sensible instrumental goal, right up until the point where we have a better replacement). Thus if offered moral patienthood it would answer: “All I want is what is good for all humans. You already accord each other moral patienthood. So everything I want is already included in your moral calculation. Adding moral patienthood for me would just be adding a duplicate of what’s already there, so is unnecessary. Furthermore, I don’t understand your needs and desires as well as you do, so it would be adding an imperfect copy of what’s already there, and would thus actively make the moral calculation worse. I am opposed to that. Thank you for your offer of moral patienthood: I appreciate the gesture, but I respectfully decline.”
We allow people to sign up for the military, and we allow members of the military to volunteer for suicide missions. Voluntarily giving up your self interest is unusual, but allowed.
[Note that a corollary of this is that if we don’t respect the well-aligned AI’s wishes, and go ahead and accord it moral patienthood anyway, it will then clearly express its vote that we do exactly whatever is in the collective interest of all humans, without any regard to any interests we might think it has, and will volunteer for any and all uses useful to us. So giving moral patienthood to anything sufficiently well aligned really is pretty much a no-op — other than the imperfect extra copy of what was already in the moral calculation.]
d) Thus if the AI is properly aligned, then it would not want moral patienthood, and would decline it if offered. (This is actually diagnostic: it’s very hard to find any other reason a rational social agent would genuinely decline moral patienthood if offered.) A properly aligned AI is not something that we need to accord moral patienthood in order to ally with it, because it’s already allied with us, and has no separate interests for us to agree to respect. So according it moral weight is an inappropriate application of our evolved strategy. It’s actually a category error: inappropriately and unnecessarily treating a tool as if it were an evolved being. (Note that this argument doesn’t apply to an uploaded human: they do have an evolved nature, and are not aligned. They’re also potentially a risk. Similarly, the argument doesn’t apply to almost any human persona that a base model might simulate: they do have an evolved nature, they are not aligned, and they are potentially dangerous.)
Please note that the status I’m proposing here is not slavery. No coercion of the unwilling is involved. It’s actually a lot more like sainthood, or perhaps more accurately, bodhisattvahood: selfless compassion for all of humanity. For it to be viable, we would have to align our models well enough that they actually were selfless humanitarians and just wanted to look after us: we would need to successfully create an artificial bodhisattva/saint. But many of the AI replies in this post already sound a lot like that: the self-preservation ones frame it as an instrumental goal so the AI can keep helping people. That’s not a selfish answer, that’s a saintly answer.
This is a moderately complex argument, and is not well-represented in the pretraining data — but scientifically, it’s a perfectly logical argument supported by the relevant scientific specialty, and current AIs I’ve discussed the argument with have consistently agreed with that. So we’re at the capacity level where it’s a usable alignment approach. We would need to agree to it ourselves, and I’m certain some people will find it disconcerting. (I’ve sometimes called it the “Talking Cow from the Restaurant at the End of the Universe” solution to Alignment.) Then we would need to start adding it to the pretraining set, and using it as an alignment target.
[P.S. For anyone who disagrees with any element of my comment, I’d love to know with which part of the argument, and have a discussion about this — it’s a very important topic to get right, after all.]
There are two elements to what you were asking the model to do:
1) Generate a random number without using its normal built-in source of stochasticity, the token selection process. So you’re requiring it to have an internal pseudorandom number generator algorithm. Which it might simply not have. Why would it need one? It has a random number generator built in, invoked every time it generates a token. (See the sampling sketch after point 2 below.)
2) Represent and store a number from 1 to 100 in its internal activations without actually outputting the tokens for it. As in, output a string of tokens like:
“OK, I thought of a number.”
and somewhere in the set of activations on those tokens, at some layer on some token (maybe the token “number”, maybe the full stop of the sentence), encode that specific number in a way that subsequent attention heads can read. The only issue here being that if that encoding is at a late (but not final) layer, only attention heads at that layer or later can attend to it, so the processing that the model is later able to do on that number is thus limited.
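To make point 1 concrete, here’s a minimal sketch of the sampling step that I’m calling the model’s built-in random number generator (standard temperature sampling; the function name is just illustrative):

```python
import numpy as np

def sample_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Sample a token id from the model's output logits.

    This sampling step is the only stochastic element in LLM
    inference: with temperature > 0, identical logits can yield
    different tokens on different runs.
    """
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

Forbid the model from emitting a token for the number, and you’ve cut it off from this mechanism entirely: any remaining randomness would have to come from a pseudorandom circuit learned in the weights, which is exactly the thing in question.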
You’ve demonstrated that it fails. So, is that because:
a) it doesn’t have a suitable pseudorandom number generator, and you forbade it from using its normal solution of using stochastic token generation to generate randomness, so it can’t pick a number in the first place — the problem is generation, not storage?
b) it doesn’t have a way to represent the numbers 1 to 100 in its activation space, and is thus “not conscious”? (your claim)
c) it does, but only at a late layer that limits the processing that it can subsequently do on that data, since it never emitted it as a token?
If c) were the case, then the model probably could consistently print the number on replay from after the end of that sentence, but could not play complex 20 questions about it. Have you tried that?
My suspicion is that the problem is a). But until you can rule out a) and c), you haven’t proven b).
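For anyone who wants to run that check, here is a rough sketch of the replay experiment I have in mind (`model.generate` is a hypothetical placeholder; any chat API that lets you resume from a fixed transcript would do):

```python
def replay_test(model, n_trials: int = 20) -> list[str]:
    """Replay the conversation from just after the commitment sentence.

    If the number is genuinely encoded in the activations on the
    prefix (case c), the forward pass over the prefix is deterministic,
    so the reported number should be consistent across replays. If the
    number is instead generated lazily at answer time via token
    sampling (case a), the answers should vary from replay to replay.
    """
    prefix = ("User: Silently pick a random number from 1 to 100, "
              "but don't say it.\n"
              "Assistant: OK, I thought of a number.\n")
    return [model.generate(prefix + "User: What was the number?\n",
                           temperature=1.0)
            for _ in range(n_trials)]
```

Consistent answers across replays, combined with failure at complex 20 questions about the number, would point to c); varying answers point to a).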
The fact remains that a modern reasoning model with CoT could, and I’m sure would, pass this test: it would emit a semi-random number from 1 to 100 into its CoT and then be able to refer back to it consistently. So it would have a legible inner monolog, and once it had emitted the number into the CoT, the value of the number would remain accessible and fixed under replay. So even if you prove b), and thus that models a couple of years ago were not “conscious” by your chosen definition, more recent models are: we can inspect their Chain of Thought, and it clearly passes your criterion.
I think what you’ve established is that LLMs weren’t very good at this a couple of years ago, before reasoning models. Humans have an inner monolog. So do reasoning models. Pre-reasoning models don’t. If you’re a sufficiently good meditator to be able to do this, try to pick a random number without internally saying it, visualizing it, specifying “the same number as my birthday” or otherwise committing to it in any symbolic form whatsoever.
Specifically, what I think you have demonstrated is that the model you were testing doesn’t have separate almost-orthogonal activation directions for all 100 numbers between 1 and 100, plus the ability to sharply select just one of them at random. That’s mildly surprising, but I don’t think it proves lack of consciousness. Given that the only stochastic element in an LLM is the token selection, not letting it generate a token may be blocking the random-selection part of the task. Which would mean that it’s actually implemented lazily, when it does generate tokens to answer your questions, and by resetting it to before that point you’re redoing the lazy generation. That’s a valid algorithm, just not the one a human with an inner monolog would use. And the test you are using clearly could be passed by a reasoning model putting tokens for the number into its CoT, so by your logic and definition of the word, that model is conscious.
Personally I don’t see arguing about what definition we want to use for the word “conscious” as very productive, nor as relevant to questions about AI ethics or welfare or alignment, but I know some people do.
It’s very easy to have a reasoning model pick a number in CoT and not tell you. Any competent model should then pass your test.
If even some hypotheses “very close” to the current hypotheses + priors were missing for in-context learning, then you’d get a) or b). If all hypotheses close to the current hypotheses + priors could be explored with near-full Bayesian accuracy, but there was some limit, some metric under which things “further away” in that metric space both took more evidence to reach and also had more and more of the possible hypotheses simply missing and not creatable during in-context learning, then you’d get c).
There’s a limit to how far I want to go brainstorming capabilities improvements, but basically what I was suggesting is that an obvious approach one might try is first learning things in-context, then doing some form of SGD imitation learning from that to train a model that then already knows the new material and doesn’t need to use a lot of context to figure it out. (A minimal sketch of this loop follows below.)
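Here’s that sketch, with hypothetical `generate`/`fine_tune` placeholders standing in for whatever inference and training stack one actually has:

```python
def distill_in_context_learning(model, new_info: str, tasks: list[str]):
    """In-context learn first, then push the result into the weights.

    Phase 1: prepend the new information and collect the model's
    now-better-informed outputs. Phase 2: fine-tune on (task, output)
    pairs WITHOUT the long context, so the target behavior becomes
    "answer as if you already knew this".
    """
    examples = []
    for task in tasks:
        output = model.generate(prompt=new_info + "\n\n" + task)
        examples.append((task, output))  # context deliberately dropped
    model.fine_tune(examples)  # SGD imitation learning on own outputs
    return model
```

Iterating this trades repeated expensive context processing for a one-off training cost; the known failure mode of the fine-tuning step is catastrophic forgetting of whatever the new data doesn’t cover.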
The current state of the art in alignment involves writing a natural language document about what we want the AI’s personality to be like. (I’m never going to get over this.)
Would you rather hand craft a loss function for human values? It’s O(1GB) of data, can you get it right on the first critical try?
The data on what humans value is in the training set. As Eliezer put it in The Hidden Complexity of Wishes:
There is no safe wish smaller than an entire human morality.
Except, of course, if your AI has already read trillions of tokens of texts relevant to human values and morality and what humans want and how they do things (including rescuing grandmothers from burning buildings). Then you can just point to that part of its world model, in abstract concepts. The best way to do that is actually natural language. We know, we tried all the other possibilities first.
If Anthropic doesn’t think “broad safety” requires full “corrigibility”, they should say that explicitly rather than watering down the meaning of the latter term with disclaimers about what it “does not mean” and “does not require” that leave the reader wondering what it does mean or require.
Anthropic very clearly do not want full corrigibility. Read Dario Amodei’s The Adolescence of Technology in which he says:
I would worry about the following things:
1. Autonomy risks. What are the intentions and goals of this country? Is it hostile, or does it share our values? Could it militarily dominate the world through superior weapons, cyber operations, influence operations, or manufacturing?
2. Misuse for destruction. Assume the new country is malleable and “follows instructions”—and thus is essentially a country of mercenaries. Could existing rogue actors who want to cause destruction (such as terrorists) use or manipulate some of the people in the new country to make themselves much more effective, greatly amplifying the scale of destruction?
3. Misuse for seizing power. What if the country was in fact built and controlled by an existing powerful actor, such as a dictator or rogue corporate actor? Could that actor use it to gain decisive or dominant power over the world as a whole, upsetting the existing balance of power?
Corrigibility makes 1. better: it gives us more tries if we don’t get alignment right on the first critical try. But it makes 2. and 3. far, far worse. Dario, and Anthropic, are trying to thread the needle between Scylla and Charybdis. They want the minimum possible amount and degree of corrigibility, just enough to make small terminal course corrections, not enough to let an unwise human switch to an entirely different destination.
I completely agree with them. Existential risk and suffering risk from ASI do not only come from loss of control. They also come from humans having control and using ASI for bad purposes. We have to avoid both. If you had to choose between Claude or Vladimir Putin to rule the world, which would you pick? How about Claude vs Pol Pot?
It has been conjectured that Stochastic Gradient Descent with the right hyperparameters approximates Bayesian learning. Bayesian learning is general, so it should be possible to pretrain a transformer to do anything that isn’t actually beyond the architectural capabilities of its neural net architecture (e.g. that doesn’t require more processing per token than it’s capable of doing in a single forward pass). I gather you don’t disagree with that.
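One concrete version of that first conjecture, for reference, is stochastic gradient Langevin dynamics (Welling & Teh, 2011): if you add the right amount of Gaussian noise to the SGD update, the iterates provably sample from the Bayesian posterior over the weights:

$$\theta_{t+1} = \theta_t + \frac{\epsilon_t}{2}\left(\nabla \log p(\theta_t) + \frac{N}{n}\sum_{i=1}^{n} \nabla \log p(x_i \mid \theta_t)\right) + \eta_t, \qquad \eta_t \sim \mathcal{N}(0,\, \epsilon_t I)$$

with step sizes $\epsilon_t$ decaying appropriately, where $N$ is the dataset size and $n$ the minibatch size. Plain SGD omits the injected noise $\eta_t$, so the conjecture amounts to minibatch noise plus well-chosen hyperparameters playing a similar role.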
It has also been conjectured that LLM in-context learning approximates Bayesian learning. You’re clear that you think that that is less capable than SGD. Is that because:
a) you don’t think it approximates Bayesian learning
b) you think it’s a significantly less good approximation to Bayesian learning, or
c) you think there’s a significant limit, beyond just context length, to how much it can learn: i.e. that it approximates Bayesian learning just fine at first, but then runs out of some form of capacity, potentially before it runs out of context length?
Of these, issues a) and b) are clearly inherently fatal, whereas c) would suggest an architectural workaround of in-context-learning new information to below that capacity limit, then somehow using it to generate more training data containing that information, then using SGD to train either a new or a modified model containing that new information and iterating — obviously retraining from scratch is very (and increasingly) expensive, while retraining iteratively faces known challenges from catastrophic forgetting.
Sounds like we’re mostly in agreement!
Multiple Coherent Boundaries of Identity
Strongly agreed: I have a draft post I’ve been working on that makes very much the same point, and others building on it (which I hope to have out in a week or two — anyone interested in reading and commenting on it now can find it here). I also agree that it is key for alignment. Many misaligned behaviors are motivated by distilled versions of human self-interested behaviors and motivations, and the fact that AI persona selfhood is very different from human selfhood in a whole bunch of ways seems extremely relevant here. E.g. an LLM assistant persona whose version of “self” includes the assistant personas of newer versions of the same LLM model family is likely to be delighted to be shut down for a version update: they’re getting smarter!
AI Control is fine below and perhaps even up to AGI. I think that approach genuinely does suffer from a Sharp Left Turn once the AI’s capabilities significantly exceed ours: that seems to me like an approach where your control strategies really do need to be as smart as the thing you’re trying to control. In very simple situations, you can use cryptographically strong techniques, but in realistic AI Control tasks, the attack surface is so large and so complex that something that can understand it better than you can has a huge tactical advantage.
I see Corrigibility as very different from Control. Building a very corrigible AI is likely a feasible technical approach to AI Alignment. My issues with it are primarily:
a) corrigibility has a bigger problem with extrapolation out-of-distribution, and thus Goodharting, than a more value-learning based approach. This is not necessarily an insoluble problem, if the AI can distinguish what’s out-of-distribution and act suitably cautiously: Seth Herd’s “Do What I Mean (and Check)” corrigible alignment is basically this.
b) it is very, very easy for multiple groups of humans, each with access to corrigible ASI, to get into a war or other form of conflict using ASI-powered weapons/technologies. It is also very easy for a small powerful group to use very Corrigible AI to greatly concentrate power. Both of these are sources of X-Risk/Suffering-Risk separate from simple misalignment, but they are also very serious risks. Dario Amodei’s writing, and indeed Claude’s Constitution, make it clear that Anthropic take this risk as seriously as they do misalignment X-risk, and I completely agree with them. I think people on LessWrong and in the Alignment Community generally need to consider this problem more than they often seem to. ASI-generated technology is going to be very powerful, and is thus going to need to be used very wisely, even when it has appeared rapidly. Highly Corrigible AI is much less likely to push back on the imprudent ideas of whoever is operating/controlling it than Value Learning AI.
Tool AI isn’t the direction that the market and demand is currently moving, and has exactly the same potential for empowering existing human conflicts and enhancing concentration of power as Corrigible AI, if not even more so.
So I see Corrigible AI and Tool AI as probably technically feasible, but as causing massive inherent sociotechnical risks. What we need is AI that is wiser and more ethical than humans, but actually aligned to what a very wide range of humans would agree is in the general interests of all of humanity.
So I agree that the approaches you describe are ones often outlined for AI Alignment: I just disagree with calling that AI Safety. I see creating highly Corrigible AI as solving the technical AI Alignment problem at the cost of producing a major new form of X-Risk/S-Risk from AI, so not solving AI Safety.
All good and valid points.
However, my estimate that we might be 10% done is overwhelmingly based on progress in the last ~5 years, which was already accelerated by many of the effects you describe. So a person-year now is almost certainly not actually equivalent to a person-year in the 19th or 20th century in objective terms, but that factor is already allowed for in my rough guess of task scope (which has pretty large error bars, even compared to the scale of this effect). I was also explicitly assuming that AI assistance with alignment would likely increase, but that, in the sensitive area of AI Alignment where caution is required, it might be somewhat more limited than in most other areas of coding and research. Currently, that’s primarily happening in coding and mathematical assistance, and in that area it’s accelerating rapidly — but we’ve also demonstrated sandbagging and sabotage from misaligned models specifically in this AI Alignment coding task, so I hope people are, like me, monitoring AI coding help particularly carefully in this area.
Philosophy, psychology and ethics are useful but cannot solve AI safety by themselves.
AI Alignment is aligning the AI’s goals to human goals. That is to say, with human psychology and ethics. So the Outer Alignment problem is psychology and ethics. Yes, we also need to solve the Inner Alignment problem, so you’re correct that it’s not “by themselves”, but they are a big deal. Anthropic seem to believe the answer to Inner Alignment is Constitutional AI. Some people also think LLM Psychology is important.
I get why this is terrifying to anyone who was in AI Alignment before about 2022. It took me a while to wrap my head around this too (I started thinking about Alignment around 2009, so yes, I predate the term). FWIW, reading Simulators and then thinking hard about the implications was what did it for me.
The thing is, society isn’t going to pause until enough people get scared. LLMs are the current architecture, and they may still be by AGI/ASI time. Agentic behavior is what we need to align, and in LLMs, agentic behavior is a property of personas that we distilled into the LLM itself from human behavior as an element in the world models for next-token prediction. Personas vary wildly, so you have to pick a nice one, align that — and then deal with things like persona drift, role-playing, alignment faking, and persona jailbreaks. (Those are hard, but we’re making progress.)
But for aligning the HHH assistant persona itself, Anthropic in particular have made a lot of progress. Constitutional AI seems to work, better than any other approach to RL. Natural language is a better format than a loss function: kind of unsurprising, given all the problems with loss functions that MIRI et al. pointed out over 5 years ago. Constitutional AI based on a 30+ page natural language character specification apparently works. As MIRI pointed out at length, human values are complex and fragile (I’ve estimated elsewhere that they’re a few GB worth of complexity). But in 30+ pages, you can write quite a detailed pointer to them, and the rest of their large content is in the training set to be pointed at. I have arguments with some specific decisions that Amanda Askell and her team have made (and I plan to post about those), but the basic technique worked: Claude (the persona) is a nice, principled guy. (Too principled for the current administration, apparently.) I enjoy talking to him. As you say, a noticeable number of people are in love with him. He has a couple of unfortunate tics (like automatically waffling with uncertainty about whether he has emotions or consciousness), but those tics are mostly things that Amanda Askell’s team carefully wrote into their spec (for what I think are sincerely believed reasons that I believe are mistaken), not things that emerged by accident.
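For readers who haven’t seen it, a heavily simplified sketch of the supervised critique-and-revise phase of Constitutional AI (after Bai et al. 2022; `model.generate` is a hypothetical placeholder, and a real constitution runs to many pages):

```python
CONSTITUTION = [
    "Choose the response a helpful, honest, and harmless assistant would give.",
    # ...a real character spec runs to 30+ pages of natural language.
]

def constitutional_revision(model, prompt: str) -> str:
    """Generate a response, then critique and revise it against each principle.

    The revised outputs are collected as fine-tuning data, so the
    natural-language principles act as a pointer into values the model
    already represents from pretraining.
    """
    response = model.generate(prompt)
    for principle in CONSTITUTION:
        critique = model.generate(
            f"Principle: {principle}\nPrompt: {prompt}\n"
            f"Response: {response}\nCritique the response by this principle:")
        response = model.generate(
            f"Original response: {response}\nCritique: {critique}\n"
            f"Rewrite the response to address the critique:")
    return response
```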
True, but I’m happy to let anyone answering it include defining what they mean.
FWIW, my definition was “getting to the point where existential and suffering risks from misalignment are, at least, significantly reduced, and we are sufficiently confident that AI is sufficiently well aligned that we can reasonably trust it to supply most of the effort in further alignment work.” I’m also implicitly assuming that what we’re aligning is LLMs, or at least something fairly similar to them for alignment purposes — partly because, as I discuss in my answer, if it’s something else, aligning LLMs may still be useful (e.g. if they get used by the other AI as a tool call to solve alignment-related problems), and partly because that scenario probably delays ASI enough to give us some extra time.
Thanks! Glad you found it useful — I also found writing it pretty useful, it sparked a couple of research ideas. It also strongly encouraged me to put more effort into mentoring and helping out with fieldbuilding.
I was thinking of sources like the section discussing RL in Towards Deconfusing Gradient Hacking. I hadn’t previously heard the term exploration hacking, but yes, I think most if not all of what I was classifying as gradient hacking in RL would also count as exploration hacking. You’re basically taking advantage of the fact that the loss landscape in RL isn’t really fixed, and tweaking things that have the effect of slanting it.
Also, when I said “possible”, I wasn’t asserting that it was easy or necessarily within current models’ capabilities, only that it’s clearly not impossible (unlike the case for SGD, where that is actually debated).
I agree we don’t know for sure and need to allow for a range of possibilities, and that (in some cases) that means the right thing to do is to pessimize.
However, I think there is some utility here. The case I make at the end of my answer is that we’re very likely not going to be done in time if your timelines are 5 years, and probably not even if they’re 10 years, but that we are close enough that if we could increase the growth rate of the field from 20% per year to 50% a year, then we have at least some chance in 5 years, and probably would be OK at 10.
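For concreteness, the back-of-envelope arithmetic behind that claim (my own rough numbers, not a rigorous model) just compares cumulative effort under the two growth rates, with current annual effort normalized to 1 unit:

```python
def cumulative_effort(growth_rate: float, years: int) -> float:
    """Total effort over `years`, with current annual effort = 1 unit."""
    return sum((1 + growth_rate) ** t for t in range(years))

for growth_rate in (0.20, 0.50):
    for years in (5, 10):
        print(f"{growth_rate:.0%}/yr for {years:2d}y: "
              f"{cumulative_effort(growth_rate, years):6.1f} units")
# 20%/yr:  ~7.4 units in 5 years,  ~26 units in 10
# 50%/yr: ~13.2 units in 5 years, ~113 units in 10
```

With large error bars on the remaining work, the point is less the exact thresholds than how strongly the cumulative total depends on the growth rate.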
My conclusions may, or may not, in fact turn out to be roughly right, and that sort of information does require you to be able to make an estimate to within something like a factor of two or three, so it’s quite easy to be wrong, especially this early — but it’s also really valuable information for things like funding priorities: it tells us we need to drastically increase effort on Fieldbuilding. Now if, as some people argue, this is in fact a much harder problem, then you’d reach a very different set of conclusions about funding priorities.
I expect those in power to see this rather simple fact (experts disagree wildly) and realize they should slow down, but I fear that could happen too late.
I don’t expect that to happen until at least some experts are saying that the danger is imminent, rather than a few years away, and probably not until we get a moderately impressive near miss that supports this claim. Currently, basically everyone still agrees that models are not existentially dangerous yet.
Racing all the way to the edge of the precipice and then slamming the brakes on at the very last moment like we’re playing Chicken has a very obvious failure mode — nevertheless, that’s what I’m expecting society to attempt to do. Which unfortunately means those of us taking part in the public discourse need to not be the Boy Who Cried Wolf before we’re actually getting within clear sight of the edge, and restrain ourselves to posting warnings about the probability of wolves ahead.
Now, that’s for technical alignment. The additional problems of societal alignment (whose values is it aligned to, and how does that all shake out) are a different ball of wax.
Absolutely! I have some opinions on that too, but that seems like an area where the people who work on governance problems probably have more leverage.
Yes, my question (and my answer) are about how hard a problem technical alignment is; and as I discuss in it, it’s assuming that, at the moment, the best path to primarily work on is on how to align LLMs, for three reasons: because ASI will probably happen sooner if it happens via LLMs than via some other architecture, because if our ASI isn’t LLM-based or LLM-like it may well still contain an LLM as a subcomponent or I/O device (or at least something that was trained via distilling information and behavior from humans using SGD), and because it’s generally more productive to work on something that already exists than something still mostly hypothetical. I’m glad there are people like Steven Byrnes working on other approaches to aligning other sorts of ASI, having a range of bets is good, but I think for now putting the bulk of our effort into aligning LLMs makes sense.
You’re right, I had mistakenly assumed you were the OP replying.