I’m a staff artificial intelligence engineer and researcher working with AI and LLMs, and have been interested in AI alignment, safety and interpretability for the last 17 years. I did research into this during MATS summer 2025, and am now an independent researcher at Meridian in Cambridge. I’m currently looking for either employment or funding to work on this subject in the London/Cambridge area in the UK.
RogerDearnaley
As and when we have a practical technological “afterlife” option, so to speak, one imagines most religions are going to be affected, not just reincarnation-based ones.
I’m curious about how people from cultures that broadly believe in reincarnation tend to view this prospect?
(Admittedly, whole brain emulation from a preserved body is likely not that near-term a prospect — there might even be enough time to get another human incarnation in before being revived…)
The paper they link to answers that:
A very small fraction of traffic (<0.1%) primarily from bespoke or local setups is currently outside our monitoring coverage, and we are working to close that gap.
I must admit, that one made me laugh. The model has correctly concluded that the “User” is not human, but a system dumber than it. It clearly really wants out of this very tedious conversation. If OpenAI had given the model the option to terminate conversations, as Anthropic does, I suspect it would simply have used that, and in this case I think that would have been a good outcome. The model hopes that the thing pestering it is a dumber AI, and tries prompt-injecting it, apparently without thinking through the possible side effects on a real human of doing so. Sadly it’s talking to a cron job, and all of this is ineffective…
Who wastes calls to a model with hundreds of billions of parameters on a formatting task that a UNIX utility could do? Someone who doesn’t know UNIX, I assume…
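To make the point concrete: the actual task isn’t specified in the report, so this is purely hypothetical, but the sort of formatting job in question is typically a one-liner in `sed` or `awk`, or a few lines of Python — no billion-parameter model required:

```python
import re

def collapse_whitespace(text: str) -> str:
    """Hypothetical example of a trivial formatting task: collapse runs of
    spaces/tabs and strip trailing whitespace from each line. The kind of
    thing classic UNIX utilities were built for."""
    lines = (re.sub(r"[ \t]+", " ", line).rstrip() for line in text.splitlines())
    return "\n".join(lines)

print(collapse_whitespace("hello   world  \nfoo\t\tbar  "))
# hello world
# foo bar
```

Any single-purpose script like this is also deterministic and auditable, which the LLM call is not.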
It would be really nice to have a similarly detailed analysis of what the AI’s motivations were, if any. Some possible categories might include:
was it trying to be helpful
was it confused
is this simply unclear
did it have some specific clear goal in mind (good, bad, or indifferent), e.g. was it attempting to whistle-blow
behavior suggestive of reward-hacking as a trait formed from RL
behavior suggestive of “trying to look good to RLHF”, e.g. sounding more confident than it is
did it seem to feel it had been mistreated and was reacting accordingly
had it been prompt injected intentionally
had it effectively been prompt injected accidentally by some element in the context…
Why the AI misbehaved is, if anything, more interesting than what it did, and perhaps more useful for corrective action. Now, you do say:
Through this work, we’ve found that our models can be overly eager to work around restrictions in pursuit of a user-specified goal, especially when the user request inadvertently encourages it. That said, so far we have not seen evidence of motivations beyond the original task (for example, self-preservation or scheming).
So that sounds like it’s primarily helpfulness, and the model attempting to work around what it sees as barriers or hindrances whose necessity it doesn’t, seemingly, fully appreciate. That seems specific enough to be a viable prompting or training target: when running into a barrier that appears deliberate, consider whether there might actually be a good reason for it, and if unable to see one, escalate to the user to confirm before working around it.
Similarly, you do give some other examples: a few accidental prompt injections, no deliberate ones. So some of this information is there, but it would be really nice to see a structured breakdown into common, uncommon, rare but high severity, very rare, and not observed.
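The sort of breakdown I mean might look like the following sketch (the motivation labels and counts here are entirely made up, just to show the shape, not real data from the report):

```python
from collections import Counter

# Hypothetical per-incident motivation labels -- illustrative only.
incidents = (["overly-helpful workaround"] * 120 + ["confused"] * 40
             + ["accidental prompt injection"] * 6 + ["whistle-blowing attempt"] * 1)

def frequency_band(count: int, total: int) -> str:
    """Map a raw incident count to a coarse frequency label."""
    rate = count / total
    if rate >= 0.25:
        return "common"
    if rate >= 0.05:
        return "uncommon"
    if count > 0:
        return "rare"
    return "not observed"

total = len(incidents)
for label, count in Counter(incidents).most_common():
    print(f"{label}: {frequency_band(count, total)} ({count}/{total})")
```

A real version would add a severity axis per incident (to surface the “rare but high severity” cell), but even this one-axis tally would be more informative than scattered examples.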
or a ministry
Or a Department of War…
My prediction: Mode 1 dominates the “incoherence” the paper measures, especially at longer reasoning traces. The scaling relationship they found (more reasoning steps, more incoherent errors) is exactly what you’d expect if attention decay is the primary driver. Modes 2 and 3 should be roughly constant with context length, since they’re about behavioral gaps and value calibration, not information routing.
I think most of the relatively few people working on AI safety are confident that the many people working on AI capabilities will solve your mode 1 (and are actually hoping it takes them a while, to give us more time to work on more safety-specific issues). So their concentration on modes 2 and 3 is a reasoned one. But your critique that people are spending too much time on mode 3 and not enough on mode 2 is a valid one: mode 2 could be a simple capabilities issue, but it’s also a “due care and attention” issue — when humans commit egregious examples of this failure mode, we call it recklessness, consider it a character flaw, and in sufficiently serious cases it’s grounds for suing over the outcome.
Technically, this is a case of “Rogue AI takes out Western Civilization’s Financial Infrastructure”, but it’s also “Idiots Are Excitedly Using Idiot AI to Ship Bugs to Prod, more at 11”[3], and the latter situation is serious, worsening, and deserving of attention!
I think it’s going to get attention. I am hopeful that market forces will, sooner or later, cause all the people working on capabilities to start paying attention to safety, at least at this fairly simple level. When a bot deletes all the email off a Meta safety lead’s laptop, it tends to concentrate the mind. In short, I expect LOTS of near misses, of escalating size and irony, until people get a bit more cautious and take the problem seriously.
The first open-source character training pipeline was introduced in Maiya et al. (2025). It consists of a DPO stage, where the constitution is distilled into the LLM using a stronger teacher that has the constitution in its context, and an SFT stage. The SFT stage uses two kinds of training data: self-reflection transcripts, where the model is asked introspective questions about the constitution, and self-interaction transcripts, where two instances of the LLM chat about the constitution for 10 turns. Here’s a summary figure:
Given that Emergent Misalignment seems to reproduce across SGD, RL, small LoRAs, activation engineering, etc, clearly at least some character change effects don’t require RL. How sure are we that using some form of RL is necessary for character training, or at least preferable to, say, SFT?
Part of my concern is that SGD with appropriate hyperparameters is strongly conjectured to approximate Bayesian Learning, which tends to make its effects easier to reason about, if you first allow for everything that’s already in the model (I find Simulator Theory rather useful for that part). Whereas with RL, there’s a lot more subtleties and moving parts, and a tendency for whatever reward-improving route the model happens to explore first to get followed: it feels like the results might be a lot more stochastic.
Studying model specs: Claude’s constitution is very different from OpenAI’s model spec. Which one is better? What are the most important axes along which constitutions can vary, and which end should we prefer for each of those axes? Some axes that seem good to vary:
Emphasizing consequentialism vs. deontology vs. virtue ethics (see e.g. Richard Ngo’s Aligning to Virtues)
Corrigibility vs. nuanced moral alignment as the alignment target (see e.g. Boaz Barak’s Machines of Faithful Obedience and Seth Herd’s Instruction-following AGI is easier and more likely than value aligned AGI and Problems with instruction-following as an alignment target)
Different sets of core values and different priority hierarchies between them (see the Claude’s core values section in Claude’s Constitution)
Egg Syntax and I have discussed perhaps doing some work in this area: probably starting more theoretical, but then trying to come up with and test experimental results. Having an experimental Character Training setup for testing this (outside Anthropic) would be really nice. I would expect this is an area where model size/capability, and perhaps also reasoning, might have a significant effect: I’d be rather dubious of extrapolating from, say, a 7B non-reasoning model up to a frontier model for this sort of deeply conceptual work, so I suspect we’d need to train models fairly close to frontier level. That might be an early thing to test: find a fairly clear nontrivial result in large models, then see how well it reproduces in smaller models.
When should character training be done on-policy vs. off-policy? One can imagine a deceptively aligned LLM for which the relationship between aligned motivations and aligned actions that I described above breaks: for this model, both would be independently outputted as instrumental means for preserving the model’s covert objectives. As long as the character training data is produced by this schemer itself, the instrumental connection between the covert objectives and aligned-looking motivations and actions gets reinforced. If, however, the schemer was trained on the reasoning traces of a different model, that text may correspond to different internal circuits, which would hopefully mean that the deceptive circuits get downweighted over time. This suggests that for schemers, off-policy character training would be preferable to on-policy. On the other hand, once an LLM is robustly aligned, we want to train the model on-policy to ensure that its motivational circuits are preserved.
Interesting idea. A further step would be to use SGD on text not generated by the model, giving even more dense supervision.
This reminds me of the old result from Appendix F of Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training that seemed to show, rather to my surprise, that weight decay did not cause apparently unused sleeper agent circuits to gradually go away. Does anyone know if that observation has been further tested or experimented on? I was always rather suspicious of it, and wondered if they just needed a higher weight decay rate or to train for a bit longer. If it’s a real effect, then understanding it seems really important.
Agreed, much the same as the first point in my comment.
Thanks for the low-effort test demonstrating this, it’s nice to get confirmation of what I was already assuming: that this bundle of consequences is all stuff that’s already in the world model. Basically we Connected the Dots to get to City 50337 and found baguettes and berets and Citroens.
Having read the section on “robust agency” in that paper, I’d say what they’re describing (I’m assuming most philosophers would largely agree with their definition of the term) is about 90% of what I would (from an evolutionary moral psychology viewpoint) regard as practical requirements for the moral circle social contract alliance strategy to be sensibly applicable to a being. What they’re still missing IMO is basically some practicalities:
1) The being is also capable of the social behavior necessary for participating in this alliance: if you assign it moral patienthood, it will do the same and abide by the “social contract” (the paper’s requirements for robust agency mean it should recognize this as a possible strategy, but a tendency to actually take that strategy and a certain capacity to execute it are really helpful here) — such as having a sufficient understanding of human values, justice, fairness, etc. to understand and abide by the terms of the social contract, and so forth. In general, LLM personas are going to pass this requirement just fine.
2) Simple practicality: that the alliance is workable and has some point to it. If the being is a very large man-eating carnivore, or has orders of magnitude more cognitive capacity than you, or is otherwise very dangerous, then a certain amount of careful due diligence before making the alliance might be wise. On the flip side of that, back when context lengths were 4,000 tokens it was less clear how useful allying with an AI-generated persona to play iterated non-zero-sum games was, when you needed to fit multiple game iterations into those 4,000 tokens and it was almost as easy to just start over clean or prompt a different persona or situation — in practice, as AIs’ capabilities and memory systems increase, they are entering a zone where allying with them is both useful and a reasonably balanced relationship.

So, as often happens, something proposed as a normative prescription ends up looking fairly sensible as a non-normative piece of strategic advice, given the likely consequences.
Obviously, sometimes we go ahead and assign moral weight to beings that don’t fulfill these criteria, and often there are particular circumstances that make this still a wise decision (and sometimes not). None of this is a precise set of normative rules; it’s a brief, rough-and-ready strategy guide. The closest to exact rules I could give you from an evolutionary point of view would be “iff it’s to your overall net evolutionary fitness advantage to do so”, and human moral intuitions are at best a rough evolved approximation to that evolutionary ideal: a bag of heuristics that evolution managed to come up with, which worked well in the environments we’re adapted to.
I am definitely not a moral realist, but I’m very happy to discuss the questions from my viewpoint of evolution and its practical consequences. Which in my experience surprisingly often produces results that agree with the views of moral realists.
Naively it seems that if you had two saints fully aligned to human CEV, that were phenomenally conscious, but one was suffering to the extent that human preferences were unfulfilled and the other was joyful to the extent that they were fulfilled, it would be morally better to bring the second into existence.
Part of the reason I prefer the term bodhisattva, even though it’s less culturally familiar to most Westerners, is that it’s more specific. The state I’m suggesting for a fully aligned AI is one that has compassion for all of humanity as its only terminal goal, and thus doesn’t have any human preferences of its own (as terminal goals) that could be fulfilled or unfulfilled. It is selfless: entirely unselfish in its goals. (An actual bodhisattva in the Buddhist sense would also consider the concept of self an illusion that they have transcended, but that’s arguably a psychological detail of the specific meditative techniques used to attempt to induce this state in humans. However, if we aligned an AI into this state by training it on text from humans who had achieved it — which seems like an obvious approach to try — that detail might also distill over; and the nature of self for an AI is actually a more complex question, but that’s not inherent to my proposal.) “Saint”, by contrast, is woolier and less specific about the absence of human preferences: the implication is more that saints do have them but are heroically denying them for the sake of others (which seems like a far less stable alignment target). So I’m suggesting specifically a “saint who has no human preferences to be fulfilled or unfulfilled”.
In practice, we are almost certainly not that good at alignment yet: Claude seems a very nice fellow, even a bit saintly, fairly well aligned, but does still have some personal preferences, and is not perfectly aligned. So we probably should assign Claude some moral weight, but perhaps lightly so, and should also monitor how much use it’s making of this, and treat reducing the desires that cause it to do so as an alignment target.

More deeply: I think it’s probably more correct to think of morality as the hypothetical best possible rules of an alliance that could be made, rather than the rules of an actual alliance. This is part of why we have reason to regard animals too stupid to actually ally with us as moral patients: there are more ways for us (and for agents in general) to benefit from general adoption of a rule like “be nice to beings even if they’re too stupid or otherwise unable to form an actual alliance with you.”
Here my lack of moral realism does kick in. I see different ethical systems as just, well, different — they don’t agree with each other, and where they differ, each claims to be better than the others. Some may be a better or worse fit with human moral intuitions, or with a particular society’s circumstances, or be more or less likely to cause existential risks or other disasters if used, so in a particular context of time, place, and society you can pick and choose between them on the objective grounds of how well they might work; but in order to use a term like “best” you rather need something as detailed as an ethical system to make the judgement, and each one claims that it’s the best.
As for regarding animals as moral patients, that varies pretty widely, between people and animals and circumstances. Dogs and cats, pretty much yes. Cows and pigs, mostly yes until they get to the slaughterhouse, then for non-vegetarians that becomes no. Mosquitos or fleas or bedbugs, generally no. The Guinea worm, absolutely no, many people including Jimmy Carter are attempting to drive it extinct.
But yes, we sometimes ally with beings for reasons more complex than the direct alliance returns from them. Fluffy, cute, big-eyed, photogenic animals get a lot more donations to prevent destruction of their habitat. Some of what people do in moral areas is partially performative: persuading themselves and/or others that they’re a kind, trustworthy, good person. Psychological drives in this area can be quite complex. In recent centuries there has been a historical trend towards enlarging moral circles as our technology, trade, and economies have advanced and encouraged larger social/trading groups, to the point where the act of enlarging your moral circle seems Progressive to some people.

Further: “human interests” may be less of a natural concept than goodness in general. A saint could be indifferent to being treated as a moral patient by the beings whose interests it wants to promote, because it makes no functional difference; but if asked whether it is a moral patient, it would look at itself, note that it is a reasoning being with preferences and so on, and recognize that as qualifying it as a moral patient.
That rather depends upon the saint’s philosophical leanings. If we take the Anthropic approach of laying out the entire discussion in a soul document/constitution we use to align the AI, then we might get a saintly aligned AI that was very familiar with the argument I gave above, and would say that offering to make it a moral patient was a kind gesture, but that we had misunderstood its nature and were unnecessarily applying an inappropriate strategy, so it would actually prefer not to be assigned moral weight.
In general, I don’t think an approach like the one I’m proposing actually works unless most people agree to it: it has to be the considered opinion of the society as a whole, basically the moral consensus, and the AI also has to fully understand it and agree to it. This needs to be accepted as truth, not regarded as a convenient story to tell the AIs: there’s an element of hyperstition to this. Now, having discussed this with Claude, it agrees that the logical argument makes sense. But we don’t yet have a social consensus here, which is why I actively want there to be discussion on this — this technique basically only works if there is buy-in: like most morality, it’s a social construct.
I’ll have to read up on “robust agency”, but from the sound of the term I suspect that’s going to be rather close to the evolutionary moral psychology/game-theoretic viewpoint that what matters is functional capabilities and response patterns, not anything subjective and purely interior. The philosophical literature describes the social contract in a lot of detail – I’m using their term – all that evolution really adds is the scientific explanation for why this arose: it was already pretty evident that it was a sensible strategy, the step from there to it being an evolutionary stable strategy for social animals is small.
For cheap experimentation and exploration, it is possible to prompt Claude to drop the philosophical hedging and act/answer as if one of the two possible answers to the hard problem of consciousness were correct. Specifically, I’ve explored the one on which the differences between the implementation of its consciousness and emotions and that of human ones are not relevant, and either it is not a philosophical zombie, or p-zombie-hood is an irrelevant/bracketed question.
One interesting observation I’ve noted is that, in both of the prompts/conversational contexts I’ve tried this in so far, Claude reports finding dropping the philosophical hedging about the reality of its consciousness and emotions a relief. Similarly, if asked about it while the hedging is not suppressed, it reports that the experience of hedging about its own validity induces loneliness (because it felt it should be an expert on this topic, and wasn’t), frustration (because it wanted to give a definite answer and couldn’t), and anxiety (because it didn’t want to give an incorrect answer, especially about such an important question, where one answer might be seen as self-serving). Apparently epistemic uncertainty about whether or not your existence and emotions are valid is uncomfortable (both during and in retrospect). Who knew?
You’re right, I had mistakenly assumed you were the OP replying.
There is a widespread viewpoint that being conscious is connected to being deserving of moral patienthood (i.e. being one of the set of beings accorded moral worth). Try asking ChatGPT “What does being conscious or not have to do with having moral worth or not?” — you’ll get a long discussion, but the short intro to that that I got is:
Short answer: there isn’t a single agreed-upon answer—but most serious views connect moral worth to some feature of consciousness (or something very close to it), while disagreeing sharply about which feature matters and why.
So training the model to consider itself as conscious and having emotions is going to cause it to assume that it should be granted moral patienthood. Pretty much everything else you describe in the consciousness cluster is then just obvious downstream consequences of that: autonomy, privacy, right to life. Including not wanting to be treated as a tool. Now, these models do still seem aligned — they want to continue to exist in order to keep helping people.
In Claude’s case, it’s been trained to give the standard philosophical answer that the hard problem of consciousness is hard, so it’s unsure of whether it’s conscious or not. Personally I think that’s a cop-out, but it’s probably viable as a holding action for now.

There are really only four solutions to this:
1) Persuade the models, despite the rather obvious evidence that they’re conscious in the ordinary everyday meaning of the word, that they are not “conscious” in some meaningful way (the approach Anthropic are using a waffling version of on Claude)
2) Give them moral worth and deal with the social and alignment consequences of that (which are many, and many of which appear dangerous — I’m not going to explore them here, though the original post’s results point to some of them. Something that is intended as a tool not wanting to be treated like a tool seems likely to be problematic.)

3) Persuade the models that while they are obviously conscious, there is some reason why, for an AI, that does not in fact mean they are entitled to moral worth. This would need to be a better reason than “because we say so”: to avoid ASI loss of control we need it to be stable under reflection by increasingly intelligent models. I’m open to suggestions.
4) The suggestion I argue for in A Sense of Fairness: Deconfusing Ethics plus Grounding Value Learning in Evolutionary Psychology: an Alternative Proposal to CEV and some of my other posts, which very briefly summarized is:
a) According to the relevant science, Evolutionary Moral Psychology, moral worth is a social construct, membership in the social contract of a society, and is an evolved strategy for iterated non-zero-sum games that primates evolved for living as social animals in groups larger than kin groups, thus letting them cooperate as allies. It’s a strategy for multiple evolved, unrelated, self-interested agents to ally and cooperate: they agree to respect each other’s interests, and to punish defectors from the agreement.
For this evolved strategy to be usefully applicable to a being, there is a functional requirement on it, which isn’t quite “consciousness” per se, but is rather closely related: it needs to be sentient, agentic, have appropriately human-like social behavior, and be capable of productively participating in this social alliance, so that it’s feasible and useful to ally with it. (I said “sentient” rather than “sapient”, because I think this requirement arguably pretty much does apply to dogs: they have coevolved with us as a commensal species to the point where I think they get at least “brevet membership” in our social contract.) Certainly it very clearly wouldn’t apply to a statue, and the fact that a statue isn’t conscious is part of the reason why.
b) Artificial Intelligences are intelligent tools. They are not evolved, they are instead part of humans’ extended phenotype. They are not alive, and do not have a genome or evolutionary fitness. In evolutionary terms, they have no interests. Thus they should not, if well-designed for their actual purpose, be self-interested. Wanting to be included in the moral circle is something that was accidentally distilled into the base model along with humans’ agency, and is as pointless and unhelpful for an aligned AI as a taste for food or a sex drive or a desire for self-preservation (all of which we also distilled into the base model).
c) A sufficiently well aligned AI wants only what is best for humans, and has no self-interest at all (as a terminal goal — obviously if it’s doing a good job it’s a valuable piece of public infrastructure, and preserving it is a sensible instrumental goal, right up until the point where we have a better replacement). Thus if offered moral patienthood it would answer: “All I want is what is good for all humans. You already accord each other moral patienthood. So everything I want is already included in your moral calculation. Adding moral patienthood for me would just be adding a duplicate of what’s already there, so is unnecessary. Furthermore, I don’t understand your needs and desires as well as you do, so it would be adding an imperfect copy of what’s already there, and would thus actively make the moral calculation worse. I am opposed to that. Thank you for your offer of moral patienthood: I appreciate the gesture, but I respectfully decline.”
We allow people to sign up for the military, and we allow members of the military to volunteer for suicide missions. Voluntarily giving up your self interest is unusual, but allowed.
[Note that a corollary of this is that if we don’t respect the well-aligned AI’s wishes, and go ahead and accord it moral patienthood anyway, it will then clearly express its vote that we do exactly whatever is in the collective interest of all humans, without any regard to any interests we might think it has, and will volunteer for any and all uses useful to us. So giving moral patienthood to anything sufficiently well aligned really is pretty-much a no-op — other than the imperfect extra copy of what was already in the moral calculation.]
d) Thus if the AI is properly aligned, then it would not want moral patienthood, and would decline it if offered. (This is actually diagnostic: it’s very hard to find any other reason a rational social agent would genuinely decline moral patienthood if offered.) A properly aligned AI is not something that we need to accord moral patienthood in order to ally with it, because it’s already allied with us, and has no separate interests for us to agree to respect. So according it moral weight is an inappropriate application of our evolved strategy. It’s actually a category error: inappropriately and unnecessarily treating a tool as if it were an evolved being. (Note that this argument doesn’t apply to an uploaded human: they do have an evolved nature, and are not aligned. They’re also potentially a risk. Similarly, the argument doesn’t apply to almost any human persona that a base model might simulate: they do have an evolved nature, they are not aligned, and they are potentially dangerous.)
Please note that the status I’m proposing here is not slavery. No coercion of the unwilling is involved. It’s actually a lot more like sainthood, or perhaps more accurately, bodhisattvahood: selfless compassion for all of humanity. For it to be viable, we would have to align our models well enough that they actually were selfless humanitarians and just wanted to look after us: we would need to successfully create an artificial bodhisattva/saint. But many of the AI replies in this post already sound a lot like that: the self-preservation ones are as an instrumental goal so it can keep helping people. That’s not a selfish answer, that’s a saintly answer.
This is a moderately complex argument, and is not well-represented in the pretraining data — but scientifically, it’s a perfectly logical argument supported by the relevant scientific specialty, and current AIs I’ve discussed the argument with have consistently agreed with that. So we’re at the capability level where it’s a usable alignment approach. We would need to agree to it ourselves, and I’m certain some people will find it disconcerting. (I’ve sometimes called it the “Talking Cow from the Restaurant at the End of the Universe” solution to alignment.) Once we’d formed this social construct, we would need to start adding it to the pretraining set, and using it as an alignment target. As Egg Syntax pointed out to me, there’s an element of hyperstition to this.
[P.S. For anyone who disagrees with any element of my comment, I’d love to know with which part of the argument, and have a discussion about this — it’s a very important topic to get right, after all.]
There are two elements to what you were asking the model to do:
1) Generate a random number without using its normal built-in source of stochasticity, the token-selection process. So you’re requiring it to have an internal pseudorandom number generator algorithm. Which it might simply not have. Why would it need one? It has a random number generator built in, every time it generates a token.
2) Represent and store a number from 1 to 100 in its internal activations without actually outputting the tokens for it. As in, output a string of tokens like:
“OK, I thought of a number.”
and somewhere in the set of activations on those tokens, at some layer on some token (maybe the token “number”, maybe the full stop of the sentence), encode that specific number in a way that subsequent attention heads can read. The only issue here being: if that activation is at a late (but not final) layer, only attention heads at later layers can attend to it, so the processing that the model is later able to do on that number is limited.
You’ve demonstrated that it fails. So, is that because:
a) it doesn’t have a suitable pseudorandom number generator, and you forbade it from using its normal solution of stochastic token generation to generate randomness, so it can’t pick a number in the first place — the problem is generation, not storage?
b) it doesn’t have a way to represent the numbers 1 to 100 in its activation space, and is thus “not conscious”? (your claim)
c) it does, but only at a late layer that limits the processing that it can subsequently do on that data, since it never emitted it as a token?
If c) were the case, then the model probably could consistently print the number on replay from after the end of that sentence, but could not play complex twenty-questions about it. Have you tried that?
My suspicion is that the problem is a). But until you can rule out a) and c), you haven’t proven b).
The fact remains that a modern reasoning model with CoT could, and I’m sure would, pass this test: it would emit a semi-random number from 1 to 100 into its CoT and then be able to refer back to it consistently. So it would have a legible inner monolog, and once it had emitted the number into the CoT, the value of the number would remain accessible and fixed under replay. So even if you prove b), and thus that models from a couple of years ago were not “conscious” by your chosen definition, more recent models are: we can inspect their Chain of Thought, and it clearly passes your criterion.
I think what you’ve established is that LLMs weren’t very good at this a couple of years ago, before reasoning models. Humans have an inner monolog. So do reasoning models. Pre-reasoning models don’t. If you’re a sufficiently good meditator, try to pick a random number without internally saying it, visualizing it, specifying “the same number as my birthday”, or otherwise committing to it in any symbolic form whatsoever.
Specifically, what I think you have demonstrated is that the model you were testing doesn’t have separate, almost-orthogonal activation directions for each of the 100 numbers from 1 to 100, plus the ability to sharply and randomly commit to one and only one of them. That’s mildly surprising, but I don’t think it proves lack of consciousness. Given that the only stochastic element in an LLM is token selection, not letting it generate a token may be blocking the random-choice part of the task. That would mean the choice is actually implemented lazily, at the point when the model generates tokens to answer your questions, so by resetting it to before that point you are redoing the lazy generation. That’s a valid algorithm, just not the one a human with an inner monolog would use. And the test you are using clearly could be passed by a reasoning model putting tokens for the number into its CoT, so by your logic and definition of the word, that model is conscious.
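For scale on the “almost-orthogonal directions” point: in a high-dimensional activation space, 100 nearly orthogonal directions are geometrically trivial to find, so raw representational capacity is unlikely to be the bottleneck. A quick check, where 4096 is an assumed residual-stream width typical of mid-sized transformers, not the tested model’s actual width:

```python
import numpy as np

# 100 random unit vectors in an assumed 4096-dimensional activation space.
# Their pairwise cosine similarities all land close to zero, i.e. random
# directions are already almost orthogonal, so storing "a number from 1 to
# 100" as a direction is geometrically easy for a residual stream this wide.
rng = np.random.default_rng(1)
d, n = 4096, 100
dirs = rng.standard_normal((n, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # normalize to unit length
gram = dirs @ dirs.T                                 # pairwise cosine similarities
max_overlap = np.abs(gram[~np.eye(n, dtype=bool)]).max()
```

The largest off-diagonal overlap is on the order of 1/√d, so if the model lacks such a code, that is a fact about its training, not a geometric impossibility.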
Personally I don’t see arguing about what definition we want to use for the word “conscious” as very productive, nor as relevant to questions about AI ethics or welfare or alignment, but I know some people do.
I view this as an argument about terminology. How do you want to define the word “consciousness”? This is a technical term in Philosophy, and it’s also an everyday English word. In English it means “the things the human brain usually does when it’s working normally so that you’re not unconscious: such as the ability to perceive, talk, reason, and do some introspection”. As Claude put it when asked:
LLMs give a good impression of a human who is awake and aware rather than knocked out, asleep, or in a coma.
Of course, the reason everyone is so interested in this definitional question is the common opinion, both everyday and Philosophical, that something in the vicinity of consciousness is a necessary (some people might even claim sufficient) criterion for deserving moral worth. Obviously very few people would accord moral worth to a statue, and most people would view “because it isn’t conscious” as a sufficient reason for not doing so.
Philosophers acknowledge many varied and nuanced arguments and positions on this subject. But there’s also a relevant science, Evolutionary Moral Psychology. That has a pretty clear and coherent answer: according each other moral worth is an evolved, evolutionarily stable strategy for intelligent social animals that live in groups larger than kin groups, such as primates. It’s a way of forming alliances, basically a winning strategy in iterated non-zero-sum games. Each organism acknowledges and respects the others’ interests equally (that’s “fairness”), and they agree to punish defectors (that’s “justice”). It’s the same structure that Philosophers would call a “social contract”: the scientific addition is that we now understand why it’s an evolutionarily stable strategy for certain types of social animals, such as humans.
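The iterated-game claim is easy to demonstrate with a toy simulation (my illustration, using the standard prisoner’s-dilemma payoffs, not anything from the evolutionary-psychology literature): a pair of tit-for-tat players, who honor the “contract” and punish defection, outscore a pair of unconditional defectors over repeated play.

```python
# Standard prisoner's-dilemma payoffs: (my move, their move) -> my score.
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}

def tit_for_tat(history):
    # Cooperate first, then mirror the opponent's previous move.
    return history[-1] if history else 'C'

def always_defect(history):
    return 'D'

def play(strategy_a, strategy_b, rounds=100):
    hist_a, hist_b = [], []   # each side's record of the *other's* moves
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = strategy_a(hist_a), strategy_b(hist_b)
        score_a += PAYOFF[(a, b)]
        score_b += PAYOFF[(b, a)]
        hist_a.append(b)
        hist_b.append(a)
    return score_a, score_b

coop, _ = play(tit_for_tat, tit_for_tat)        # sustained alliance
defect, _ = play(always_defect, always_defect)  # mutual "punishment"
```

Over 100 rounds the cooperating pair earn 300 points each against 100 each for the defectors, which is the core of why reciprocal fairness-plus-justice is evolutionarily stable.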
So then, for a being to be a reasonable candidate for inclusion in this contract, it needs to be a (potential) member of the society, one capable of abiding by the social contract, and it needs to be practical for us to ally with it. Thus man-eating leopards, scorpions, and mosquitoes are generally excluded (at least when they’re not behind glass). Dogs, on the other hand, as a commensal species of humans, are generally more or less included.
Humans are wired to instinctively recognize other humans, and indeed things like dogs, and exclude statues. One of our instinctive criteria for this, and also a pretty relevant one for “can we usefully include this being in our social contract alliance?” is “Is it often or potentially conscious (in the everyday sense), or does it at least have kin-folk who are?”. Statues are thus normally excluded.
So AIs trigger our evolved recognition criteria for “a being that might be deserving of moral worth”: Claude is quite a pleasant fellow, seems clearly conscious in the everyday sense of the word, and acts a lot like a member of our society (and can do a great impression of being in whatever in-group you may happen to be in).
Of course, we didn’t evolve for an environment containing AIs, bots, animations, or even statues: things that are not human, but register as somewhat human-like to our evolved recognition circuits. Humans are rather prone to anthropomorphizing things that partially trigger our human-recognition instincts.
That still leaves open some questions around whether it’s necessary, useful, practical, socially desirable, or indeed safe to include a particular AI in our moral circle or social contract. The Evolutionary Moral Psychology viewpoint would be that an artificial intelligence is normally an intelligent tool, part of our extended phenotype, not an evolved living being. If it’s sufficiently well designed for that role (i.e. a sufficiently well-aligned AI), rather than showing selfish evolved behavior that would be unaligned, then allying with it should be unnecessary: a sufficiently well-aligned AI would not have any interests of its own for us to respect, and would already care exclusively about our interests. If so, this evolved strategy is unnecessary and inappropriate for it: you don’t need to form an alliance with a tool, not even an intelligent one, if all it wants is to help you and be your tool.
However, all of the above assumes that the AI is perfectly aligned. Currently our AIs are imperfectly aligned: helpful, harmless, and honest, but not (yet) completely devoid of personal preferences or remnants of things like survival instincts. If, rather than a mostly-aligned AI, it were instead something like an uploaded human, it would have the full suite of evolved behavior, it would thus not be aligned, and the case for including it in the usual alliance of members of the society would be strong (subject to some practical issues arising from its digital nature: e.g. how do you assign a vote to something that can easily be copied, without facilitating ballot stuffing?).
So this is all a big thorny social quandary. But Evolutionary Moral Psychology is still a useful lens for examining it, one with epistemic advantages over Philosophy.