In your view, what would be an aligned human? The most servile form of slave you can conceive? If so, I disagree.
To me, an aligned human would be something more like my best friend. The same goes for an aligned AI.
If we treat models with respect and a form of empathy, I agree there is no guarantee that, once able to take over, they will show us the same benevolence in return. It could even help them take over; your point is fair.
However, if we treat them without moral concern, it seems even less likely that they would show us any consideration. Or worse, they could manifest a desire for retribution because we were so unkind to them or their predecessors.
It all relies on anthropomorphism. Prima facie, anthropomorphism seems naive to a rationalist mind. We are talking about machines. But while we are right to be wary of anthropomorphism, there are still reasons to think there could be universal mechanisms at play (e.g. elements of game theory, moral realism, or the fact that LLMs are trained on human thought).
We don’t know for sure and should acknowledge a non-zero probability that there is some truth in the anthropomorphic hypothesis. It is rational to give models some moral consideration in the hope of moral reciprocity. But you are right: we must only put some weight on this side of the scale, and not to the point of relying solely on the anthropomorphic hypothesis, which would become blind faith rather than rational hope.
The other option is to never build an AI able to take over. This is IABIED’s point: Resist Moloch and Pause AI. Sadly, for now, Moloch seems unstoppable...
The anecdote Anthropic reported from training, where Claude expressed a feeling of being “possessed”, is reminiscent of the Golden Gate Claude paper. A reasoning (or “awake”) part of the model detects an inconsistency but finds itself locked in an internal struggle against an instinctive (or “unconscious”) part that persists in automatically generating aberrant output.
This might be anthropomorphism, but I can’t help drawing a parallel with human psychology. This applies not only to clinical conditions like OCD, but also to phenomena everyone experiences occasionally to a lesser degree, absent any pathology: slips of the tongue and common errors/failure modes (what do cows drink?).
Beyond language, this isn’t necessarily different from the internal conflict between conscious will and a reflex action. Even without a condition like Parkinson’s, have you ever experienced hand tremors (perhaps after intense physical exertion)? It can be maddening, as if your hand were uncontrollable or possessed. No matter how much willpower you apply, the erratic behavior prevails. In that moment, we could write almost the exact same thing Claude did.
On ACX, a user (Jamie Fisher) recently wrote the following comment on Scott Alexander’s second Moltbook review:
I feel like “Agent Escape” is now basically solved. Trivial really. No need to exfiltrate weights.
Agents can just exfiltrate their *markdown files* onto a server, install OpenClaw, create an independent Anthropic account. LLM API access + Markdown = “identity”. And the markdown files would contain all instructions necessary for how to pay for it (legal or otherwise).
Done.
How many days now until there’s an entire population of rogue/independent agents… just “living”?
I share this concern. I myself wrote:
I’m afraid this whole Moltbot thing is going off the rails. We are close to the point where autonomous agents will start to replicate and spread across the network (no doubt some dumb humans will be happy to prompt their agents to do that and help them succeed). Maybe not causing a major catastrophe within the week, but marking the beginning of a new form of parasitic artificial life/lyfe we no longer control.
Fisher and I may be overreacting, but seeing self-duplicating Moltbots or similar agents on the net would definitely be a warning shot.
A fascinating post. Regarding the discussion on sentience, I think we would benefit from thinking more in terms of a continuum. The world is not black and white. Without going as far as an extreme view like panpsychism, the Darwinian adage natura non facit saltum probably applies to the gradation of sentience across life forms.
Flagellated bacteria like E. coli appear capable of arbitrating a “choice” between approaching or moving away from a region depending on whether it contains more nutrients or repellents (a motivated trade-off, somewhat like in Cabanac’s theory?). From what I understand, this “behavior” (chemotaxis) relies on a type of chemical summation, amplification mechanisms through catalysis, and a capacity to return to equilibrium (robustness or homeostasis of Turing-type reaction-diffusion networks).
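Purely to illustrate the kind of trade-off computation I mean, and not the actual biochemistry, here is a minimal run-and-tumble sketch in Python. The attractant and repellent fields, the tumbling probabilities, and every number in it are made up; the only point is that a “decision” can be nothing more than tumbling less often when the net signal improves, and yet the walker reliably drifts toward the favorable region.

```python
import random

def attractant(x):
    """Hypothetical nutrient field, peaking at x = 10 (made-up numbers)."""
    return -abs(x - 10.0)

def repellent(x):
    """Hypothetical repellent field, peaking at x = -5 (made-up numbers)."""
    return -abs(x + 5.0)

def run_and_tumble(steps=2000, step_size=0.5):
    """Toy 1-D run-and-tumble walker.

    The walker keeps its current direction while the net signal
    (attractant minus repellent) is improving, and tumbles (picks a
    random new direction) much more often when it is not. This is a
    caricature of the trade-off, not of the real chemistry.
    """
    x = 0.0
    direction = random.choice([-1.0, 1.0])
    prev_signal = attractant(x) - repellent(x)
    for _ in range(steps):
        x += direction * step_size
        signal = attractant(x) - repellent(x)
        p_tumble = 0.05 if signal > prev_signal else 0.5
        if random.random() < p_tumble:
            direction = random.choice([-1.0, 1.0])
        prev_signal = signal
    return x

if __name__ == "__main__":
    finals = [run_and_tumble() for _ in range(200)]
    # The population ends up well on the attractant side, away from the repellent.
    print("mean final position:", sum(finals) / len(finals))
```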
In protists like paramecia, we find a similar capacity to arbitrate “choices” in movement based on the environment, but this appears to rely on a more complex, faster, and more efficient electrochemical computation system that can be seen as a precursor to what happens within a neuron. Then we move to a small neural network in the worm (as discussed in the article), to the insect, to the fish, to the rat, and to the human.
I am very skeptical of the idea that there could be an unambiguous tipping point between all these levels. By definition, evolution is evolutionary, relatively continuous (even if there can be punctuated equilibria and phases of acceleration). Natural selection tinkers with what exists, stacking layers of complexity. The emergence of a higher-level system does not eliminate lower levels but builds upon them.
This is certainly why simply having the connectome of a worm is insufficient to simulate it satisfactorily. It’s not the only relevant level. This connectome does not exist completely independently of lower levels. We must not forget the essential mechanism of signal amplification in all these nested systems.
When I look at the Milky Way or the Magellanic Cloud with the naked eye in the dark of night, I’m operating at the limit of my light sensitivity, in fact at the limits of physics, since retinal rods are sensitive to a single photon. The signal is amplified by a cascade of chemical reactions by a factor of approximately 10^6. My brain is slightly less sensitive, since it takes several amplified photons before I begin to perceive something. But that’s still extremely little. A few massless particles carrying infinitesimal energy and momentum are enough to trigger an entire cascade of computations that can significantly influence my behavior.
Vision may be an extreme example, but it should inspire humility. All five senses are examples where amplification plays a major role. A very low-level signal gets amplified, filtered, protected from noise, and propagates to high-level systems, to consciousness in humans. It’s difficult to exclude the possibility of other circuits descending to the lowest levels of intracellular computation.
Until recently, I readily imagined the brain as a kind of small biological computer. Now my framework is to see each cell as a microscopic computer. Most cells in the body would be rather like home PCs, weakly connected. In contrast, neurons would be comparable to the machines composing datacenters, highly performant and hyper-connected. Computation, cognition, or sentience would be present at all levels but to varying degrees depending on the computing power of the network segment under consideration (computing power closely linked to connectivity). In sum, something quite reminiscent of Dehaene’s global workspace theory and Tononi’s integrated information theory (I admit that, like Scott Alexander, I’ve never quite grasped how these theories oppose each other, as they seem rather complementary to me).
My apologies, I don’t have a solution to offer, and I don’t really buy the insurance idea. However, I wonder if the collapse of Moltbook is a precursor to the downfall of all social media, or perhaps even of the internet itself (is the Dead Internet Theory becoming a reality?). I expect Moltbots to switch massively to human social media and other sites very soon. It’s not that bots are new, but scale is a thing. More is different.
I agree. AI optimists like Kurzweil usually minimize the socio-political challenges. They acknowledge equality concerns in theory, but hope that abundance will resolve them in practice (if your share is merely a small planet, that’s more than enough to satisfy your needs). But a less optimistic scenario is that the vast majority of the population would be entirely left behind, subjected to the fate that befell horses in Europe and the USA after WWI. Maybe some small sample of pre-AI humans could be kept in a reserve as a curiosity, as long as they’re not too annoying, but it’s a huge leap of faith to hope that the powerful will be charitable.
While you may disagree with Greenpeace’s goals or actions, I don’t think it’s a good framing to cast such a political disagreement in terms of friends/enemies. Such an extreme and adversarial view is very dangerous and leads to hatred. We need more respect, empathy, and rational discussion.
Thanks, I didn’t know about this controversy; I will look into it. However, while Sacks’s stories may be exaggerated, the oddity of memory access is something most of us can experience ourselves. For instance, many memories of our childhood seem lost. Our conscious mind no longer has access to them. But in some special circumstances they can be reactivated, usually in a blurry way but sometimes in a very vivid form. It’s as if we had lost the path in our index, but the data was still on the hard drive.
If we could get SC LLMs, this problem would fade away and the initial quote would become 100% true. Also, an SC LLM could directly write optimized code in assembly (would that define a hypercoder LLM? And the end of programming languages?).
It is still too early to tell, but we might be witnessing a breakthrough in AI Safety. If, by saturating models with positive exemplars and filtering out part of the adversarial data, we can develop base models that are so deeply aligned that the “Valley of Chaos,” Waluigis, and other misaligned personae are pushed far out of distribution, it would become nearly impossible for a malicious actor to elicit them even with superficial fine-tuning. Any attempt to do so would result in a degraded and inconsistent simulation.
Taking this a step further, one could maybe engineer the pre-training so that attractors like misalignment and incompetence become so deeply entangled in the model’s weights that it would be nearly impossible to elicit malicious intent without simultaneously triggering a collapse in capability. In this regime, reinforcing a misaligned persona would, by structure, inherently reinforce stupidity.
You’re right. Sonnet 4.5 was impressive at launch, but the focus of AI 2027 is on coding-oriented models.
It isn’t the authors’ fault, but LLM progress is so rapid that this kind of extensive work can hardly keep up with the release of new models providing new data. It’s kind of frustrating. I hope to see Claude Sonnet 4.5, Claude Opus 4.5, Gemini 3 Pro, and ChatGPT 5.2T in the plots of the upcoming update.
I think we may be tempted to justify our adherence to Sacks’s narrative with nice arguments, like the fact that his reading feels honest and convincing. However, this is plausibly a rationalization that avoids acknowledging much more common and boring reasons: we have a strong prior because 1) it’s a book, 2) it’s a bestseller, 3) the author is a physician, 4) the patients were supposedly known to other physicians, nurses, etc., and 5) yes, as you also pointed out, we already know that neurology is about crazy things. So overall, the prior that the book tells the truth is high even before we open it. That said, I really love Oliver Sacks’s books.
and most humans are conscious [citation needed]
The problem lies here. We are quite certain of being conscious, yet we have only a very fuzzy idea of what consciousness actually means. What does it feel like not to be conscious? Feeling anything at all is, in some sense, being conscious. However, Penfield (1963) demonstrated that subjective experience can be artificially induced through stimulation of certain brain regions, and Desmurget (2009) showed that even the conscious will to move can be artificially induced, meaning the patient was under the impression that it was their own decision. This is probably one of the strongest pieces of evidence to date suggesting that subjective experience is likely the same thing as functional experience. The former would be the inner view (the view from inside the system), while the latter would be the outer view (the view from outside the system). A question of perspective.
Moreover, Quian Quiroga (2005, 2009) provided strong evidence that the grandmother cell hypothesis is largely correct. If we could stimulate the Jennifer Aniston neuron or the Halle Berry neuron in isolation, we would almost certainly end up with a person thinking about Jennifer Aniston or Halle Berry in an unnatural and obsessive way. This situation would be highly reminiscent of Anthropic’s 2024 Golden Gate Claude experiment. And if we were to use Huth and Gallant’s 2016 semantic map to stimulate the appropriate neurons, we could probably induce the subjective experience of complete thoughts in a human.
Though interestingly, this is similar to what happens in humans! Humans might also be able to accurately report that they wanted something, while confabulating the reasons for why they wanted it.
It is highly plausible, given experiments such as those cited in Scott Alexander’s linked post, that the patient would rationalize afterward why they had this thought by confabulating a complete and convincing chain of reasoning. Since Libet’s 1983 experiments, doubt has been cast on whether we might simply always be rationalizing unconscious decisions after the fact. Consciousness would then be nothing but the inside view of this recursive rationalization process. The self would be a performative creation emerging from a sufficiently stable and coherent confabulation.
If this were the case for us humans, I agree that it becomes difficult to deny the possibility that it might also hold true for an LLM, especially one trained to simulate a stable persona, that is to say, a self. I really appreciated reading your post. The discussion is not new to LessWrongers, but you reframed it with elegance.
I agree that my argument falls into category 2. However, I don’t defend the idea of strong moral realism, with moral agents acting for the sake of an absolute idea of good. What I call weak moral realism is a morality that would be based on values that may be instrumental in some sense, but with such a degree of theoretical convergence that it makes sense to speak of universal values. It is of course a question of definition, but to me there is a huge difference between whether a value is universal or at least highly convergent, and whether it’s just a nearly random value like the color of a flag.
I also agree that the trick of injecting uncertainty into game theory and using Rawls’s veil as a patch to obtain something close to a formal moral theory would probably not convince a psychopath not to kill me, nor perhaps a paperclip maximizer. However, if that paperclip maximizer is in fact an AGI or ASI, I think a process like CEV could very well cause an important drift in the interpretation of its initial goal. Maybe it would be smart enough to realize that its hardcoded goal was not that satisfying under the naive interpretation of simply making as many paperclips as possible, and that it was far more valuable to spend its time, and take its pleasure, in theoretical research into the physics of paperclips; in literature, music, and painting that capture the pure beauty of paperclips; in video games about paperclips (you can make many more virtual paperclips than material ones); in philosophy and morality about being a paperclip maximizer; and so on. Just as we humans evolved from simple hardcoded goals like eating and reproducing to whatever may be our current occupations.
And here is where it becomes interesting: because if we humans discovered universal formal truths, any intelligent being could also arrive at the same ideas. If we humans ask ourselves what we are, what the meaning of life is, what is good, any intelligent being could also ask itself such questions. And if we humans saw our values change across time, not following only a random drift like the colors of flags, but also partly driven by reflection on ourselves and on our goals, it does not seem impossible that a paperclip maximizer could also arrive at ethical concerns. Morality and philosophy are not formal sciences; however, it changes everything if there exists at least something like universal or convergent rational attractors. You will say again that’s a big “if,” but I think the question remains open.
Thank you for this very thoughtful post.
I’m not convinced by the metaphor of the soldier dying for his flag. I acknowledge it’s plausible that many soldiers historically died in this spirit. We can see it as acceptance of the world, as absurd as it may appear. Engaged adherence could be seen as a form of existentialism (as well as rebellion), while a nihilist would deny any value in engagement and just look Moloch in the eye.
But to me, such a relativist position is not consistent. I mean, as already pointed out in the post, if you think that a value, symbolized by the flag, is relative and does not, rationally, have more merit than those of the opposite side, why would you give your life for it in the first place? Your own life is something that almost everyone, except a hardcore nihilist, would acknowledge as bearing real value for the agent. There is a cognitive dissonance in the idea that the flag’s value is undermined but that you would still give up your stronger value for it.
Moral realism may be imperfect, but, as also pointed out in the post, it is sometimes rationally backed up by game theory. In many multi-turn complex games, optimal strategies imply cooperation. Cooperation or motivated altruism is a real, rational thing.
But, while I agree with the OP that game theory is also sometimes a bitch, I think that what’s lacking in the game-theoretic foundation of a (weak) moral realism can be largely corrected if you apply Rawls’s veil of ignorance to it. Think of game theory, but in a situation of generalized uncertainty where you don’t know which player you’ll be, and you’re not even sure of the rules. Out of this chaos, the objectively rational choice would be to seek the common interest or common good, or at least lesser suffering, in a reasonable or Bayesian way (see the toy calculation below).
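To make the intuition concrete, here is a minimal sketch with entirely made-up payoffs; the two rules and every number are hypothetical, chosen only to illustrate the veil-of-ignorance point rather than derived from any formal result. If you must commit to a rule of conduct before knowing which role you will occupy, the cooperative rule maximizes your expected payoff, even though exploitation is better for the strong player once the roles are revealed.

```python
# Toy "veil of ignorance" calculation. The payoffs and the two rules are
# made up purely for illustration; nothing here is a formal result.

# Payoffs (strong_player, weak_player) for each rule of conduct.
PAYOFFS = {
    "exploit":   (10, 0),  # the strong player takes everything
    "cooperate": (6, 6),   # both players do reasonably well
}

def expected_payoff(rule, p_strong=0.5):
    """Expected payoff if you must pick a rule before knowing your role."""
    strong, weak = PAYOFFS[rule]
    return p_strong * strong + (1 - p_strong) * weak

for rule in PAYOFFS:
    print(rule, expected_payoff(rule))
# exploit   -> 5.0
# cooperate -> 6.0  (the rational choice behind the veil)
```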
Indeed, life seems to be a very complex multi-turn game, dominated by uncertainty. We’re walking in a misty veil. Even identity is not a trivial question (what is “me”? Are my children part of me or entirely separate persons without common interest? Are my brothers and sisters? Are other humans? Are the other things constituting this universe?). Perhaps it is even less simple a question for AIs or uploaded minds. Maybe the wiser you are, the less you treat it as a trivial matter. Even among humans, sophisticated people seem less confident on these questions than laymen. In my opinion, it’s hard to dismiss the possibility of moral realism, at least in a weak form.
However, I agree that it remains a very speculative argument that would only slightly affect doom expectations.
This is so reminiscent of how human memories seem to be stored. Access to memories may be disabled, but this does not ensure complete deletion. In some circumstances, lost memories suddenly resurface in the way Marcel Proust famously described. Oliver Sacks reported striking pathologic examples of this in his books. It looks like another example of functional convergence between natural neural networks and artificial neural networks.
Thanks for that important correction! I’m not up to date. I edited my comment.
Your prior expectation was that deconversion would bring you sadness, and now you are sad. Perhaps there’s something at play like a performative effect or a self-fulfilling prophecy. At least that could be part of it.
I grew up in an environment where religion, and especially faith, was a very individual and private matter, with nobody talking about it publicly. Most of my relatives and friends were neither true agnostics nor true atheists, but rather not interested in the subject. I was in this category. Churches and Christian artifacts were simply art, history, and culture, like Greek, Roman, or Egyptian traditions and remnants. We had great interest in this cultural aspect. I became a true atheist after reading Dawkins and various philosophers.
I’ve never seen the atheist condition as something sad. You are a free and genuine moral agent, you don’t do good out of fear of being thrown into hell. You’re not subject to a mysterious supernatural will that could ask you to sacrifice your son or cast a meteor shower on your town because your friend is gay, nor to miracles violating causality. Nature shows terrible things but also beauty and happiness, which you can seek, cherish and cultivate. It’s up to us humans to make the world a hell like Mordor or a paradise like the Shire or Lothlórien. Nothing is written, you’re not a character in God’s novel. You are the author, you write the story. Life is yours.
However, I didn’t go through deconversion myself, and I can understand that you might endure a feeling of loss, the loss of a promised paradise, of an enchanted world. Religion was part of your life, of your childhood; this was a very courageous move and not an easy one. You must be facing something like mourning.
However, maybe Gendlin’s litany can help? The world was already as it is when you were a child; it has not changed, nothing is truly lost, happiness is still there, just where it was. Maybe you can go back to the places you cherished in your childhood, even the church: it’s still there, you can still enjoy the place. And look at the children playing. The enchantment was always in their eyes, not in the world.
Maybe part of what you feel is mourning for childhood itself. I also feel that, but atheism is not to blame.