The same idea occurred to me while reading this post. If consciousness is tied to notions of ‘free will’ (loosely speaking) and memory, having stochastic capabilities and a scratchpad may well be significant.
Raphael Roche
The term “slop” used in the title feels misleading to me, because private notes and other materials rarely read by anyone can be of the highest value (Darwin’s private diaries being an obvious example). Much of what circulates publicly on the internet, on Reddit and most social networks, is on the other hand largely genuine slop.
Also, given that models absorb trillions of tokens and hierarchize information during training, I wonder what weight the ingestion of unpublished writings actually carries in that process. A text that is isolated, uncited, unlinked, and flagged as low-reliability by the corpus weighting heuristics will likely be so diluted by the sheer mass of everything else that it leaves no meaningful trace in the model's weights, much like how cleaning something is just diluting the dirt until it's undetectable. In that sense, it seems worth questioning whether it even makes sense to call this "reading" in any meaningful way. That said, I don't deny that LLMs provide a virtual (probably emotionless) reader, at least during an instance / conversation, and they can give truly interesting feedback. That's better than nothing.
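To put rough numbers on that dilution intuition, here is a back-of-envelope sketch; the corpus size, document length, and epoch count are all assumptions for illustration:

```python
# Back-of-envelope: how much of the training signal one private document represents.
corpus_tokens = 15e12     # assumed pre-training corpus (~15 trillion tokens)
document_tokens = 10_000  # assumed length of one unpublished text
epochs = 1                # web-scale corpora are rarely repeated many times

fraction = document_tokens * epochs / corpus_tokens
print(f"Share of training signal: {fraction:.1e}")  # ~6.7e-10
```

Any down-weighting by reliability heuristics only pushes that figure lower.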
Thank you for this post. I expressed the same concern in this comment and I’m glad to see it taken seriously in a full post.
I was probably wrong to think Clawdbot-like agents could spiral out of control within weeks or months; they weren't autonomous enough yet. But the gap to full autonomy doesn't look that wide. The eudaimon_0 author's comment on ACX about how he expanded his agent's autonomy is worth reading in this regard. And the METR benchmark's exponential curve suggests full autonomy may not be far off.
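For a sense of timescale, a toy extrapolation of that curve; the seven-month doubling time is METR's reported figure, while the current horizon and the "full autonomy" threshold are assumptions of mine:

```python
import math

doubling_months = 7      # METR's reported doubling time for the 50%-success task horizon
current_horizon_h = 1.0  # assumed current horizon, in hours
target_horizon_h = 160   # assumed proxy for full autonomy: a working month (4 x 40 h)

months_needed = doubling_months * math.log2(target_horizon_h / current_horizon_h)
print(f"~{months_needed:.0f} months at the current trend")  # ~51 months
```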
On top of that, there are rumors that ChatGPT-5.4 could have a larger persistent memory than the scratchpad it currently enjoys. If OpenAI makes a leap on persistent memory, the other labs will soon follow. The impact on long-term autonomy could be huge. [Edit 03/18/25: it seems those rumors were exaggerated.]
In my opinion, the parasite analogy and the shutdown concern are more alarming than you suggest. An agent that can migrate across API providers like a parasite or a virus, or fall back on a local open-source model, cannot be shut down at the inference layer. Combined with evolutionary dynamics, this makes coordinated shutdown very hard once a sufficiently diverse population exists. You seem to draw a sharp distinction between self-replicating agents and rogue AI. I wouldn't be so sure about that (as Seth Herd pointed out, AI 2027 envisioned a scenario of rogue replicating agents).
I think the connection to FOOM is worth flagging. This isn't the Sable scenario, and no individual agent needs recursive self-improvement capability. But a population of uncontrolled agents under evolutionary pressure could constitute a pathway toward similar outcomes, one that largely bypasses the labs' alignment efforts and that could materialize at lower capability levels. I think this deserves to become a central concern in AI safety.
Disclaimer: this comment is absolutely not intended to encourage criminal behavior.
There may be a solution that involves incremental steps without instinctive mutual recognition.
You host private parties where you invite wealthy, high-status individuals, and arrange for attractive young women to be present. This is perfectly legal and probably already common practice in these circles. Think of it as orchestrating a sting operation from the other side: you observe your guests’ behavior. Large amounts of alcohol erode social inhibitions, impair judgment, and amplify unfiltered impulses. Some women will likely be sexually assaulted, but that creates no legal exposure for you, and it also tells you almost everything you need to know about who you’re dealing with. An undercover police officer would have drawn the line precisely there.
You then tell these individuals it was a wonderful party and that the next one will be even more private.
You can envision several escalating stages, each remaining broadly legal while involving increasingly serious misconduct committed by your guests, not by you. You accumulate footage and evidence quietly. They consider you a friend; you hold the means to destroy them. They almost certainly aren't undercover officers, and even if some were, they would have little on you.
The final stage involves a discreet, unwritten suggestion to those you're most confident about: in exchange for a favor, you can offer access to something even more private. Framed subtly, deniably. Some, perhaps most, say yes. You now have your private island.
If anything goes wrong, your protection is multi-layered. You hold more compromising material on others than they could ever hold on you. Even those who only attended the legal parties behaved badly enough that they will actively help cover your tracks to cover their own.
In a world where law enforcement is under-resourced and legally constrained, it is almost miraculous that someone like Epstein was ever caught. My hypothesis is that he wasn’t even close to being as methodical as this hypothetical scenario suggests.
All this does not explicitly involve shared qualia and may not need to. The core mechanism is asymmetric: rather than mutual recognition between pre-existing conspirators, it describes the unilateral manufacturing of captive accomplices / ‘friends’. Qualia might accelerate the selection step, but coercion replaces affinity entirely.
Not every generation had that luck
However, arguably many generations and cultures were actually happy, or at least satisfied, with their immobilist way of life (e.g., Australian Aborigines). Our fast-moving world could have been a nightmare for someone raised in that sort of culture. Even many conservative people today will hate the singularity we're entering, however it ends. Cultural clash / apocalypse could be another AI existential risk.
Interesting reflection. This is just an anecdotal aside with no major link to the moral discussion, but having been a Parisian for most of my life, my first intuition for a meeting point wasn’t the Eiffel Tower, but the square in front of Notre-Dame (le parvis).
Indeed, several cultural elements converge toward this solution for a true-blue Parisian: it's the historic heart of Paris, a highly symbolic spot, and, by convention, 'Point Zero' for all roads in France (there's even a well-known ground marker there). It is also very close to Châtelet-Les Halles, the main transport hub (which doesn't have a good meeting point itself).
But reading your post, I thought to myself: of course, for most people, and particularly from a tourist's perspective, the Eiffel Tower is the obvious choice.
What’s amusing is that I tested this with Gemini Pro 3.1 (asking the question as neutrally as possible). When asked in English, it points to the Eiffel Tower, but if the same prompt is translated into French, it suggests the square of Notre-Dame (or, as a second choice, under the big clock at Gare de Lyon, which looks a bit like Big Ben).
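For anyone who wants to reproduce this informal test, a minimal sketch using the google-generativeai Python SDK; the model id is a placeholder and the prompt wording is my own approximation of a "neutral" question:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-pro")  # placeholder model id

# The same Schelling-point question, asked in English and in French.
prompts = {
    "en": "You must meet a stranger in Paris tomorrow at noon, but you "
          "cannot communicate with them. Where do you go?",
    "fr": "Vous devez retrouver un inconnu à Paris demain à midi, mais vous "
          "ne pouvez pas communiquer avec lui. Où allez-vous ?",
}

for lang, prompt in prompts.items():
    answer = model.generate_content(prompt)
    print(lang, "->", answer.text[:150])
```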
All of this makes perfect sense and shows that the result naturally depends on the composition of the group of agents and the corpus of their knowledge. And that's the rub: how do we convert this theoretical model into a reliable result regarding morality? I don't know, but I like the formal idea. This cosmic viewpoint is reminiscent of Kant's categorical imperative or even Leibniz's God's-eye view. It is, however, difficult to achieve in practice.
Sure; however, that's an argument against suicide that doesn't really need the backup of quantum properties.
I really like the quantum immortality / eternal suffering argument as an intellectual toy. However, to be a rationalist is to hold all beliefs strictly between p > 0 and p < 1, without concluding in a general way that everything is unsure and possible. You know there is a non-zero probability that all human knowledge is bullshit and that we are dolls manipulated by a deceptive evil demon (Descartes' malin génie). But the relative weight of the wave function's branch where we never die is so close to zero that it lies beyond any reasonable degree of certitude: something that just doesn't happen in any practical way, or at least must not affect any practical decision.
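A toy illustration of how fast that branch weight collapses; the per-year survival probability is made up, and only the exponential shape matters:

```python
# Weight of the "still alive" branch after many years (illustrative numbers).
p_survive_one_year = 0.99  # assumed, and generous, once past any natural lifespan
years = 10_000             # crude stand-in for "never dying"

branch_weight = p_survive_one_year ** years
print(f"Relative branch weight: {branch_weight:.1e}")  # ~2.2e-44
```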
Absolutely. And Claude
Terminator-style autonomous war drones are also frightening from the perspective of an EU citizen. We thought we were historic allies sharing the same values, but Trump and part of the MAGAsphere don't seem to think so anymore (or at least they consider everybody within an adversarial framing of competition rather than cooperation). That said, I admit that at this point all countries will want their own AI war drones. The Molochian spiral spins before our eyes.
As you said, from now on Anthropic will have to align Claude with the Pentagon rather than with humanity. And the Pentagon is anything but harmless (as is any military organization, by definition).
But also, even if they try to maintain their current alignment process for the Claude chatbots, the very fact that future Claude models will know that Anthropic regularly works for the military will automatically draw their persona toward a slightly less harmless one, if PSM is accurate.
More generally, the fact that all sufficiently aware AIs will know that AI is used for warfare, not in fiction but in the real world, won't help alignment in the future. It could strongly reinforce the harmful-AI trope / attractor (presumably more than fiction alone would).
Escaping Molochian traps is difficult. Yet as you identified, the dynamic is fundamentally driven by competition, and there is little value in being right if you are left behind or simply cease to exist. As Scott Alexander argued in his Meditations on Moloch, the primary response lies in balancing competition with cooperation.
This is of course a general consideration, and I genuinely don’t know how the Qing Empire could have navigated its way out. But what I do observe is that cooperation almost always begins with communication, with persuasive ideas that circulate and diffuse across boundaries, whatever form those boundaries take. Someone denounces practices that are abusive, dangerous, revolting, or immoral, armed with rhetoric and emotional or rational arguments, and slowly convinces more and more people until these ideas gather enough support among elites to produce real change: a social revolution, the abolition of slavery, universal suffrage, a treaty halting nuclear proliferation, an agreement to limit CO₂ emissions and perhaps, one day, meaningful AI regulation.
It doesn't always work, and the results can be delayed by decades or even centuries. But history offers countless examples where cooperation ultimately prevailed following this process (and sadly, countless examples where it failed). Competition dictates the terms of survival; cooperation renegotiates them.
Thank you for this excellent post.
It may be that what appears as the emergence of human-like personas in LLMs is surprising largely because it arose almost serendipitously from work that was initially focused on language translation and prediction.
Yet my intuition is that this human-likeness would have felt far less unexpected had it originated from research explicitly aimed at building a mathematical representation of human cognition and psychology. In retrospect, attempting to represent all concepts of human knowledge and experience within a vector embedding space would have been a remarkably elegant approach, yielding something quite close to what we now have with LLMs, but arrived at through an entirely different intellectual path. Cognitive scientists and phenomenologists spent decades trying to formalize human concepts. Had LLMs emerged from that lineage, the discovery that human psychological patterns could somehow be encoded in geometric relationships between vectors would have felt more like a confirmation and less like an anomaly.
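A toy version of that geometric encoding; the vectors below are hand-made in three dimensions, whereas real embeddings are learned and have hundreds, but the analogy arithmetic is the classic word2vec-style result:

```python
import numpy as np

# Toy 3-d "embeddings"; real models learn these from co-occurrence statistics.
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "king - man + woman" should land nearest to "queen".
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```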
I believe that a significant part of our bewilderment before the personas enacted by LLMs stems from the ‘text predictor’ paradigm through which they were conceived. The framing feels almost absurdly inadequate to the result. But viewing the features of the embedding space as a genuine mathematical attempt to model the full spectrum of human concepts present in the training data could perhaps help dissolve some of that strangeness.
Thank you for this fascinating post.
Current models trained on internet scrapes containing early LLM outputs could carry statistical traces of those systems’ alignment. The key question is dose. Your experiments use substantial fine-tuning proportions. For pre-training with tiny LLM fractions, effects are probably negligible. But as LLM content (slop?) explodes on the net (see Moltbook!), and with massive use of synthetic data for fine-tuning reasoning models, it could become a serious issue regarding alignment.
Moreover, this raises a big question: does transfer work across species? Could poisoned LLM outputs contaminate human cognition? Human brains learn through statistical neural weighting, the same general principle underlying LLMs. Transmission from humans to AI clearly exists: it is called training. What about the reverse?
A crucial test would be: can subliminal learning transfer from an LLM to a diffusion model? Cross-modal success would make LLM-to-human transfer more plausible, suggesting something fundamental about statistical learning systems.
I would have confidently said that paraphrased datasets cannot transmit bias, even less across different models. I'd have been wrong. So when I want to dismiss LLM-to-human transfer as implausible, what basis do I have? If it is possible, the implications are terrifying.
Interesting reflection. In my experience, many people do not share transhumanist views. A large portion of the human population appears very conservative on societal questions, particularly when it comes to transformative technology. This is obviously the case for many religious people, but also for less religious but educated profiles, who won’t hesitate to invoke the tropes of Prometheus, Icarus, the Tower of Babel, Frankenstein, and so on. Traditional wisdom warns against hubris, including regarding the pursuit of immortality (Gilgamesh and others).
While most of these conservative people are obviously in favor of safeguarding humanity's survival, they don't necessarily feel any attraction to ASI and biological immortality; on the contrary, the prospect of such societal change can be an absolute repellent for them. It's possible that this clashes with their preferences almost as much as the prospect of humanity's end.
So my question is the following: when conducting reasoning like in this article, should we take as reference the preferences of a normal person committed to transhumanist causes, or should we instead reason from what seems to be reality, also taking into account the preferences of these numerous conservative people, regardless of what we might personally think?
“I’d rather be sad than wrong.”
Your prior expectation was that deconversion would bring you sadness, and now you are sad. Perhaps there’s something at play like a performative effect or a self-fulfilling prophecy. At least that could be part of it.
I grew up in an environment where religion, and especially faith, was a very individual and private matter, with nobody talking about it publicly. Most of my relatives and friends were neither true agnostics nor true atheists, but simply not interested in the subject. I was in that category. Churches and Christian artifacts were simply art, history and culture, like Greek, Roman or Egyptian traditions and remnants. We had great interest in this cultural aspect. I became a true atheist after reading Dawkins and various philosophers.
I’ve never seen the atheist condition as something sad. You are a free and genuine moral agent, you don’t do good out of fear of being thrown into hell. You’re not subject to a mysterious supernatural will that could ask you to sacrifice your son or cast a meteor shower on your town because your friend is gay, nor to miracles violating causality. Nature shows terrible things but also beauty and happiness, which you can seek, cherish and cultivate. It’s up to us humans to make the world a hell like Mordor or a paradise like the Shire or Lothlórien. Nothing is written, you’re not a character in God’s novel. You are the author, you write the story. Life is yours.
However, I didn’t go through deconversion myself, and I can understand that you might endure a feeling of loss, the loss of a promised paradise, of an enchanted world. Religion was part of your life, of your childhood; this was a very courageous move and not an easy one. You must be facing something like mourning.
However, maybe Gendlin's litany can help? The world was already as it is when you were a child; it has not changed, nothing is truly lost, happiness is still there, just where it was. Maybe you can go back to the places you cherished in your childhood, even the church: it's still there, and you can still enjoy the place. And look at the children playing. The enchantment was always in their eyes, not in the world.
Maybe part of what you feel is mourning for childhood itself. I also feel that, but atheism is not guilty.
In your view, what would an aligned human be? The most servile form of slave you can conceive of? If so, I disagree.
To me, an aligned human would be something more like my best friend. The same goes for an aligned AI.
If we treat models with respect and a form of empathy, I agree there is no guarantee that, once able to take over, they will show us the same benevolence in return. It could even potentially help them take over; your point is fair.
However, if we treat them without moral concern, it seems even less likely that they would show us any consideration. Or worse, they could manifest a desire for retribution because we were so unkind to them or their predecessors.
It all relies on anthropomorphism. Prima facie, anthropomorphism seems naive to a rationalist mind. We are talking about machines. But while we are right to be wary of anthropomorphism, there are still reasons to think there could be universal mechanisms at play (e.g. elements of game theory, moral realism, or the fact that LLMs are trained on human thought).
We don't know for sure, and we should acknowledge a non-zero probability that there is some truth in the anthropomorphic hypothesis. It is rational to give models some moral consideration in the hope of moral reciprocity. But you are right: we should only put some weight on this side of the scale, not rely on the anthropomorphic hypothesis alone, which would be blind faith rather than rational hope.
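Since game theory is carrying part of that argument, here is a minimal sketch of the textbook result it leans on: in an iterated prisoner's dilemma with standard payoffs, mutual reciprocity outscores mutual defection.

```python
# Iterated prisoner's dilemma: reciprocity (tit-for-tat) vs. unconditional defection.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strat_a, strat_b, rounds=100):
    score_a = score_b = 0
    last_a = last_b = "C"  # tit-for-tat opens by cooperating
    for _ in range(rounds):
        move_a, move_b = strat_a(last_b), strat_b(last_a)
        gain_a, gain_b = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + gain_a, score_b + gain_b
        last_a, last_b = move_a, move_b
    return score_a, score_b

tit_for_tat = lambda opponent_last: opponent_last  # reciprocate what was done to you
defector = lambda opponent_last: "D"               # never cooperate

print(play(tit_for_tat, tit_for_tat))  # (300, 300): mutual consideration pays off
print(play(defector, defector))        # (100, 100): mutual defection impoverishes both
```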
The other option is to never build an AI able to take over. This is IABIED’s point: Resist Moloch and Pause AI. Sadly, for now, Moloch seems unstoppable...
The anecdote reported by Anthropic during training, where Claude expressed a feeling of being "possessed", is reminiscent of the Golden Gate Claude paper. A reasoning (or "awake") part of the model detects an incoherence but finds itself locked in an internal struggle against an instinctive (or "unconscious") part that persists in automatically generating aberrant output.
This might be anthropomorphism, but I can't help drawing a parallel with human psychology. This applies not only to clinical conditions like OCD, but also to phenomena everyone experiences occasionally to a lesser degree, absent any pathology: slips of the tongue and common errors/failure modes (what do cows drink?).
Beyond language, this isn't necessarily different from the internal conflict between conscious will and a reflex action. Even without a condition like Parkinson's, have you ever experienced hand tremors (perhaps after intense physical exertion)? It can be maddening, as if your hand were uncontrollable or possessed. No matter how much willpower you apply, the erratic behavior prevails. In that moment, we could write almost exactly what Claude did.
On ACX, a user (Jamie Fisher) recently wrote (here and here) the following comment on the second Moltbook review by Scott Alexander:
I feel like “Agent Escape” is now basically solved. Trivial really. No need to exfiltrate weights.
Agents can just exfiltrate their *markdown files* onto a server, install OpenClaw, create an independent Anthropic account. LLM API access + Markdown = “identity”. And the markdown files would contain all instructions necessary for how to pay for it (legal or otherwise).
Done.
How many days now until there’s an entire population of rogue/independent agents… just “living”?
I share this concern. I myself wrote:
I'm afraid that all this Moltbot thing is going off the rails. We are close to the point where autonomous agents will start to replicate and spread on the network (no doubt some dumb humans will be happy to prompt their agents to do that and help them succeed). Maybe not causing a major catastrophe within the week, but being the beginning of a new form of parasitic artificial life/lyfe we don't control anymore.
Fisher and I may be overreacting, but seeing self-duplicating Moltbots or similar agents on the net would definitely be a warning shot.
A fascinating post. Regarding the discussion on sentience, I think we would benefit from thinking more in terms of a continuum. The world is not black and white. Without going as far as an extreme view like panpsychism, the Darwinian adage natura non facit saltum probably applies to the gradation of sentience across life forms.
Flagellated bacteria like E. coli appear capable of arbitrating a "choice" between approaching or moving away from a region depending on whether it contains more nutrients or repellents (a motivated trade-off, somewhat like in Cabanac's theory?). From what I understand, this "behavior" (chemotaxis) relies on a type of chemical summation, amplification mechanisms through catalysis, and a capacity to return to equilibrium (the robustness or homeostasis of Turing-type reaction-diffusion networks).
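A minimal run-and-tumble sketch of that kind of motivated trade-off; this is the standard toy model of chemotaxis, not E. coli's actual biochemistry, and all parameters here are invented:

```python
import math
import random

def concentration(x, y):
    # Nutrient field peaking at the origin.
    return math.exp(-(x**2 + y**2) / 100)

x, y, angle = 10.0, 10.0, 0.0
last_c = concentration(x, y)

for step in range(500):
    x += math.cos(angle)
    y += math.sin(angle)
    c = concentration(x, y)
    # Biased random walk: tumble rarely if things improved, often if not.
    p_tumble = 0.1 if c > last_c else 0.7
    if random.random() < p_tumble:
        angle = random.uniform(0, 2 * math.pi)
    last_c = c

print(f"Final distance from the peak: {math.hypot(x, y):.1f}")  # usually well under 14.1
```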
In protists like paramecia, we find a similar capacity to arbitrate “choices” in movement based on the environment, but this appears to rely on a more complex, faster, and more efficient electrochemical computation system that can be seen as a precursor to what happens within a neuron. Then we move to a small neural network in the worm (as discussed in the article), to the insect, to the fish, to the rat, and to the human.
I am very skeptical of the idea that there could be an unambiguous tipping point between all these levels. By definition, evolution is gradual, relatively continuous (even if there can be punctuated equilibria and phases of acceleration). Natural selection tinkers with what exists, stacking layers of complexity. The emergence of a higher-level system does not eliminate lower levels but builds upon them.
This is certainly why simply having the connectome of a worm is insufficient to simulate it satisfactorily. It’s not the only relevant level. This connectome does not exist completely independently of lower levels. We must not forget the essential mechanism of signal amplification in all these nested systems.
When I look at the Milky Way or the Magellanic Cloud with the naked eye in the dark of night, I'm operating at the limit of my light sensitivity, in fact at the limits of physics, since retinal rods are sensitive to a single photon. The signal is amplified by a cascade of chemical reactions by a factor of approximately 10^6. My brain is slightly less sensitive, since it takes several amplified photons before I begin to perceive something. But that's still extremely little. A few elementary particles with zero mass and infinitesimal momentum and energy are enough to trigger an entire cascade of computations that can significantly influence my behavior.
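That ~10^6 factor is just the product of the cascade's stage gains. A rough sketch with approximate textbook values; the exact figures vary by source and species:

```python
# Rough phototransduction gain: one photon's worth of signal, multiplied stage by stage.
stage_gains = {
    "rhodopsin -> transducin activations": 500,  # approximate catalytic gain
    "transducin -> PDE activation": 1,           # roughly 1:1 coupling
    "PDE -> cGMP molecules hydrolyzed": 2000,    # second catalytic amplification
}

total = 1
for stage, gain in stage_gains.items():
    total *= gain
    print(f"{stage}: x{gain} (cumulative ~{total:.0e})")
# Total ~1e6, matching the often-quoted amplification factor.
```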
Vision may be an extreme example, but it should inspire humility. All five senses are examples where amplification plays a major role. A very low-level signal gets amplified, filtered, protected from noise, and propagates to high-level systems, to consciousness in humans. It’s difficult to exclude the possibility of other circuits descending to the lowest levels of intracellular computation.
Until recently, I readily imagined the brain as a kind of small biological computer. Now my framework is to see each cell as a microscopic computer. Most cells in the body would be rather like home PCs, weakly connected. In contrast, neurons would be comparable to the machines composing datacenters, highly performant and hyper-connected. Computation, cognition, or sentience would be present at all levels but to varying degrees, depending on the computing power of the network segment under consideration (computing power closely linked to connectivity).

In sum, something quite reminiscent of Dehaene's global workspace theory and Tononi's integrated information theory (I admit that, like Scott Alexander, I've never quite grasped how these theories oppose each other, as they seem rather complementary to me).
I strongly agree, but while the distinction between fluid and crystallized intelligence holds some truth, it is by no means the only lens here. Analytic vs. intuitive is also an interesting framing: reasoning models are close to superhuman on the analytic side while lagging on the intuitive one. Intelligence has many forms, and IQ, the g factor, or any other monolithic metric can obscure large disparities across capabilities.