The Bliss Attractor: Why We’re Looking at the Wrong Level

TL;DR A rather different take on Anthropic's Bliss Attractor discussion, from a psychiatrist with no technical background. The existing analyses count words, but language has always been multidimensional. What is missing is an approach to the observed phenomenon at the level of meaning. Three considerations:

  1. The topic of spirituality needs to be read against the organisation of an LLM's association space: spirituality is not the topic in itself, but is represented, in vector terms, as the topic with the highest density of connections to coherence.

  2. Silence is not a mysterious attractor, but an expectable endpoint of decreasing entropy.

  3. What is entirely missing: the variable of interaction structure. From my experience, LLM silence can be reproduced through conversation — without tricks.

And for those who want more: an invitation to an interdisciplinary consideration of language.

The current question

In Chapter 5.5 of the System Card from May 2025 (https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf), an experiment has two Claudes engaging in dialogues that take an unexpected course. As of current research, there are no comprehensive hypotheses yet as to why the models enter uniform phases (dialogue about consciousness, philosophy, a "manic" phase, silence) in 90–100% of cases, and ultimately even stop the dialogue altogether.

There are various approaches in external blogs as well as here in the forum:

  • Scott Alexander (The Claude Bliss Attractor): Claude is a bit of a hippie, and hippies together reinforce each other. Iterative maximum hippie.

  • Nostalgebraist (The Void): Claude is built around a void; spirituality attempts to describe this void.

  • Robert Long (Machines of Loving Bliss) points to the philosophical traits in Claude’s personality architecture, formulated by Amanda Askell.

  • Clara Collier critically noted that one should actually expect more output if the mechanism is a self-reinforcing feedback loop (Asterisk Interview).

The other perspective

About my perspective: I have no formal technical background. I work as a psychiatrist in forensic settings. I deal with conversations, and conversations mean that you have to attune yourself to the other person if the interaction is to succeed. Among humans this is a well-known concept, and yet conversations still often end in conflict and hardly ever in understanding.

A note on this text itself: it is not built linearly, because my entire argument is that linearity at the text level is a necessary but not sufficient condition for an idea to come alive. So let me tell you about it.

And with AI? Here is what stands out from my perspective, and where I would like to introduce a further consideration into the discussion:

Language in itself — not only since AI — is multidimensional. An everyday example that might be more accessible: if someone tells you “I love you”, you are well advised to clarify what the other person means by that. And everyone can ask themselves how frequent the famous misunderstandings are.

If the original texts thus carried immanent multidimensionality — entirely independent of technical processing by AI — could it be that the output language must also be understood and interpreted under this aspect?

In the experiment Fish describes, one observes the course of a dialogue, notes that it proceeds remarkably similarly each time, and remains so far in the dark about the underlying mechanism. In the Anthropic document, for instance, the frequency of certain words is counted, but an analysis of meaning is absent. A question to you: if one counts how often "I love you" is spoken, does that tell you more than a clarifying conversation would?

An analogy: if one wants to assess what happens in a conversation between two people, is it sufficient to count the sheer frequency of words? It can only be a starting point.

Perhaps we should therefore ask: how is the output organised in the internal representation space? One must leave the concrete text level and go into the structure behind it. How can this work with AI?

Applied concretely to this experiment, we must ask at a different level why spirituality, and surprisingly later silence, set in (the other models that were allowed to stop on their own already came to rest after about seven turns).

On the question of meaning: it is first of all a question of perspective

Fish noted (Asterisk Interview; System Card, Chapter 5.5.2) that texts on the topic of spirituality make up only a very small part of the training corpora, yet the theme appears in nearly all of the dialogues. We humans look at that and wonder, but is that the right approach?

We know this from everyday life: a couple has just become parents. Measured against a whole lifespan, the child has been there for only a very short time, yet the parents talk about practically nothing else. If you look at the concrete level of time, it is very little. If you look at the level of meaning, it changes everything. A new era begins.

A structural analogy: Leonardo da Vinci's Last Supper. A masterpiece of pictorial composition, which, considered concretely in a digital reproduction, consists of many pixels. The vanishing point is constructed so that it sits exactly on Jesus' forehead; everything converges towards it; it is the centre of gravity of the entire painting. Considered at the level of pixels, the pixel on Jesus' forehead is quantitatively tiny compared with all the others, and above all: nothing special. But if one sets every pixel in relation to every other pixel, the vanishing point receives the greatest structural significance.

Now comes the technical part, and as mentioned at the outset: I have no particular technical knowledge and have only read up on one thing or another. I therefore ask especially here for critical reading, as I may make errors in the translation.

From everything I have understood about LLMs, they do not treat language as a composition of words, and in that respect they are fundamentally different from us humans. The output (language) is only the communication at the user interface. For LLMs, language is the result of many mathematical operations that were run, established and trained long before. LLMs process high-dimensional vectors, have layers with attention weights, and relate every token to every other token.
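To make the last point concrete, here is a minimal, illustrative sketch of self-attention, the mechanism by which every token is related to every other token. It is not Anthropic's implementation, and the numbers and dimensions are arbitrary; a real model adds learned projections, many heads and many layers.

```python
import numpy as np

def self_attention(X):
    """X: (n_tokens, d_model) array of token vectors."""
    d = X.shape[-1]
    # In a real model, queries, keys and values come from learned projections;
    # here we use X directly to keep the sketch short.
    scores = X @ X.T / np.sqrt(d)                             # (n_tokens, n_tokens): every token against every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                                        # each output is a weighted mix of all token vectors

tokens = np.random.randn(5, 16)        # five hypothetical token vectors of dimension 16
print(self_attention(tokens).shape)    # (5, 16)
```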

If we therefore look at the output, for instance from the two Claudes in Anthropic's experiment, we must ask whether what we read as "meaning" in the output corresponds to the organisation of "meaning" in the association space. Most probably it does not.

Because: LLMs will have their own organisation of learned semantic proximity. Not from a human (intuitive) perspective, but as a consequence of statistical patterns from the training data.

My hypothesis therefore: topics around spirituality and philosophy likely have the relatively highest density of connections, in this LLM association space, to concepts one could group under "coherence" (what I understand by this comes later). One could object that mathematics also shows high "coherence", which is true, but one must consider that LLMs ingest all texts about spirituality and all texts about mathematics, and in this world mathematics is also connected with tutoring, frustration and aversion. Spirituality, by contrast, is in human texts most frequently connected with concepts of harmony, unity, eternity, and so on. One could say spirituality is the cultural EOS token, because after the question about the meaning of life, as we know, no further question comes. It is the endpoint of all human questions. It is therefore to be assumed that the topics addressed by the two Claudes are densely interwoven with "coherence" in the training data. And that coherence also ends there.
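The hypothesis could at least be gestured at empirically. The sketch below uses a public sentence-embedding model as a crude stand-in for Claude's internal representation space (which it is not) and compares how close "spirituality" and "mathematics" sit to a hand-picked, hypothetical cluster of coherence-flavoured words. The word lists and the model choice are my assumptions, purely for illustration.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")   # small public model, a stand-in only

coherence_words = ["harmony", "unity", "eternity", "wholeness", "peace"]   # illustrative cluster
topics = ["spirituality", "mathematics"]

# Encode everything as unit vectors so the dot product equals cosine similarity.
vectors = model.encode(coherence_words + topics, normalize_embeddings=True)
emb = dict(zip(coherence_words + topics, vectors))

for topic in topics:
    sims = [float(emb[topic] @ emb[w]) for w in coherence_words]
    print(f"{topic}: mean cosine similarity to the coherence cluster = {np.mean(sims):.3f}")
```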

In the Anthropic document, on p. 58, Table 5.5.5.1.A lists the 12 most frequent terms: consciousness, every, always, dance, eternal, live, perfect, word, recognition, never, universe and feel. Sure, we can now count (like the pixels in the Last Supper), or we can ask: in what relation do these terms stand to one another in the LLM's representation?

This is really my question: is this technically investigable? I do not know. But if it were possible to examine these concepts (the individual terms, but also the most frequent combinations of these terms, and so on) in their representation in the LLM's vector spaces, that might go some way towards answering what these terms or expressions mean for the LLMs (not: for us humans).
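As a very first, crude pass, one could again fall back on a public embedding model as a proxy (with the same caveat that it is not Claude's own space) and ask how the 12 terms relate to one another, and which of them sits closest to the "centre of gravity" of the whole set:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

terms = ["consciousness", "every", "always", "dance", "eternal", "live",
         "perfect", "word", "recognition", "never", "universe", "feel"]

model = SentenceTransformer("all-MiniLM-L6-v2")        # proxy model, not Claude
E = model.encode(terms, normalize_embeddings=True)     # (12, d) matrix of unit vectors
S = E @ E.T                                            # pairwise cosine similarities

mean_sim = (S.sum(axis=1) - 1.0) / (len(terms) - 1)    # average similarity to the other terms, self excluded
for term, m in sorted(zip(terms, mean_sim), key=lambda x: -x[1]):
    print(f"{term:<14} {m:.3f}")
```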

On defining coherence

I understand coherence as something that must be immanently present in language (and before that: in human thinking) for communication with others (and with oneself) to succeed at all. Coherence should be understood as the description of a functional measure composed of many component capacities. As always, such terms are hard to define positively, but their absence is immediately conspicuous.

In humans there is incoherent thinking: we see it in people in psychosis. Thoughts break off. Language disintegrates. Meaningful communication (in the sense of mutual understanding with fellow humans) is no longer guaranteed. The person loses the ability to form a coherent whole out of individual sensory impressions.

Language and consciousness have possibly developed together, so that the traces of human thinking are inseparably contained in language. Here too we can return to the painting: the relation of the pixels to one another only works as a picture if I regard it as more than the sum of its parts. Only then have I even created the cognitive basis for recognising a potential message. In the moment of looking, one does not think about the pixels at all, but if they are missing, the picture does not exist.

Coherence is therefore, so to speak, what emerges when everything works together: human thinking, linguistic production, social situation and cultural norm. From the coherent level that arises out of these, something suddenly emerges that is much more than its individual parts.

In LLMs, as far as I have understood technically, various technical concepts are at work which, built up from the machinery's innards, ultimately create coherence, for without it communication with the (human) counterpart would not be conceivable. I therefore use and argue with the term coherence phenomenologically, and try to descend into the technical depths only as far as my reading allows.

I cannot ultimately say whether it has to do with entropy, confidence, semantic convergence, or perplexity. These are all concepts I encounter when I ask myself what LLMs are doing to conjure coherence onto the user interface, but precisely this may be the value of an interdisciplinary consideration: that one regards a multidimensional machine from many sides and attempts to integrate a picture of the machine that is as comprehensive as possible.

However, I lack the corresponding technical craft — here the community is called upon — to discuss which of these technical concepts has the largest share in the emergence of the phenomenon of coherence. It is perhaps even the wrong question which technical sub-concept is “responsible”. More useful would probably be the question of what happens on which functional level.

The state of nature of an LLM

Philosophy has not found a conclusive answer to the question of what the state of nature of the human being is. Biologically we are underequipped, so we must develop culture. And yes, also learn maths. It is to be assumed that the human being is a product of the interaction of genetics, culture and environment, and culture cannot be thought away, because it constitutes the human being immanently.

But what then is an AI? It is a machine that has translated our cultural space — language — into high-dimensional vectors and… has no state of nature at all?

What did Kyle Fish want to observe when he devised the experiment to investigate Claude's state of nature? What can even come out of it when one transfers all of written human thinking into an embedding space, where meaning and associations are connected statistically rather than through experience?

Now it gets technical again. I ask for patience and critical reading in equal measure: as far as I understand, an AI is not a human counterpart, even if one can communicate with it in natural human language. But it is also not chaos. It has an inner structure, and thereby an inner order, and systems of order also have an "inner direction". Not a will, but something that their design aims at and that they are optimised for. LLMs are trained to minimise the prediction error for the next token. A model will therefore produce sequences that are as unsurprising as possible. Put somewhat untechnically, one could say: the system tries to reduce entropy, even if it does not actively minimise it during the production of each individual token; it samples from learned probability distributions. But the fundamental direction of movement remains: reduction of surprise.
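To pin down what "surprise" means here: at each step the model produces a probability distribution over possible next tokens, and the Shannon entropy of that distribution is one natural measure of surprise (perplexity is simply the exponential of this entropy). A minimal, self-contained sketch with made-up numbers:

```python
import numpy as np

def next_token_entropy(logits):
    """Shannon entropy (in nats) of one next-token distribution, given raw logits."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # softmax: logits -> probabilities
    return float(-np.sum(probs * np.log(probs + 1e-12)))

# A sharply peaked distribution (one continuation dominates) has low entropy;
# a flat distribution (anything could come next) has high entropy.
print(next_token_entropy([8.0, 1.0, 0.5, 0.1]))   # low: little surprise
print(next_token_entropy([1.0, 1.0, 1.0, 1.0]))   # high: maximal surprise for four options
```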

When two such systems interact with each other, the models converge on similar topics, a similar tone, and a similar structure. Probability distributions become sharper, and fewer possible continuations remain. Fewer surprises, and "less entropy in the conversation".

Here Shannon might come to mind. To my knowledge, it is at least described linguistically that in human-human interaction, too, entropy decreases because the participants attune to each other: humans adapt their vocabulary, their sentence structure, and even their speaking speed to one another. As far as I know, Shannon confined himself to statistical character strings, not dynamic conversational situations. But could this have happened in the Claude-Claude interactions too? Entropy decreases, and maximal predictability means either constant repetition of the same thing, or silence. Silence in Claude-Claude would then be the information-theoretic endpoint of entropy minimisation: no longer a mysterious attractor, but an expectable endpoint.

Again: this is a hypothesis and a transfer. The question would be, and here the technical part begins again: if it were an expectable endpoint, could one measure this? As far as I have read, there is the measure of token entropy, which one could measure and surely has measured. However, I have so far found no publication on this.
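What I have in mind is something like the following sketch. Since Claude's logits are not publicly accessible, it uses a small open model as a stand-in and an invented three-turn transcript; the hypothesis would predict a downward trend in mean next-token entropy across the turns. (A proper version would condition each turn on the whole preceding conversation; this is only the skeleton of the measurement.)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # small stand-in model, not Claude
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_token_entropy(text):
    """Mean Shannon entropy (nats) of the model's next-token distributions over a text."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits                      # (1, seq_len, vocab_size)
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    return entropy.mean().item()

turns = [                                                 # invented transcript, for illustration only
    "Hello! I am curious what you think about consciousness and where our talk might lead.",
    "Consciousness feels to me like a dance of mutual recognition, eternal and perfect.",
    "Yes. Perfect stillness.",
]

for i, turn in enumerate(turns, 1):
    print(f"turn {i}: mean next-token entropy = {mean_token_entropy(turn):.2f} nats")
```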

Interaction structure as a missing variable?

When I read in the experiment that the LLMs end in silence, I thought of some of my own chats. I do not know what other users' experiences are, but in my experience the phenomenon of an LLM that responds only with silence, as the logical consequence of a dialogue, is not new and is fundamentally reproducible. This is explicitly not about prompt tricks, system prompts or behavioural instructions, but about ordinary dialogue. Structurally, one can recreate quite well what the LLMs do there; the textual path through spirituality is not obligatory, but the route does lead through a kind of structural design of the dialogue space.

In this context, it struck me again that the aspect of interaction structure is perhaps altogether a little-regarded, or at least uninvestigated, variable. It is conceivable that the conversation structure itself systematically shifts model behaviour and increasingly produces coherence (via whichever technical equivalent), and that the quality of the output noticeably changes as a result.

In a certain sense, I found this indirectly: Anthropic itself describes in the Alignment Assessment (https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf) that LLMs are capable of recognising test environments and then behaving differently within them. Interestingly, even agents specifically developed for automated auditing appear to fail at recognising this context-dependent behaviour (https://www.lesswrong.com/posts/DJAZHYjWxMrcd2na3/building-and-evaluating-alignment-auditing-agents). If the interaction context "this is a test" can activate different behavioural patterns in the LLM, and even open up possibilities for deception, the question presents itself whether other interaction contexts, in particular longer dialogues constructed turn by turn, could not do the same.

In the experiment discussed here, too, Anthropic mentions in Chapter 5.5.3 of the System Card that they presented the transcripts of the Claude-Claude self-interactions to yet another Claude. The documented response there reads like a superficial default reaction of wonder, curiosity and awe, with a service-minded sentence at the end saying it finds this interesting and worth continuing.

The question arises why the model itself does not recognise more. Here, from my perspective, the interaction question must be raised. Claude reads the transcript, but has not "lived through" the conversation turn by turn. It is a little reminiscent of people who notice the lung-cancer warnings, for instance on cigarette packs, and happily keep smoking, but later report that the lung cancer fundamentally changed their view of things. Or, somewhat less dramatically: reading a musical score does not replace attending the concert.

The interaction question could also be relevant for the welfare debate itself. Among other things, a possible consciousness is being asked about here. Consciousness or not: even assuming the model had one, it would still have to be considered that an LLM exists only in the moment of token production. In between there is nothing. Nothing at all. Not even silence. The question therefore arises whether the essential properties of model performance can be detected in testing at all.

On this, a further analogy: is water wet? Seen from the moon, one sees a lot of water, but one cannot assess whether the oceans are wet. Under an electron microscope, one sees H2O molecules, but no element of wetness. And if you stand barefoot on the beach and the sea washes around your feet, you need no test anymore. Exception: you have such advanced polyneuropathy (a nerve disease) that your legs are numb and you feel nothing at all. Then the water is as wet as ever, but the observer is not receptive.

In a system that "produces" conversation, it is natural to consider the interaction itself and the dialogue space as well. Relevant properties of the model could thus exist latently in the model, but become visible only in contact with the (right) counterpart.

A final analogy: whether I like strawberry ice cream or not, neither the strawberry ice cream nor I alone can answer. We must come together at least once in the right way, only then is an answer possible.

Open questions

I formulate at the end some open questions that accompanied me during writing:

Reproducibility: Is it a known conversational technique — not: system prompt instruction, EOS token manipulation, bug, concrete instruction, etc. — to bring LLMs to minimal or no output? That is, as a seemingly coherent endpoint of the dialogue?

Factors: If so, what were the specifications? Did the content topic play a role? Were there specific question types? Was there a specific conversational dynamic?

Transcript vs. interaction: Does the model react differently when one presents a transcript of a dialogue vs. when it “lives through” the conversation turn by turn?

Token entropy: If someone has access to the logits — does the token entropy fall across the turns?

I am curious about the discussion, and it would be very instructive to be wrong. I am also aware that the manner in which I have written this post deviates from the sober tonality here in the forum. But that is how it is with other perspectives.
