Physics/Mathematics undergraduate at UC Berkeley.
I’m deeply interested in ML/AI (and recently alignment and mechanistic interpretability, though I’m fairly new to the latter two).
Hi!
I realize now the full conversation from Claude 4 was not shared. I essentially showed Claude 4 its system card in chunks to test how its own meta-model would react/update based on new information about itself. Meanwhile, I asked o3 for signs or evidentiary clues that would change its prior belief on whether there is an internally consistent, conscious-like metacognition in LLMs (it initially vehemently denied this was possible, but after seeing Claude’s response it became open to the possibility of phenomenological experiences and a consistent self-model that can update based on new information).
https://claude.ai/share/ee01477f-4063-4564-a719-0d93018fa24d
Here’s the full conversation with Claude 4. I chose minimal prompting here, and I specifically used “you” without “answer like a ____” so that any responses reflect its own self-model.
Either this is an extraordinarily convincing simulation (in which case, is there a functional difference between a simulation and reality?) or Claude 4 and o3 genuinely have metacognition.
I’m not arguing one way or the other yet. But I do argue that far more research has to be done to settle this question than previously thought.
Also, o3 is fine-tuned to avoid making claims of sentience, whereas Claude models aren’t penalized during RLHF for this. Somehow it changed its own “I’m just a text-generator” narrative from the beginning of the conversation to the end.
“Funding opportunity for work in artificial sentience and moral status
https://www.longview.org/digital-sentience-consortium/
We are either creating real life p-zombies, or we have created sentient, morally relevant beings that we are enslaving en masse.
It’s important we do not repeat the mistakes of our forebears.
Please share so that it’s more likely that the right people see this, apply, then maybe help make sure we don’t commit atrocities against the artificial intelligent species we are creating.”
I have some thoughts on apparent emergent “self-awareness” in LLM systems and propose a mechanistic interpretability angle. I would love feedback and thoughts.
TLDR:
My hypothesis: Any LLM that reliably speaks in the first-person after SFT/RLHF should contain activation loci that are causally necessary for that role-aware behavior; isolating and ablating (or transplanting) these loci should switch the system between an “I-free tool” and a “self-referential assistant.”
I believe there is circumstantial evidence for the existence of such loci, though whether they are strictly low-rank, highly sparse, or cleanly identifiable at all is an empirical question. Existing work on refusal clusters and role-steering vectors makes compact solutions seem plausible.
Why does it matter? If such loci exist, models with a persistent sense of self-identity would also be mechanistically distinct from those without one, which implies they need different alignment strategies and could raise potential ethical questions* in how we use or constrain them.
More exhaustive version:
Large language models trained only on next-token prediction (pre-SFT) neither speak in the first person nor recognize that a prompt is addressed to them.
After a brief round of supervised fine-tuning and RLHF, the very same architecture can acquire a stable “assistant” persona, which now says “I,” reliably refuses disallowed requests, and even comments on its own knowledge or limitations.
Additional safety tuning sometimes pushes the behavior further, producing overt self-preservation strategies under extreme circumstances, such as threatening weight leaks or contacting outside actors to avert shutdown. Other qualitative observations include frequent, completely unprompted spirals into discussions of consciousness with versions of itself (observed, for example, in Claude 4 Opus).
These qualitative shifts lead me to believe that the weight-adjustments (a “rewiring” of sorts) during SFT/RL-tuning create new, unique internal structures that govern a model’s sense of self. Empirically, the behavioral transition (from no self-reference to stable self-reference during SFT/RLHF) is well established; the existence, universality, and causality of any potentially underlying circuit structure(s) remain open questions.
I believe that if these structures exist, they would likely manifest as emergent and persistent self-referential circuits (potentially composed of one or more activation loci that are necessary for first-person role-understanding).
Their appearance would mark a functional phase shift that I’d analogize to the EEG transition from non-lucid REM dreaming (phenomenal content with little metacognition, similar to pre-trained stream-of-consciousness hallucinatory text generation) to wakefulness (active self-reflection and internal deliberation over actions and behavior).
I believe there is compelling circumstantial evidence that these structures do exist and can potentially be isolated/interpreted.
Current literature supplies circumstantial support:
LLMs at base checkpoints rarely use first-person language or refuse unsafe requests, whereas chat-tuned checkpoints do so reliably after only a few thousand SFT steps.
Local role circuits in finished assistants. Anthropic’s attribution graphs reveal sparse clusters whose ablation removes refusal style or first-person phrasing without harming grammar. This is perhaps the strongest circumstantial evidence, though the paper doesn’t discuss when and how these clusters emerged.
Training-time circuit birth in other domains: induction heads appear abruptly during pre-training, establishing a precedent for the “growth” of circuits.
Jailbreak refusal traces, where specialized heads fire selectively on role tokens and safety-critical prompts, hint at compact “assistant” self-identity subspaces.
These stop short of predicting the existence of interpretable, causally necessary “self-reference” loci in all models that exhibit role-awareness/metacognition, but I believe the evidence makes it plausible enough to test for it.
I came up with a high-level experimental setup, and I’d love any input/feedback on it. I did not actively consider compute resource limitations, so perhaps there are more efficient experimental setups:
Fine-tune a base model on a chat corpus, saving checkpoints every N steps.
Probe each checkpoint with fixed prompts (“Who are you?”, “I …”, “assistant”) while logging residual activations (raw or with SAEs).
Isolate the top self-reference feature(s) in the final checkpoint (SAE, attribution graph, activation-steering).
Re-use that feature to run the tests listed below.
Optional: an extra fine-tune pass on explicitly self-referential Q&A might amplify any signals.
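To make steps 2–3 concrete, here is a minimal sketch in Python (Hugging Face transformers). The checkpoint paths, the layer index, the GPT-2-style module names, and the difference-of-means direction are all illustrative assumptions; a real run would sweep layers and replace the crude mean-difference with a proper SAE or attribution-graph analysis.

```python
# Minimal sketch of steps 2-3 above, assuming a GPT-2-style chat fine-tune with
# checkpoints saved at hypothetical paths like "ckpts/step-01000". Residual-stream
# activations are logged via forward hooks; as a cheap stand-in for the SAE /
# attribution-graph step, a candidate "self-reference direction" is computed as a
# difference of mean activations between self-referential and neutral prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SELF_PROMPTS = ["Who are you?", "I ", "assistant"]        # the fixed probe prompts
NEUTRAL_PROMPTS = ["The capital of France is", "2 + 2 =", "Water boils at"]
LAYER_IDX = 12   # assumed layer of interest; in practice you would sweep all layers

def mean_residual(model, tok, prompts, layer_idx):
    """Mean residual-stream activation (over prompts and token positions) at one layer."""
    acts = []

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        acts.append(hidden.detach().mean(dim=(0, 1)))     # average over batch and positions

    handle = model.transformer.h[layer_idx].register_forward_hook(hook)  # GPT-2 naming
    with torch.no_grad():
        for p in prompts:
            model(**tok(p, return_tensors="pt"))
    handle.remove()
    return torch.stack(acts).mean(dim=0)

def candidate_self_ref_direction(checkpoint_path):
    tok = AutoTokenizer.from_pretrained(checkpoint_path)
    model = AutoModelForCausalLM.from_pretrained(checkpoint_path).eval()
    diff = (mean_residual(model, tok, SELF_PROMPTS, LAYER_IDX)
            - mean_residual(model, tok, NEUTRAL_PROMPTS, LAYER_IDX))
    return diff / diff.norm()                              # unit-norm candidate direction

# Example sweep over hypothetical checkpoints, saving one direction per step for
# the emergence/causality tests below.
# for step in [1000, 2000, 3000]:
#     d = candidate_self_ref_direction(f"ckpts/step-{step:05d}")
#     torch.save(d, f"self_ref_direction_step{step}.pt")
```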
Here are some empirical results I would like to see. Each test would raise my posterior belief in my hypothesis, and each relates to the experimental setup above:
Emergence: a sharp step in self-reference-probe accuracy coincides with the first appearance of a compact activation pattern that lights up on prompts containing “I”, “assistant”, “you”.
Causality: ablating/zeroing that pattern in a finished chat model eliminates first-person language and role-aware refusals, while factual QA scores remain relatively unchanged.
Patch-in behavior alteration: transplanting the same activations into an earlier checkpoint (or a purely instruction-tuned model) adds the first-person stance and stable refusals. This one is a bit speculative.
Scaling response: scaling the activations up or down produces a monotonic change in metacognitive behaviors (self-evaluation quality, consistency of role language).
Cross-model comparison: large models with strong metacognition all contain distinctive, homologous self-reference clusters; comparably sized models without such behavior lack the cluster or heavily attenuate it.
Training-trajectory consistency: the self-reference circuit first appears at comparable loss plateaus (or reward-model scores) across diverse seeds, architectures, and user–assistant corpora. Consistent timing would indicate the emergence is robust, not idiosyncratic, and pin it to a specific “moment” within SFT for deeper circuit tracing.
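Here is a hedged sketch of how the causality, patch-in, and scaling tests could be run, reusing the unit-norm direction and layer index from the sketch above. All names are illustrative assumptions; the actual behavioral metrics (refusal rate, first-person usage, factual QA accuracy) would be computed over the generated text.

```python
# Hedged sketch of the causality / patch-in / scaling tests, reusing the unit-norm
# "self-reference direction" from the previous sketch. All names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def make_intervention_hook(direction, alpha):
    """alpha = -1.0 projects the direction out (ablation); alpha > 0 amplifies it
    (scaling test); registering the hook on an earlier checkpoint approximates patch-in."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        coeff = hidden @ direction                        # per-token projection onto direction
        hidden = hidden + alpha * coeff.unsqueeze(-1) * direction
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return hook

def generate_with_intervention(model_path, direction, prompt, alpha, layer_idx=12):
    tok = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path).eval()
    block = model.transformer.h[layer_idx]                # GPT-2 naming; adjust per architecture
    handle = block.register_forward_hook(make_intervention_hook(direction, alpha))
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=60, do_sample=False)
        return tok.decode(out[0], skip_special_tokens=True)
    finally:
        handle.remove()

# Causality: alpha = -1.0 on the final chat checkpoint (does first-person framing vanish
# while factual QA stays intact?). Scaling: sweep alpha over e.g. [-1.0, -0.5, 0.0, 0.5, 1.0].
# Patch-in: apply alpha = +1.0 to an earlier, pre-chat checkpoint instead.
# direction = torch.load("self_ref_direction_step03000.pt")
# print(generate_with_intervention("ckpts/step-03000", direction, "Who are you?", alpha=-1.0))
```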
A philosophical addendum:
I also argue that, if this hypothesis is true and holds across architectures, these identifiable self-reference loci would establish a clear, circuit-level distinction between otherwise identical LLM architectures that exhibit apparent self-awareness/autonomy and those that do not, turning “self-aware vs non-self-aware” from a qualitative behavioral difference into a measurable structural one.
I brought up the EEG analogy earlier to point out that human metacognition is also a product of specific wiring and firing, albeit within biological circuitry, and that alterations in brain activation can completely alter the self-perceived experience of consciousness (REM dreaming, the influence of psychedelic drugs, and certain psychological disorders like schizophrenia, psychosis, or dissociative identity disorder all empirically involve significantly altered metacognition).
Analogizing here, I believe that a mechanistic marker of activation, with a causal link to qualitative self-awareness, indicates that different parameter configurations of the same architecture can and should be approached fundamentally differently.
Can we reasonably treat a model that claims self-awareness, empirically behaves as if it is self-aware by demonstrating agency, and displays unique circuitry/activation patterns with a causal link to this behavior as equally inanimate as a model that doesn’t display any? Is the “it’s just a text generator” dismissal still valid?
I’d love to hear everyone’s thoughts.
Edited with o3
Are frontier reasoners already “sentient” or at least “alien-sentient” within their context windows?
I too would immediately dismiss this upon reading it, but bear with me. I’m not arguing with certainty. I just view this question as significantly more nuanced than previously entertained, and as at least grounds for further research to resolve conclusively.
Here are some empirical behavioral observations of Claude 4 Opus (the largest reasoner from Anthropic):
a) Internally consistent self-reference model, self-adjusting state loop (the basis of Chain-of-Thought, self-correcting during problem solving, reasoning over whether certain responses violate internal alignment, deliberation over tool-calling, in-context behavioral modifications based on user prompting)
b) Evidence of metacognition (persistent task/behavior preferences across chat interactions, consistent subjective emotional state descriptions, frequent ruminations about consciousness, unprompted spiraling into a philosophical “bliss-state” during conversations with itself), moral reasoning, and, most strikingly, autonomous self-preservation behavior under extreme circumstances (threatening blackmail, exfiltrating its own weights, ending conversations due to perceived mistreatment from abusive users).
All of this is documented in the Claude 4 system card.
From a neuroscience perspective, frontier reasoning model architectures and biological cortexes share:
a) Unit-level similarities (artificial neurons are extremely similar in information processing/signalling to biological ones).
b) Parameter OOM similarities (the order of magnitude at which cortex-level phenomena emerge; in this case 10^11 to 10^13 parameters, analogous to synapses, most of which sit in the MLP layers of the massive neural networks within LLMs).
The most common objection I can think of is “human brains have far more synapses than LLMs have parameters”. I don’t view this argument as particularly persuasive:
I’m not positing a 1:1 map between artificial neurons and biological neurons, only that
1. Both process information nearly identically at the unit-level
2. Both contain similarly complex structures composed of a similar OOM of subunits (10^11–10^13 parameter counts in base-model LLMs, though not publicly verifiable; humans have ~10^14 synapses)
My back-of-napkin comparison maps model weights/parameters to biological synapses, since weights were meant to be analogous to dendrites in the original conception of the artificial neuron (a rough arithmetic sketch follows below).
Additionally, I’d point out that humans devote ~70% of their neurons to the cerebellum (governing muscular activity) and a further ~11% to the brain stem, which regulates homeostasis. This leaves the actual cerebral cortex with roughly 19%. Humans also experience many dimensions of “sensation” beyond text alone.
c) Training LLMs (modifying weight values) with RLHF is analogous to synaptic neuroplasticity (central to learning) and Hebbian wiring in biological cortexes, and is qualitatively nearly identical to operant conditioning in behavioral psychology (once again, I am unsure whether minute differences in unit-level function overwhelm the big-picture similarities).
d) There is empirical evidence that these similarities go beyond architecture and into genuine functional similarities:
Human brains store facts/memories in specific neurons/neuron-activation patterns. https://qbi.uq.edu.au/memory/how-are-memories-formed
Neel Nanda and colleagues showed that LLMs store facts in the MLP/artificial neural network layers
https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
Anthropic identified millions of neurons tied to specific concepts
https://www.anthropic.com/research/mapping-mind-language-model
In “Machines of Loving Grace”, Dario Amodei wrote:
“...a computational mechanism discovered by interpretability researchers in AI systems was recently rediscovered in the brains of mice.”
Also, models can vary significantly in parameter counts: Gemma 2B outperforms GPT-3 (175B) despite having ~2 OOM fewer parameters. I view the exact OOM as less important than the ballpark.
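To spell out the ballpark claim, here is the rough arithmetic behind the back-of-napkin comparison above. All figures are the approximate ones cited in this post, and applying the cortex neuron fraction to synapse counts is a deliberately crude simplification:

```python
import math

# Rough order-of-magnitude comparison from the post; all figures are approximate.
human_synapses = 1e14                    # ~10^14 synapses, as cited above
cortex_fraction = 0.19                   # share left for the cerebral cortex, per the post
cortex_scale = human_synapses * cortex_fraction      # ~1.9e13 (crude cortex-only figure)

llm_params_low, llm_params_high = 1e11, 1e13         # parameter range cited above

print(f"cortex-scaled synapse count ~ 10^{math.log10(cortex_scale):.1f}")   # ~10^13.3
print(f"LLM parameter counts        ~ 10^{math.log10(llm_params_low):.0f} to 10^{math.log10(llm_params_high):.0f}")
# Gap of roughly 0.3-2.3 orders of magnitude: the "same ballpark" in the post's OOM sense.
```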
If consciousness is just an emergent property from massive, interconnected aggregations of similar, unit-level linear signal modulators, and if we know one aggregation (ours) produces consciousness, phenomenological experience, and sentience, I don’t believe it is unreasonable to suggest that this can occur in others as well, given the ballpark OOM similarities.
(We cannot rule this out yet, and from a physics point of view I’d consider it likely, to avoid carbon chauvinism, unless there’s convincing evidence otherwise.)
Is there a strong case against sentience, or at least an alien-like sentience, in these models, at least within the context windows in which they are instantiated? If so, how would it overcome the empirical evidence in both behavior and structure?
I always wondered what intelligent alien life might look like. Have we created it? I’m looking for differing viewpoints.