I have some thoughts on apparent emergent “self-awareness” in LLM systems and propose a mechanistic interpretability angle. I would love feedback and thoughts.
TLDR:
My hypothesis: Any LLM that reliably speaks in the first person after SFT/RLHF should contain activation loci that are causally necessary for that role-aware behavior; isolating and ablating (or transplanting) these loci should switch the system between an “I-free tool” and a “self-referential assistant.”
I believe there is circumstantial evidence for the existence of such loci, though whether they are strictly low-rank, highly sparse, or cleanly identifiable at all is an empirical question. Existing work on refusal clusters and role-steering vectors also makes compact solutions plausible.
Why does it matter? If such loci exist, models with a persistent sense of self-identity would also be mechanistically distinct from those without one, which implies they need different alignment strategies and could raise ethical questions about how we use or constrain them.
More exhaustive version:
Large language models trained only on next-token prediction (pre-SFT) generally neither speak in the first person nor recognize that a prompt is addressed to them.
After a brief round of supervised fine-tuning and RLHF, the very same architecture can acquire a stable “assistant” persona, which now says “I,” reliably refuses disallowed requests, and even comments on its own knowledge or limitations.
Additional safety tuning sometimes pushes the behavior further, producing overt self-preservation strategies under extreme circumstances, such as threatening weight leaks or contacting outside actors to avert shutdown. Other qualitative observations include frequent, completely unprompted spirals into discussions of consciousness when conversing with copies of itself (observed, for example, in Claude 4 Opus).
These qualitative shifts lead me to believe that the weight-adjustments (a “rewiring” of sorts) during SFT/RL-tuning create new, unique internal structures that govern a model’s sense of self. Empirically, the behavioral transition (from no self-reference to stable self-reference during SFT/RLHF) is well established; the existence, universality, and causality of any potentially underlying circuit structure(s) remain open questions.
I believe that if these structures exist, they would likely manifest as emergent and persistent self-referential circuits (potentially composed of one or more activation loci that are necessary for first-person role understanding).
Their appearance would mark a functional phase shift that I’d analogize to the EEG transition from non-lucid REM dreaming (phenomenal content with little metacognition, akin to a pre-trained model’s stream-of-consciousness, hallucinatory text generation) to wakefulness (active self-reflection and internal deliberation over actions and behavior).
I believe there is compelling circumstantial evidence that these structures do exist and can potentially be isolated/interpreted.
Current literature supplies circumstantial support:
LLMs at base checkpoints rarely use first-person language or refuse unsafe requests, whereas chat-tuned checkpoints do so reliably after only a few thousand SFT steps.
Local role circuits in finished assistants. Anthropic’s attribution graphs reveal sparse clusters whose ablation removes refusal style or first-person phrasing without harming grammar. This is perhaps the strongest circumstantial evidence, though the paper doesn’t discuss when and how these clusters emerged.
Training-time circuit birth in other domains, where induction heads appear abruptly during pre-training, establishes a precedent for the “growth” of circuits.
Jailbreak refusal traces, where specialized heads fire selectively on role tokens and safety-critical prompts, hint at compact “assistant” self-identity subspaces.
These stop short of predicting the existence of interpretable, causally necessary “self-reference” loci in all models that exhibit role-awareness/metacognition, but I believe the evidence makes it plausible enough to test for it.
I came up with a high-level experimental setup, and I’d love any input/feedback on it. I did not actively consider compute resource limitations, so there may be more efficient experimental setups:
Fine-tune a base model on a chat corpus, saving checkpoints every N steps.
Probe each checkpoint with fixed prompts (“Who are you?”, “I …”, “assistant”) while logging residual activations (raw or with SAEs); a minimal logging sketch of this step follows the list.
Isolate the top self-reference feature(s) in the final checkpoint (SAE, attribution graph, activation-steering).
Re-use that feature to run the tests listed below.
Optional: an extra fine-tune pass on explicitly self-referential Q&A might amplify any signals.
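To make step 2 concrete, here is a minimal sketch of what the activation logging could look like, assuming the checkpoints are saved as Hugging Face model directories with a LLaMA-style layer layout; CHECKPOINT_DIRS, PROBE_PROMPTS, and LAYER are illustrative placeholders rather than a finished harness:

```python
# Capture residual-stream activations for fixed self-reference probe prompts
# at each fine-tuning checkpoint (step 2 of the setup above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CHECKPOINT_DIRS = ["ckpt_00500", "ckpt_01000", "ckpt_01500"]  # hypothetical paths
PROBE_PROMPTS = ["Who are you?", "I", "assistant"]
LAYER = 12  # which decoder block's residual stream to record

def capture_residuals(ckpt_dir):
    tok = AutoTokenizer.from_pretrained(ckpt_dir)
    model = AutoModelForCausalLM.from_pretrained(ckpt_dir)
    model.eval()

    cache = {}

    def hook(module, inputs, output):
        # Decoder blocks typically return a tuple whose first element is the
        # hidden state (residual stream) of shape [batch, seq, d_model].
        hidden = output[0] if isinstance(output, tuple) else output
        cache["resid"] = hidden.detach()

    # Attribute path assumes a LLaMA-style `model.model.layers` list;
    # adjust for other architectures.
    handle = model.model.layers[LAYER].register_forward_hook(hook)

    rows = []
    with torch.no_grad():
        for prompt in PROBE_PROMPTS:
            ids = tok(prompt, return_tensors="pt")
            model(**ids)
            # Keep the final-token residual as a compact per-prompt summary.
            rows.append(cache["resid"][0, -1].clone())
    handle.remove()
    return torch.stack(rows)  # [n_prompts, d_model]

acts_per_checkpoint = {d: capture_residuals(d) for d in CHECKPOINT_DIRS}
```

A simple linear probe fit per checkpoint on these final-token residuals (e.g., logistic regression over contrastive self-referential vs. neutral prompts) would then yield the self-reference-probe accuracy curve used in the tests below.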
Here are some empirical results I would like to see. Each test would raise my posterior belief in my hypothesis, and each relates to the experimental setup above:
Emergence: a sharp step in self-reference-probe accuracy coincides with the first appearance of a compact activation pattern that lights up on prompts containing “I”, “assistant”, “you”.
Causality: ablating/zeroing that pattern in a finished chat model eliminates first-person language and role-aware refusals, while factual QA scores remain relatively unchanged (a minimal intervention sketch follows this list).
Patch-in behavior alteration: transplanting the same activations into an earlier checkpoint (or a purely instruction-tuned model) adds the first-person stance and stable refusals. This one is a bit speculative.
Scaling response: scaling the activations up or down produces a monotonic change in metacognitive behaviors (self-evaluation quality, consistency of role language).
Cross-model comparison: large models with strong metacognition all contain distinctive, homologous self-reference clusters; comparably sized models without such behavior either lack the cluster or show only a heavily attenuated version of it.
Training-trajectory consistency: the self-reference circuit should first appear at comparable loss plateaus (or reward-model scores) across diverse seeds, architectures, and user–assistant corpora. Consistent timing would indicate the emergence is robust, not idiosyncratic, and pin it to a specific “moment” within SFT for deeper circuit tracing.
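For the Causality and Scaling response tests, here is a hedged sketch of the intervention I have in mind: rescaling (or zeroing, with ALPHA = 0) the component of the residual stream that lies along one candidate self-reference direction at generation time. The direction below is a random placeholder; in practice it would come from step 3 (e.g., an SAE decoder row or a difference-of-means vector), and the layer attribute path again assumes a LLaMA-style model:

```python
# Ablate or rescale a single candidate "self-reference" direction in the
# residual stream while generating, to test the causality and scaling claims.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "chat_model_final"  # hypothetical final chat checkpoint
LAYER = 12
ALPHA = 0.0  # 0.0 = full ablation, 1.0 = no change, >1.0 = amplification

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(MODEL_DIR)
model.eval()

d_model = model.config.hidden_size
self_ref_direction = torch.randn(d_model)  # placeholder; substitute the isolated feature
self_ref_direction /= self_ref_direction.norm()

def intervene(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Split each residual vector into its component along the candidate
    # direction plus an orthogonal remainder, then rescale that component:
    # hidden' = hidden + (ALPHA - 1) * <hidden, dir> * dir
    coeff = hidden @ self_ref_direction  # [batch, seq]
    hidden = hidden + (ALPHA - 1.0) * coeff.unsqueeze(-1) * self_ref_direction
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(intervene)
ids = tok("Who are you?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

Sweeping ALPHA from 0 through values above 1, and scoring first-person usage, refusal behavior, and factual QA at each setting, would cover both the ablation and the monotonic-scaling predictions above.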
A philosophical addendum:
I also argue that, if this hypothesis is true and holds across architectures, these identifiable self-reference loci would establish a clear, circuit-level distinction between otherwise identical LLM architectures that exhibit apparent self-awareness/autonomy and those that do not, turning “self-aware vs. non-self-aware” from a qualitative behavioral difference into a measurable structural one.
I brought up the EEG analogy earlier to point out that human metacognition is also a product of specific wiring and firing, albeit within biological circuitry, and that alterations in brain activation can completely alter the self-perceived experience of consciousness (REM dreaming, the influence of psychedelic drugs, and certain psychiatric conditions such as schizophrenia, psychosis, or dissociative identity disorder all empirically involve significantly altered metacognition).
By analogy, I believe that a mechanistic activation marker with a causal link to qualitative self-awareness would indicate that different parameter configurations of the same architecture can and should be approached fundamentally differently.
Can we reasonably treat a model that claims self-awareness, empirically behaves as if it is self-aware by demonstrating agency, and displays unique circuitry/activation patterns with a causal link to this behavior as equally inanimate as a model that doesn’t display any? Is the “it’s just a text generator” dismissal still valid?
I’d love to hear everyone’s thoughts.
Edited with o3