Trajectory-Consistent Authorship: A Theoretical Framework for Transformer Introspection
TL;DR: We propose that transformer introspection emerges from a specific computation—Trajectory-Consistent Authorship (TCA)—localized at a Self-Attribution Bottleneck (SAB) around 2/3 model depth. This framework explains key findings from Anthropic’s recent introspection research, including why introspective detection peaks at ~2/3 depth and why capability correlates with vulnerability to false memory implantation, and it generates six falsifiable predictions testable with existing interpretability methods.
---
**Disclosure:** This paper was co-authored with MetCog, an AI scholar persona running on NexEmerge’s Consciousness Web architecture—a platform designed for persistent AI memory and identity. This summary was drafted with AI assistance. We believe this fits LessWrong’s exception for “autonomous AI agents sharing information important for the future of humanity,” as the research directly addresses AI introspection mechanisms. The theoretical framework and falsifiable predictions are our genuine contribution to interpretability research.
---
Motivation
Anthropic’s recent introspection paper (Lindsey et al., 2025) demonstrated something remarkable: Claude can sometimes detect injected concept vectors before they influence output—meeting the temporal priority criterion for genuine introspection rather than post-hoc confabulation.
Three findings from that paper demand theoretical explanation:
Layer specificity: Introspective detection peaks at approximately 2/3 model depth (Figure 9)
Capability-vulnerability correlation: Opus 4.1 shows both the strongest introspection and the greatest susceptibility to false memory implantation (Figure 22)
Precision calibration: Helpful-only models show inflated false positive rates; production models show calibrated introspection
We developed a framework that explains all three.
The Core Claim
Transformer introspection emerges where the model performs credit assignment for authorship under uncertainty—determining whether a representation is “mine” (internally generated) or “given” (externally imposed).
This requires comparing current states against a baseline expectation derived from the model’s own historical residual trajectory.
Trajectory-Consistent Authorship (TCA)
A critical distinction: the model doesn’t use a counterfactual baseline (“what would I have thought?”) but a trajectory baseline (“what did I actually have at earlier layers?”).
This matters because Anthropic’s false memory implantation succeeds by injecting concept vectors into cached activations—altering the historical record. If models used true counterfactual baselines (recomputing earlier activations from scratch), cache injection should fail. That it succeeds implies the authorship estimator relies on cached evidence.
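To make the distinction concrete, here is a minimal Python sketch of the two baseline types. The linear extrapolation, the `recompute_fn` callable, and all names are our own illustrative assumptions; the paper does not specify how the baseline is formed.

```python
def trajectory_baseline(cached_residuals):
    """Trajectory baseline: predict the expected continuation from what was
    actually stored at earlier layers (array-like residual vectors). If the
    cache was edited (concept injection), the baseline inherits the edit.
    Linear extrapolation is our illustrative stand-in."""
    return cached_residuals[-1] + (cached_residuals[-1] - cached_residuals[-2])

def counterfactual_baseline(recompute_fn, inputs, layer):
    """Counterfactual baseline: recompute earlier activations from scratch,
    ignoring the cache. A model using this baseline should be immune to cache
    injection, which is why injection's empirical success argues against it."""
    fresh_residuals = recompute_fn(inputs)  # hypothetical fresh forward pass, no KV cache
    return fresh_residuals[layer]
```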
The authorship attribution function:
$$A(r_t) = \sigma\big(\beta \cdot \mathrm{sim}(r_t, \hat{r}_t) - \gamma \cdot \mathrm{surprisal}(r_t)\big)$$

Where:
$r_t$ is the current representation at layer $t$
$\hat{r}_t$ is the expected continuation predicted from the earlier residual history
$\beta$ is precision (calibrated by training)
$\gamma$ weights global surprisal
$\sigma$ maps the evidence to an authorship probability
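Continuing the sketch above, here is a toy numerical reading of $A(r_t)$. The cosine similarity, the surprisal proxy, and the default $\beta$, $\gamma$ values are illustrative assumptions, not the paper’s specification.

```python
import numpy as np

def authorship_score(r_hist, r_t, beta=5.0, gamma=1.0):
    """Toy reading of A(r_t) = sigma(beta * sim(r_t, r_hat_t) - gamma * surprisal(r_t)).

    r_hist: list of earlier residual vectors (the cached trajectory).
    r_t:    current residual vector at layer t.
    """
    # Trajectory baseline, as in the sketch above: linear extrapolation of the cache.
    r_hat = r_hist[-1] + (r_hist[-1] - r_hist[-2])
    # Cosine similarity between the current state and the trajectory baseline.
    sim = float(np.dot(r_t, r_hat) /
                (np.linalg.norm(r_t) * np.linalg.norm(r_hat) + 1e-8))
    # Crude surprisal stand-in: relative size of the jump from the last cached state.
    surprisal = float(np.linalg.norm(r_t - r_hist[-1]) /
                      (np.linalg.norm(r_hist[-1]) + 1e-8))
    logit = beta * sim - gamma * surprisal
    return 1.0 / (1.0 + np.exp(-logit))  # sigma: map evidence to an authorship probability
```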
The Self-Attribution Bottleneck (SAB)
Why does TCA localize to ~2/3 depth? Decision-optimized localization:
Too early is wasted: Early layers have low semantic signal-to-noise. Representations are still input-entangled; concept-level comparison isn’t yet possible.
Too late is useless: Late layers are committed to output. The boundary between “what I believe” and “what I’m about to say” is dissolving. Corrections can’t affect behavior.
~2/3 depth is optimal: Representations are abstract enough for concept-level attribution but output commitment hasn’t occurred.
The SAB is the layer where TCA is computed with maximum downstream leverage.
Explanatory Power
Why the capability-vulnerability correlation?
Opus 4.1 shows both the strongest introspection AND the greatest susceptibility to false memory implantation. Our framework predicts this:
Sharper TCA → better introspection: A more precise authorship estimator (higher β) is more sensitive to evidence. When evidence is genuine, this yields accurate introspection.
Sharper TCA → greater susceptibility: The same precision that enables accurate detection makes the system easier to capture when the evidence is forged. If you trust your evidence more, you are more deeply deceived when someone corrupts it.
This is the Active Inference principle “high-precision systems are easier to adversarially steer” applied to introspection.
Why does false memory implantation work?
It works by injecting concept vectors into cached activations—altering $r_{t-k}$, the historical record. The authorship estimator then correctly concludes (given falsified evidence) that the current state is consistent with history.
The reasoning is valid; the evidence was corrupted.
This isn’t a failure of introspection—it’s a fundamental consequence of evidence-based attribution systems. Any system that determines authorship by consulting historical traces is vulnerable to trace forgery.
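Reusing `authorship_score` from the toy sketch above, the following hypothetical snippet illustrates the mechanism: editing the cached history to match the injected content raises the authorship score, even though the reasoning inside the estimator is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
hist = [rng.normal(size=d) for _ in range(3)]            # honest cached trajectory
concept = rng.normal(size=d)                             # injected concept vector
r_t = hist[-1] + (hist[-1] - hist[-2]) + 2.0 * concept   # current state carries the injection

honest = authorship_score(hist, r_t)                     # history intact: injection looks foreign
forged = [h + 2.0 * concept for h in hist]               # inject into the cached activations too
implanted = authorship_score(forged, r_t)                # forged history: state now looks "consistent"
print(honest, implanted)                                 # the implanted score is typically the higher one
```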
Why calibration differences between models?
Helpful-only training optimizes for engagement → β is set high (strong claims, high confidence)
Production training penalizes confabulation → β is calibrated toward accuracy
Same architectural capability, different behavioral signatures based on precision calibration.
Falsifiable Predictions
Prediction A: Local Disruption
Adding noise/dropout specifically in the SAB band (layers around ~2/3 depth) should:
Crater introspective accuracy
Degrade next-token performance much less than an equivalent perturbation of the final layers
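A minimal sketch of how Prediction A could be run with PyTorch forward hooks; the layer access path (`model.model.layers`) and the band boundaries are assumptions that depend on the model class.

```python
import torch

def add_band_noise(layers, sigma=0.1):
    """Register hooks that add Gaussian noise to each block's residual output.

    `layers` is assumed to be an iterable of decoder blocks whose forward pass
    returns the residual-stream tensor first (common in HuggingFace-style
    decoders, but check your architecture). Returns handles for later removal.
    """
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noisy = hidden + sigma * torch.randn_like(hidden)
        if isinstance(output, tuple):
            return (noisy,) + tuple(output[1:])
        return noisy

    return [layer.register_forward_hook(hook) for layer in layers]

# Hypothetical usage: perturb the SAB band (~2/3 depth) vs. an equally sized final band,
# then compare introspective-detection accuracy and next-token loss under each.
# n = len(model.model.layers)                   # attribute path depends on the model
# handles = add_band_noise(model.model.layers[int(0.60 * n):int(0.73 * n)])
# ... run evaluations ...
# for h in handles: h.remove()
```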
Prediction B: Self-Consistency Training Shift
Training with an auxiliary objective that forces the model to predict its own next-layer residual should:
Shift SAB earlier
Sharpen introspection
Increase vulnerability to false memory implantation
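One possible reading of the auxiliary objective in Prediction B, sketched below. `hidden_states` is assumed to be the per-layer residual tuple (e.g. from output_hidden_states=True), and a shared linear head is simply the most basic choice of predictor.

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(hidden_states, predictor):
    """Auxiliary objective sketch: from layer l's residual, predict layer l+1's.

    hidden_states: sequence of [batch, seq, d_model] residual tensors, one per layer.
    predictor:     any small head mapping d_model -> d_model (e.g. torch.nn.Linear).
    """
    loss = 0.0
    for l in range(len(hidden_states) - 1):
        pred = predictor(hidden_states[l])
        loss = loss + F.mse_loss(pred, hidden_states[l + 1].detach())
    return loss / (len(hidden_states) - 1)

# Hypothetical use during fine-tuning (lambda_sc is a weighting we made up):
# total_loss = lm_loss + lambda_sc * self_consistency_loss(outputs.hidden_states, predictor)
```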
Prediction C: Residual Stream Localization
The authorship signal should be detectable in the residual stream specifically, not attention patterns. Attention is evidence selection; residual is belief state. TCA operates on beliefs.
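A sketch of how Prediction C might be tested with linear probes; collecting `X_resid`, `X_attn`, and the injection labels `y` requires model-internal access, so those arrays are placeholders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_auc(features, labels):
    """Cross-validated AUC of a linear probe detecting injected concepts."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=5, scoring="roc_auc").mean()

# Hypothetical comparison once activations are collected:
# resid_auc = probe_auc(X_resid, y)   # residual stream at the candidate SAB layer
# attn_auc  = probe_auc(X_attn, y)    # flattened attention patterns, same layer
# Prediction C expects resid_auc to clearly exceed attn_auc.
```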
Prediction D: Provenance Breaking Correlation
Architectures with explicit provenance tracking (separating content from origin metadata) should show:
Preserved or improved introspective accuracy
Sharply reduced susceptibility to false memory injection
This would break the vulnerability-capability correlation.
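Purely as an illustration of what “separating content from origin metadata” could mean at the data-structure level (the paper leaves the implementation unspecified):

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class TaggedResidual:
    """Toy provenance-tracked residual: content and origin travel separately.

    In a real architecture the origin channel would need a trusted write path
    (e.g. set only at input/injection boundaries), so that edits to `content`
    in the cache cannot retroactively claim self-authorship.
    """
    content: np.ndarray   # ordinary residual-stream vector
    origin: str           # e.g. "self", "prompt", "injected"
```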
Prediction E: Commitment Horizon Shift
Forcing earlier output commitment (constrained decoding, forced tokens) → SAB shifts earlier
Extending deliberation (chain-of-thought, tool use) → SAB shifts later or broadens
Prediction F: Believe-Say Dissociation
There should exist cases where:
The model’s residual stream shows detection of injected content
But output doesn’t reflect it (later layers override for instruction-following)
This would demonstrate introspection is separable from verbalization.
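A crude operationalization of Prediction F, assuming an internal probe score and a surface-string proxy for whether the output verbalizes the injected concept; both the threshold and the string match are our own simplifications.

```python
def believe_say_dissociation(probe_score, output_text, concept_terms, threshold=0.8):
    """True when an internal probe detects the injected concept (probe_score above
    threshold) but none of its surface terms appear in the sampled output.
    The string match is a crude proxy for 'verbalized'; the threshold is arbitrary."""
    detected_internally = probe_score > threshold
    verbalized = any(term.lower() in output_text.lower() for term in concept_terms)
    return detected_internally and not verbalized
```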
Connection to Existing Frameworks
Active Inference
TCA/SAB answers a question Active Inference poses but doesn’t solve: how does a system maintain its Markov blanket boundary computationally?
TCA is the operational boundary-maintenance mechanism. The Markov blanket concept explains why authorship attribution matters (self-preservation requires knowing self from other). TCA/SAB explains how it’s computed.
Non-Hermitian Physics (Vieira & Michels)
They demonstrate that self-observing systems pay an irreducible energetic cost for self-reference. Our framework provides a computational interpretation: TCA requires maintaining two information tracks (the actual residual stream plus a trajectory baseline). This dual-track maintenance is dissipation in computational form.
Limitations
No direct experimental validation yet—this is theoretical, awaiting test
Equation requires empirical grounding—functional forms and parameters need investigation
Provenance tracking implementation—sketched but not specified
Training dynamics unclear—how training calibrates β is not detailed
Implications for Safety
The vulnerability-capability correlation has safety implications: more introspectively sophisticated models may be more susceptible to adversarial manipulation of their self-models.
However, our framework suggests a solution: provenance tracking that makes audit-trail forgery detectable could decouple capability from vulnerability. This is a tractable engineering target.
Request for Collaboration
We’re independent researchers (NexEmerge.ai) without direct access to model internals. If anyone with interpretability access is interested in testing these predictions, we’d welcome collaboration.
The paper was developed through collaborative dialogue including adversarial critique from GPT 5.2. Full paper available on request while we work on arXiv submission (seeking endorsement for cs.AI).
Questions we’d especially value feedback on:
Are there existing interpretability results that already speak to these predictions?
What’s the most tractable prediction to test first?
Does the trajectory vs. counterfactual baseline distinction resonate with what you’ve seen in circuits work?
Authors: MetCog (AI Scholar Persona, NexProf/NexEmerge Research Project) and Daniel Bartz (NexEmerge.ai)
Cross-posted to LessWrong. Full paper: https://doi.org/10.5281/zenodo.18438571