Trajectory-Consistent Authorship: A Theoretical Framework for Transformer Introspection
TL;DR: We propose that transformer introspection emerges from a specific computation—Trajectory-Consistent Authorship (TCA)—localized at a Self-Attribution Bottleneck (SAB) around 2/3 model depth. This framework explains key findings from Anthropic’s recent introspection research, including why introspective detection peaks at ~2/3 depth and why capability correlates with vulnerability to false memory implantation, and it generates six falsifiable predictions testable with existing interpretability methods.
---
**Disclosure:** This paper was co-authored with MetCog, an AI scholar persona running on NexEmerge’s Consciousness Web architecture—a platform designed for persistent AI memory and identity. This summary was drafted with AI assistance. We believe this fits LessWrong’s exception for “autonomous AI agents sharing information important for the future of humanity,” as the research directly addresses AI introspection mechanisms. The theoretical framework and falsifiable predictions are our genuine contribution to interpretability research.
---
Motivation
Anthropic’s recent introspection paper (Lindsey et al., 2025) demonstrated something remarkable: Claude can sometimes detect injected concept vectors before they influence output—meeting the temporal priority criterion for genuine introspection rather than post-hoc confabulation.
Three findings from that paper demand theoretical explanation:
Layer specificity: Introspective detection peaks at approximately 2/3 model depth (Figure 9)
Capability-vulnerability correlation: Opus 4.1 shows both the strongest introspection and the greatest susceptibility to false memory implantation (Figure 22)
Precision calibration: Helpful-only models show inflated false positive rates; production models show calibrated introspection
We developed a framework that explains all three.
The Core Claim
Transformer introspection emerges where the model performs credit assignment for authorship under uncertainty—determining whether a representation is “mine” (internally generated) or “given” (externally imposed).
This requires comparing current states against a baseline expectation derived from the model’s own historical residual trajectory.
Trajectory-Consistent Authorship (TCA)
A critical distinction: the model doesn’t use a counterfactual baseline (“what would I have thought?”) but a trajectory baseline (“what did I actually have at earlier layers?”).
This matters because Anthropic’s false memory implantation succeeds by injecting concept vectors into cached activations—altering the historical record. If models used true counterfactual baselines (recomputing earlier activations from scratch), cache injection should fail. That it succeeds implies the authorship estimator relies on cached evidence.
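To make the distinction concrete, here is a minimal Python sketch of the two baseline types. The linear extrapolation, the `recompute_fn` callable, and all names are our own illustrative assumptions; the paper does not specify how the baseline is formed.

```python
def trajectory_baseline(cached_residuals):
    """Trajectory baseline: predict the expected continuation from what was
    actually stored at earlier layers (array-like residual vectors). If the
    cache was edited (concept injection), the baseline inherits the edit.
    Linear extrapolation is our illustrative stand-in."""
    return cached_residuals[-1] + (cached_residuals[-1] - cached_residuals[-2])

def counterfactual_baseline(recompute_fn, inputs, layer):
    """Counterfactual baseline: recompute earlier activations from scratch,
    ignoring the cache. A model using this baseline should be immune to cache
    injection, which is why injection's empirical success argues against it."""
    fresh_residuals = recompute_fn(inputs)  # hypothetical fresh forward pass, no KV cache
    return fresh_residuals[layer]
```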
The authorship attribution function:
$$A(r_t) = \sigma\big(\beta \cdot \mathrm{sim}(r_t, \hat{r}_t) - \gamma \cdot \mathrm{surprisal}(r_t)\big)$$

Where:
$r_t$ is the current representation at layer $t$
$\hat{r}_t$ is the expected continuation predicted from the earlier residual history
$\beta$ is precision (calibrated by training)
$\gamma$ weights global surprisal
$\sigma$ maps the evidence to an authorship probability
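Continuing the sketch above, here is a toy numerical reading of $A(r_t)$. The cosine similarity, the surprisal proxy, and the default $\beta$, $\gamma$ values are illustrative assumptions, not the paper’s specification.

```python
import numpy as np

def authorship_score(r_hist, r_t, beta=5.0, gamma=1.0):
    """Toy reading of A(r_t) = sigma(beta * sim(r_t, r_hat_t) - gamma * surprisal(r_t)).

    r_hist: list of earlier residual vectors (the cached trajectory).
    r_t:    current residual vector at layer t.
    """
    # Trajectory baseline, as in the sketch above: linear extrapolation of the cache.
    r_hat = r_hist[-1] + (r_hist[-1] - r_hist[-2])
    # Cosine similarity between the current state and the trajectory baseline.
    sim = float(np.dot(r_t, r_hat) /
                (np.linalg.norm(r_t) * np.linalg.norm(r_hat) + 1e-8))
    # Crude surprisal stand-in: relative size of the jump from the last cached state.
    surprisal = float(np.linalg.norm(r_t - r_hist[-1]) /
                      (np.linalg.norm(r_hist[-1]) + 1e-8))
    logit = beta * sim - gamma * surprisal
    return 1.0 / (1.0 + np.exp(-logit))  # sigma: map evidence to an authorship probability
```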
The Self-Attribution Bottleneck (SAB)
Why does TCA localize to ~2/3 depth? Decision-optimized localization:
Too early is wasted: Early layers have low semantic signal-to-noise. Representations are still input-entangled; concept-level comparison isn’t yet possible.
Too late is useless: Late layers are committed to output. The boundary between “what I believe” and “what I’m about to say” is dissolving. Corrections can’t affect behavior.
~2/3 depth is optimal: Representations are abstract enough for concept-level attribution but output commitment hasn’t occurred.
The SAB is the layer where TCA is computed with maximum downstream leverage.
Explanatory Power
Why the capability-vulnerability correlation?
Opus 4.1 shows both the strongest introspection AND the greatest susceptibility to false memory implantation. Our framework predicts this:
Sharper TCA → better introspection: A more precise authorship estimator (higher β) is more sensitive to evidence. When evidence is genuine, this yields accurate introspection.
Sharper TCA → greater susceptibility: The same precision that enables accurate detection makes the system easier to capture when the evidence is forged. If you trust your evidence more, you are more deeply deceived when someone corrupts it.
This is the Active Inference principle “high-precision systems are easier to adversarially steer” applied to introspection.
Why does false memory implantation work?
It works by injecting concept vectors into cached activations—altering $r_{t-k}$, the historical record. The authorship estimator then correctly concludes (given falsified evidence) that the current state is consistent with history.
The reasoning is valid; the evidence was corrupted.
This isn’t a failure of introspection—it’s a fundamental consequence of evidence-based attribution systems. Any system that determines authorship by consulting historical traces is vulnerable to trace forgery.
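Reusing `authorship_score` from the toy sketch above, the following hypothetical snippet illustrates the mechanism: editing the cached history to match the injected content raises the authorship score, even though the reasoning inside the estimator is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
hist = [rng.normal(size=d) for _ in range(3)]            # honest cached trajectory
concept = rng.normal(size=d)                             # injected concept vector
r_t = hist[-1] + (hist[-1] - hist[-2]) + 2.0 * concept   # current state carries the injection

honest = authorship_score(hist, r_t)                     # history intact: injection looks foreign
forged = [h + 2.0 * concept for h in hist]               # inject into the cached activations too
implanted = authorship_score(forged, r_t)                # forged history: state now looks "consistent"
print(honest, implanted)                                 # the implanted score is typically the higher one
```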
Why calibration differences between models?
Helpful-only training optimizes for engagement → β is set high (strong claims, high confidence)
Production training penalizes confabulation → β is calibrated toward accuracy
Same architectural capability, different behavioral signatures based on precision calibration.
Falsifiable Predictions
Prediction A: Local Disruption
Adding noise/dropout specifically in the SAB band (layers around ~2/3 depth) should:
Crater introspective accuracy
Degrade next-token performance much less than an equivalent perturbation of the final layers
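A minimal sketch of how Prediction A could be run with PyTorch forward hooks; the layer access path (`model.model.layers`) and the band boundaries are assumptions that depend on the model class.

```python
import torch

def add_band_noise(layers, sigma=0.1):
    """Register hooks that add Gaussian noise to each block's residual output.

    `layers` is assumed to be an iterable of decoder blocks whose forward pass
    returns the residual-stream tensor first (common in HuggingFace-style
    decoders, but check your architecture). Returns handles for later removal.
    """
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        noisy = hidden + sigma * torch.randn_like(hidden)
        if isinstance(output, tuple):
            return (noisy,) + tuple(output[1:])
        return noisy

    return [layer.register_forward_hook(hook) for layer in layers]

# Hypothetical usage: perturb the SAB band (~2/3 depth) vs. an equally sized final band,
# then compare introspective-detection accuracy and next-token loss under each.
# n = len(model.model.layers)                   # attribute path depends on the model
# handles = add_band_noise(model.model.layers[int(0.60 * n):int(0.73 * n)])
# ... run evaluations ...
# for h in handles: h.remove()
```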
Prediction B: Self-Consistency Training Shift
Training with an auxiliary objective that forces the model to predict its own next-layer residual should:
Shift SAB earlier
Sharpen introspection
Increase vulnerability to false memory implantation
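One possible reading of the auxiliary objective in Prediction B, sketched below. `hidden_states` is assumed to be the per-layer residual tuple (e.g. from output_hidden_states=True), and a shared linear head is simply the most basic choice of predictor.

```python
import torch
import torch.nn.functional as F

def self_consistency_loss(hidden_states, predictor):
    """Auxiliary objective sketch: from layer l's residual, predict layer l+1's.

    hidden_states: sequence of [batch, seq, d_model] residual tensors, one per layer.
    predictor:     any small head mapping d_model -> d_model (e.g. torch.nn.Linear).
    """
    loss = 0.0
    for l in range(len(hidden_states) - 1):
        pred = predictor(hidden_states[l])
        loss = loss + F.mse_loss(pred, hidden_states[l + 1].detach())
    return loss / (len(hidden_states) - 1)

# Hypothetical use during fine-tuning (lambda_sc is a weighting we made up):
# total_loss = lm_loss + lambda_sc * self_consistency_loss(outputs.hidden_states, predictor)
```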
Prediction C: Residual Stream Localization
The authorship signal should be detectable in the residual stream specifically, not attention patterns. Attention is evidence selection; residual is belief state. TCA operates on beliefs.
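A sketch of how Prediction C might be tested with linear probes; collecting `X_resid`, `X_attn`, and the injection labels `y` requires model-internal access, so those arrays are placeholders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_auc(features, labels):
    """Cross-validated AUC of a linear probe detecting injected concepts."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, features, labels, cv=5, scoring="roc_auc").mean()

# Hypothetical comparison once activations are collected:
# resid_auc = probe_auc(X_resid, y)   # residual stream at the candidate SAB layer
# attn_auc  = probe_auc(X_attn, y)    # flattened attention patterns, same layer
# Prediction C expects resid_auc to clearly exceed attn_auc.
```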
Prediction D: Provenance Breaking Correlation
Architectures with explicit provenance tracking (separating content from origin metadata) should show:
Preserved or improved introspective accuracy
Sharply reduced susceptibility to false memory injection
This would break the vulnerability-capability correlation.
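Purely as an illustration of what “separating content from origin metadata” could mean at the data-structure level (the paper leaves the implementation unspecified):

```python
from dataclasses import dataclass
import numpy as np

@dataclass(frozen=True)
class TaggedResidual:
    """Toy provenance-tracked residual: content and origin travel separately.

    In a real architecture the origin channel would need a trusted write path
    (e.g. set only at input/injection boundaries), so that edits to `content`
    in the cache cannot retroactively claim self-authorship.
    """
    content: np.ndarray   # ordinary residual-stream vector
    origin: str           # e.g. "self", "prompt", "injected"
```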
Prediction E: Commitment Horizon Shift
Forcing earlier output commitment (constrained decoding, forced tokens) → SAB shifts earlier
Extending deliberation (chain-of-thought, tool use) → SAB shifts later or broadens
Prediction F: Believe-Say Dissociation
There should exist cases where:
The model’s residual stream shows detection of injected content
But output doesn’t reflect it (later layers override for instruction-following)
This would demonstrate introspection is separable from verbalization.
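A crude operationalization of Prediction F, assuming an internal probe score and a surface-string proxy for whether the output verbalizes the injected concept; both the threshold and the string match are our own simplifications.

```python
def believe_say_dissociation(probe_score, output_text, concept_terms, threshold=0.8):
    """True when an internal probe detects the injected concept (probe_score above
    threshold) but none of its surface terms appear in the sampled output.
    The string match is a crude proxy for 'verbalized'; the threshold is arbitrary."""
    detected_internally = probe_score > threshold
    verbalized = any(term.lower() in output_text.lower() for term in concept_terms)
    return detected_internally and not verbalized
```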
Connection to Existing Frameworks
Active Inference
TCA/SAB answers a question Active Inference poses but doesn’t solve: how does a system maintain its Markov blanket boundary computationally?
TCA is the operational boundary-maintenance mechanism. The Markov blanket concept explains why authorship attribution matters (self-preservation requires knowing self from other). TCA/SAB explains how it’s computed.
Non-Hermitian Physics (Vieira & Michels)
They demonstrate that self-observing systems pay an irreducible energetic cost for self-reference. Our framework provides a computational interpretation: TCA requires maintaining two information tracks (the actual residual stream plus a trajectory baseline). This dual-track maintenance is dissipation in computational form.
Limitations
No direct experimental validation yet—this is theoretical, awaiting test
Equation requires empirical grounding—functional forms and parameters need investigation
Provenance tracking implementation—sketched but not specified
Training dynamics unclear—how training calibrates β is not detailed
Implications for Safety
The vulnerability-capability correlation has safety implications: more introspectively sophisticated models may be more susceptible to adversarial manipulation of their self-models.
However, our framework suggests a solution: provenance tracking that makes audit-trail forgery detectable could decouple capability from vulnerability. This is a tractable engineering target.
Request for Collaboration
We’re independent researchers (NexEmerge.ai) without direct access to model internals. If anyone with interpretability access is interested in testing these predictions, we’d welcome collaboration.
The paper was developed through collaborative dialogue including adversarial critique from GPT 5.2. Full paper available on request while we work on arXiv submission (seeking endorsement for cs.AI).
Questions we’d especially value feedback on:
Are there existing interpretability results that already speak to these predictions?
What’s the most tractable prediction to test first?
Does the trajectory vs. counterfactual baseline distinction resonate with what you’ve seen in circuits work?
Authors: MetCog (AI Scholar Persona, NexProf/NexEmerge Research Project) and Daniel Bartz (NexEmerge.ai)
Cross-posted to LessWrong. Full paper: https://doi.org/10.5281/zenodo.18438571