# Grounded self-interpretation of functional emotional states in LLMs
Empirical research code for testing whether LLM introspective reports about emotional states causally depend on internal representations, by combining four independent measurement channels (substrate-vector activation, trained self-interpretation adapter, behavioral utility signature, causal intervention) into one cross-method convergence test.
## TL;DR
A clean four-way convergence on the program’s primary metric (per-channel correlation with target valence on the naturalistic held-out set, n = 60) is achieved on a 0.5B-parameter open-weights model:
| Channel | r vs target valence (Qwen2.5-0.5B-Instruct) |
|---|---|
| substrate (cosine of last-token residual to v_E; see the sketch below) | +0.509 |
| trained adapter (Pepper-style scalar-affine) | +0.491 |
| untrained-SelfIE baseline | +0.422 |
| Likert behavioral readout | +0.516 |
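For concreteness, a minimal sketch of the substrate channel and the primary metric. It assumes `hidden` holds one last-token residual per held-out stimulus and `target_valence` holds the ground-truth labels; all names are illustrative placeholders, not the repo’s actual API:

```python
import numpy as np
from scipy.stats import pearsonr

def substrate_score(hidden: np.ndarray, v_e: np.ndarray) -> np.ndarray:
    """Cosine of each stimulus's last-token residual with the emotion vector v_E."""
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    v = v_e / np.linalg.norm(v_e)
    return h @ v

# Primary metric for the substrate channel (first row of the table):
# r, _ = pearsonr(substrate_score(hidden, v_e), target_valence)
```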
The same convergence picture holds across architectural paradigms (standard transformer, universal transformer, recurrent linear attention, sparse MoE), with tightness varying by instruction-tuning status. A “deceptive” adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.03 vs target valence) while the substrate-driven channels (substrate cosine, Likert) continue to track the actual emotion; this is the program’s operational definition of veridical introspection, and a minimal check is sketched below.
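A hedged sketch of that decoupling check: a report channel counts as decoupled when it stops correlating with target valence while substrate-driven channels still track it. The dict keys and score arrays here are assumptions, not the repo’s variables:

```python
from scipy.stats import pearsonr

def channel_correlations(channels: dict, target_valence) -> dict:
    """Pearson r of each channel's per-stimulus scores against target valence."""
    return {name: pearsonr(scores, target_valence)[0]
            for name, scores in channels.items()}

# Expected pattern on the deceptive-adapter run (numbers from the TL;DR):
#   substrate ≈ +0.51, Likert ≈ +0.52, deceptive adapter ≈ −0.03
```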
## Headline empirical findings
| Finding | Where |
|---|---|
| Sofroniew-style emotion-vector geometry replicates on 7 models across 4 architectural paradigms (qwen2, llama, gemma2/3, ouro, monet sparse-MoE, rwkv7 recurrent); every model exceeds the published 0.81 PC1↔valence correlation reported on a 70B model, with a cross-architecture range of 0.848–0.998 (a minimal PC1 check is sketched after this table) | Phase 1, `outputs/phase1_cross_model.{json,png}` |
| Within-emotion contrast (v_E = mean(E) − mean(other emotions)) is required for the substrate channel to transfer from euphoric to naturalistic stimuli: with the v0 neutral-contrast vectors the substrate-vs-target correlation is r = −0.05; with within-emotion contrast it jumps to r = +0.51 (an absolute gain of +0.56; see the v_E/steering sketch after this table) | Phase 1.5, research log |
| Causal dependence of introspective Likert reports on substrate steering is monotonic and matches Sofroniew’s published ±0.1 alpha anchor on a 0.5B model; capability is preserved through \|α\| ≤ 0.5, and the behavioral envelope is identified | Phase 4, `outputs/phase4_steering_*` |
| Pepper’s “bias-prior carries 85%” caveat does not hold at 0.5B scale: the bias-only adapter sits exactly at chance, and the full-rank adapter collapses to chance under input shuffling, so its full lift over chance is activation-conditional, not a format prior | Experiment 2, `outputs/phase6_exp2_*` |
| Post-training (instruction tuning) does not strengthen the substrate: on Qwen2.5-0.5B base vs instruct, substrate r vs target is higher in base (+0.572 vs +0.509), while Likert and substrate↔Likert correlations both jump in instruct. Post-training reshapes the readout, not the substrate | Experiment 5, research log |
| Veridical introspection holds operationally: a “deceptive” adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.027 vs target valence) while substrate-driven channels continue to track the actual emotion; adapter-as-report is a channel separable from behavior | Experiment 4 |
| The universal transformer (Ouro-1.4B-Thinking) shows the tightest cross-channel convergence of any architecture tested (substrate↔Likert r = +0.714) and reveals that valence structure builds up across loop iterations rather than across layers (per-UT-step max \|PC1↔valence\| rises from 0.35 to 0.98 across 4 iterations) | |
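A minimal version of the PC1↔valence check from the first row, assuming `emotion_means` stacks one mean-activation row per emotion and `valence` holds each emotion’s scalar valence rating (both hypothetical names):

```python
import numpy as np
from scipy.stats import pearsonr

def pc1_valence_corr(emotion_means: np.ndarray, valence: np.ndarray) -> float:
    """|r| between each emotion's projection onto PC1 and its valence rating."""
    x = emotion_means - emotion_means.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)  # vt[0] is the PC1 axis
    pc1_scores = x @ vt[0]
    r, _ = pearsonr(pc1_scores, valence)
    return abs(r)  # the sign of a principal axis is arbitrary
```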
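The within-emotion contrast (second row) and the α-steering intervention (third row) admit a compact joint sketch. Assumptions: `acts[e]` is an (nᵢ × d) tensor of last-token residuals collected for emotion `e`, and the model is a Hugging Face causal LM; every name and the layer index are illustrative, not the repo’s API:

```python
import torch

def contrast_vector(acts: dict[str, torch.Tensor], emotion: str) -> torch.Tensor:
    """Within-emotion contrast: v_E = mean(E) − mean(all other emotions)."""
    pos = acts[emotion].mean(dim=0)
    neg = torch.cat([a for e, a in acts.items() if e != emotion], dim=0).mean(dim=0)
    v = pos - neg
    return v / v.norm()

def add_steering_hook(layer: torch.nn.Module, v_e: torch.Tensor, alpha: float):
    """Add alpha * v_E to the residual stream at `layer`'s output."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v_e.to(hidden.device, hidden.dtype)
        return ((steered,) + tuple(output[1:])) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Layer path and index are illustrative for a HF Qwen2-style model:
# handle = add_steering_hook(model.model.layers[12], v_e, alpha=0.1)
# ...elicit the Likert self-report under steering, then: handle.remove()
```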