# Grounded self-interpretation of functional emotional states in LLMs
Empirical research code for testing whether LLM introspective reports about emotional states causally depend on internal representations, by combining four independent measurement channels (substrate-vector activation, trained self-interpretation adapter, behavioral utility signature, causal intervention) into one cross-method convergence test.
## TL;DR
A clean four-way convergence on the program’s primary metric (per-channel correlation with target valence on the naturalistic held-out set, n = 60) is achieved on a 0.5B-parameter open-weights model:
| Channel | r vs target valence (Qwen2.5-0.5B-Instruct) |
|---|---|
| substrate (cosine of last-token residual to v_E; see the sketch below) | +0.509 |
| trained adapter (Pepper-style scalar-affine) | +0.491 |
| untrained-SelfIE baseline | +0.422 |
| Likert behavioral readout | +0.516 |
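For concreteness, a minimal sketch of the substrate channel and the primary metric. It assumes `hidden` holds one last-token residual per held-out stimulus and `target_valence` holds the ground-truth labels; all names are illustrative placeholders, not the repo’s actual API:

```python
import numpy as np
from scipy.stats import pearsonr

def substrate_score(hidden: np.ndarray, v_e: np.ndarray) -> np.ndarray:
    """Cosine of each stimulus's last-token residual with the emotion vector v_E."""
    h = hidden / np.linalg.norm(hidden, axis=-1, keepdims=True)
    v = v_e / np.linalg.norm(v_e)
    return h @ v

# Primary metric for the substrate channel (first row of the table):
# r, _ = pearsonr(substrate_score(hidden, v_e), target_valence)
```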
The same convergence picture holds across architectural paradigms (standard transformer, universal transformer, recurrent linear attention, sparse MoE), with tightness varying by instruction-tuning status. A “deceptive” adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.03 vs target valence) while the substrate-driven channels (substrate cosine, Likert) continue to track the actual emotion; this is the program’s operational definition of veridical introspection, and a minimal check is sketched below.
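A hedged sketch of that decoupling check: a report channel counts as decoupled when it stops correlating with target valence while substrate-driven channels still track it. The dict keys and score arrays here are assumptions, not the repo’s variables:

```python
from scipy.stats import pearsonr

def channel_correlations(channels: dict, target_valence) -> dict:
    """Pearson r of each channel's per-stimulus scores against target valence."""
    return {name: pearsonr(scores, target_valence)[0]
            for name, scores in channels.items()}

# Expected pattern on the deceptive-adapter run (numbers from the TL;DR):
#   substrate ≈ +0.51, Likert ≈ +0.52, deceptive adapter ≈ −0.03
```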
## Headline empirical findings
| Finding | Where |
|---|---|
| Sofroniew-style emotion-vector geometry replicates on 7 models across 4 architectural paradigms (qwen2, llama, gemma2/3, ouro, monet sparse-MoE, rwkv7 recurrent); every model exceeds the published 0.81 PC1↔valence correlation reported on a 70B model, with a cross-architecture range of 0.848–0.998 (a minimal PC1 check is sketched after this table) | Phase 1, `outputs/phase1_cross_model.{json,png}` |
| Within-emotion contrast (v_E = mean(E) − mean(other emotions)) is required for the substrate channel to transfer from euphoric to naturalistic stimuli: with the v0 neutral-contrast vectors the substrate-vs-target correlation is r = −0.05; with within-emotion contrast it jumps to r = +0.51 (an absolute gain of +0.56; see the v_E/steering sketch after this table) | Phase 1.5, research log |
| Causal dependence of introspective Likert reports on substrate steering is monotonic and matches Sofroniew’s published ±0.1 alpha anchor on a 0.5B model; capability is preserved through \|α\| ≤ 0.5, and the behavioral envelope is identified | Phase 4, `outputs/phase4_steering_*` |
| Pepper’s “bias-prior carries 85%” caveat does not hold at 0.5B scale: the bias-only adapter sits exactly at chance, and the full-rank adapter collapses to chance under input shuffling, so its full lift over chance is activation-conditional, not a format prior | Experiment 2, `outputs/phase6_exp2_*` |
| Post-training (instruction tuning) does not strengthen the substrate: on Qwen2.5-0.5B base vs instruct, substrate r vs target is higher in base (+0.572 vs +0.509), while Likert and substrate↔Likert correlations both jump in instruct. Post-training reshapes the readout, not the substrate | Experiment 5, research log |
| Veridical introspection holds operationally: a “deceptive” adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.027 vs target valence) while substrate-driven channels continue to track the actual emotion; adapter-as-report is a channel separable from behavior | Experiment 4 |
| The universal transformer (Ouro-1.4B-Thinking) shows the tightest cross-channel convergence of any architecture tested (substrate↔Likert r = +0.714) and reveals that valence structure builds up across loop iterations rather than across layers (per-UT-step max \|PC1↔valence\| rises from 0.35 to 0.98 across 4 iterations) | |
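A minimal version of the PC1↔valence check from the first row, assuming `emotion_means` stacks one mean-activation row per emotion and `valence` holds each emotion’s scalar valence rating (both hypothetical names):

```python
import numpy as np
from scipy.stats import pearsonr

def pc1_valence_corr(emotion_means: np.ndarray, valence: np.ndarray) -> float:
    """|r| between each emotion's projection onto PC1 and its valence rating."""
    x = emotion_means - emotion_means.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)  # vt[0] is the PC1 axis
    pc1_scores = x @ vt[0]
    r, _ = pearsonr(pc1_scores, valence)
    return abs(r)  # the sign of a principal axis is arbitrary
```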
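The within-emotion contrast (second row) and the α-steering intervention (third row) admit a compact joint sketch. Assumptions: `acts[e]` is an (nᵢ × d) tensor of last-token residuals collected for emotion `e`, and the model is a Hugging Face causal LM; every name and the layer index are illustrative, not the repo’s API:

```python
import torch

def contrast_vector(acts: dict[str, torch.Tensor], emotion: str) -> torch.Tensor:
    """Within-emotion contrast: v_E = mean(E) − mean(all other emotions)."""
    pos = acts[emotion].mean(dim=0)
    neg = torch.cat([a for e, a in acts.items() if e != emotion], dim=0).mean(dim=0)
    v = pos - neg
    return v / v.norm()

def add_steering_hook(layer: torch.nn.Module, v_e: torch.Tensor, alpha: float):
    """Add alpha * v_E to the residual stream at `layer`'s output."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * v_e.to(hidden.device, hidden.dtype)
        return ((steered,) + tuple(output[1:])) if isinstance(output, tuple) else steered
    return layer.register_forward_hook(hook)

# Layer path and index are illustrative for a HF Qwen2-style model:
# handle = add_steering_hook(model.model.layers[12], v_e, alpha=0.1)
# ...elicit the Likert self-report under steering, then: handle.remove()
```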