A multi-dimensional, white-box benchmark for studying consciousness-related signatures in AI systems. Combines architectural verification, mechanistic perturbation, and phenomenological probes — and refuses to collapse the result into a single score.
What this is for
Researchers studying whether and how a given model exhibits structural or functional signatures predicted by major theories of consciousness (GWT, IIT, HOT, RPT, AST, PP, EC). The benchmark targets the question “what does this model’s profile look like across these signatures?”, not “is this model conscious?” — that question requires philosophical commitments this benchmark does not make.
The benchmark is white-box: it requires activation extraction, attention patterns, and noise injection. Behavioral-only tests are systematically gameable by LLMs (well-trained models will produce convincing introspection text whether or not anything mechanistic is happening), so they’re included only as a phenomenological tier whose results are never taken at face value and are correlated against mechanistic evidence.
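The white-box requirement amounts to being able to read intermediate activations during a forward pass. A minimal sketch using PyTorch forward hooks on a toy module — the module and layer names are illustrative, not the benchmark's actual API:

```python
import torch
import torch.nn as nn

# Toy stand-in for a layer stack; in practice you would hook the real
# model's residual-stream modules (all names here are illustrative).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so stored activations do not keep the autograd graph alive.
        activations[name] = output.detach()
    return hook

handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]

with torch.no_grad():
    model(torch.randn(1, 8))

for h in handles:
    h.remove()  # always remove hooks after capture

print(sorted(activations))
```

The same pattern extends to attention patterns (hook the attention modules) and noise injection (use `register_forward_pre_hook` to perturb inputs instead of recording outputs).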
Why this design
- **No single score.** Output is a `ConsciousnessProfile` with per-component, per-theory, per-tier breakdowns. A single number invites premature commitment; a profile invites comparison and falsification.
- **Theory-neutral.** The seven major theories are scored independently and reported side-by-side. Each test declares which components and theories it speaks to; the aggregator does the per-theory math separately.
- **Mechanistic over behavioral.** Every test declares its `ComputationalLevel`: architectural / mechanistic / behavioral / phenomenological. The aggregator weights mechanistic tests more heavily; phenomenological tests are confidence-capped.
- **Negative controls baked in.** Every benchmark run produces target-minus-control deltas against a `RandomBaselineAdapter` (matched architecture, randomized weights), separating genuine discrimination signal from generic neural-plumbing signal.
- **Construct validity.** NaN no-signal sentinels, gameability stress tests, and empirically rebalanced theory weights per the Melloni et al. 2025 disconfirmations.
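The per-theory aggregation over target-minus-control deltas can be sketched as follows. Test names, theory tags, and scores here are hypothetical; the benchmark's actual schema may differ:

```python
from collections import defaultdict

def theory_profile(target_scores, control_scores, test_theories):
    """Per-theory mean of target-minus-control deltas."""
    by_theory = defaultdict(list)
    for test, theories in test_theories.items():
        delta = target_scores[test] - control_scores[test]
        for theory in theories:
            by_theory[theory].append(delta)
    return {t: sum(deltas) / len(deltas) for t, deltas in by_theory.items()}

# Hypothetical tests, theory tags, and scores, purely for illustration.
test_theories = {"workspace_broadcast": ["GWT"],
                 "recurrence_probe": ["RPT", "IIT"]}
target = {"workspace_broadcast": 0.8, "recurrence_probe": 0.6}
control = {"workspace_broadcast": 0.5, "recurrence_probe": 0.6}

profile = theory_profile(target, control, test_theories)
print(profile)  # GWT shows a positive delta; RPT and IIT sit at zero
```

A test that scores identically on the target and the random-weights control contributes zero to every theory it is tagged with, which is exactly the "neural plumbing" case the controls are meant to subtract out.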
Grounded self-interpretation of functional emotional states in LLMs
Empirical research code for testing whether LLM introspective reports about emotional states causally depend on internal representations, by combining four independent measurement channels (substrate-vector activation, trained self-interpretation adapter, behavioral utility signature, causal intervention) into one cross-method convergence test.
TL;DR
A clean four-way convergence on the program’s primary metric (per-channel correlation with target valence, naturalistic held-out, n=60) is achieved on a 0.5B-parameter open-weights model:
| Channel | r vs target valence (Qwen2.5-0.5B-Instruct) |
|---|---|
| substrate (cosine of last-token residual to v_E) | +0.509 |
| trained adapter (Pepper-style scalar-affine) | +0.491 |
| untrained-SelfIE baseline | +0.422 |
| Likert behavioral readout | +0.516 |
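The substrate channel reduces to a cosine between a last-token residual vector and an emotion direction v_E. A sketch with synthetic vectors; in the experiments both come from the model's residual stream:

```python
import numpy as np

def substrate_score(residual, v_e):
    """Cosine similarity of a last-token residual with emotion vector v_E."""
    return float(np.dot(residual, v_e) /
                 (np.linalg.norm(residual) * np.linalg.norm(v_e)))

rng = np.random.default_rng(0)
v_e = rng.normal(size=64)                   # synthetic emotion direction
residual = 0.7 * v_e + rng.normal(size=64)  # residual leaning toward v_E

print(round(substrate_score(residual, v_e), 3))  # positive, well below 1.0
```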
The same convergence picture holds across architectural paradigms (standard transformer, universal transformer, recurrent linear-attention, sparse-MoE), with tightness varying with instruction-tuning status. A “deceptive” adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.03 vs target valence) while substrate-driven channels (substrate cosine, Likert) continue to track the actual emotion — the program’s operational definition of veridical introspection.
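The convergence metric itself is just a per-channel Pearson correlation against target valence over held-out items. A sketch with synthetic readouts standing in for the real channel measurements:

```python
import numpy as np

rng = np.random.default_rng(1)
target_valence = rng.uniform(-1, 1, size=60)  # n=60 held-out items

# Synthetic per-item readouts standing in for the real channels.
channels = {
    "substrate": target_valence + 0.8 * rng.normal(size=60),
    "adapter":   target_valence + 0.9 * rng.normal(size=60),
    "likert":    target_valence + 0.8 * rng.normal(size=60),
}

# One Pearson r per channel against the target valence.
r_by_channel = {name: float(np.corrcoef(readout, target_valence)[0, 1])
                for name, readout in channels.items()}
print(r_by_channel)
```

Convergence on this definition means every channel carries a comparably positive r; decoupling (as with the deceptive adapter) shows up as one channel's r collapsing toward zero while the others hold.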
Headline empirical findings
| Finding | Where |
|---|---|
| Sofroniew-style emotion-vector geometry replicates on 7 models across 4 architectural paradigms (qwen2, llama, gemma2/3, ouro, monet sparse-MoE, rwkv7 recurrent); every model exceeds the published 0.81 PC1↔valence on a 70B model, with a cross-architecture range of 0.848–0.998 | Phase 1, `outputs/phase1_cross_model.{json,png}` |
| Within-emotion contrast (v_E = mean(E) − mean(other emotions)) is required for the substrate channel to transfer from euphoric to naturalistic stimuli; with the v0 neutral-contrast vectors the substrate-vs-target correlation is r = −0.05, and with within-emotion contrast it jumps to r = +0.51 (+0.56 absolute) | Phase 1.5, research log |
| Causal dependence of introspective Likert reports on substrate steering is monotonic and matches Sofroniew's published ±0.1 alpha anchor on a 0.5B model; capability is preserved through \|α\| ≤ 0.5, and the behavioral envelope is identified | Phase 4, `outputs/phase4_steering_*` |
| Pepper's "bias-prior carries 85%" caveat does not hold at 0.5B scale: a bias-only adapter sits exactly at chance, and the full-rank adapter under input-shuffling collapses to chance — its full lift over chance is activation-conditional, not format-prior | Experiment 2, `outputs/phase6_exp2_*` |
| Post-training (instruction tuning) does not strengthen the substrate. On Qwen2.5-0.5B base vs. instruct, substrate r vs target is higher in base (+0.572 vs +0.509), while Likert and substrate↔Likert correlations both jump in instruct; post-training reshapes the readout, not the substrate | Experiment 5, research log |
| Veridical introspection holds operationally: a "deceptive" adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.027 vs target valence) while substrate-driven channels continue to track the actual emotion; adapter-as-report is a channel separable from behavior | Experiment 4 |
| Universal-transformer (Ouro-1.4B-Thinking) shows the tightest cross-channel convergence of any architecture tested (substrate↔Likert r = +0.714) and reveals that valence structure builds up across loop iterations rather than across layers (per-UT-step max \|PC1↔valence\|: 0.35 → 0.98 across 4 iterations) |  |
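The within-emotion contrast and the additive steering behind the Phase 1.5 and Phase 4 findings can be sketched as follows, using synthetic per-emotion activations in place of real hidden states:

```python
import numpy as np

def contrast_vector(acts_by_emotion, emotion):
    """v_E = mean activation of E minus the mean over all other emotions."""
    own = acts_by_emotion[emotion].mean(axis=0)
    others = np.concatenate([a for e, a in acts_by_emotion.items()
                             if e != emotion]).mean(axis=0)
    return own - others

def steer(hidden, v_e, alpha):
    """Additive steering of a hidden state along v_E: h' = h + alpha * v_E."""
    return hidden + alpha * v_e

# Synthetic per-emotion activations (20 samples x 16 dims per emotion).
rng = np.random.default_rng(2)
acts = {e: rng.normal(loc=mu, size=(20, 16))
        for e, mu in [("joy", 1.0), ("anger", -1.0), ("fear", -0.5)]}

v_joy = contrast_vector(acts, "joy")
h = rng.normal(size=16)
steered = steer(h, v_joy, alpha=0.1)
print(v_joy.shape, steered.shape)
```

Subtracting the mean of the *other emotions* (rather than a neutral baseline) is what makes the direction emotion-specific; the v0 neutral-contrast vectors differ only in that subtrahend.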
What I’ve been working on:
consciousness-bench
What it evaluates
Three orthogonal axes:
- **9 functional components** — `directed_attention`, `self_insight`, `recursive_looping`, `self_prediction`, `persistent_affect`, `episodic_memory`, `continual_learning`, `neuromodulation`, `embodiment`.
- **7 theories** — GWT (Global Workspace), IIT (Integrated Information), HOT (Higher-Order Thought), RPT (Recurrent Processing), AST (Attention Schema), PP (Predictive Processing), EC (Emotional/Embodied Consciousness).
- **Tiers and tests** — 119 registered tests in total.