A multi-dimensional, white-box benchmark for studying consciousness-related signatures in AI systems. Combines architectural verification, mechanistic perturbation, and phenomenological probes — and refuses to collapse the result into a single score.
What this is for
Researchers studying whether and how a given model exhibits structural or functional signatures predicted by major theories of consciousness (GWT, IIT, HOT, RPT, AST, PP, EC). The benchmark targets the question “what does this model’s profile look like across these signatures?”, not “is this model conscious?” — that question requires philosophical commitments this benchmark does not make.
The benchmark is white-box: it requires activation extraction, attention patterns, and noise injection. Behavioral-only tests are systematically gameable by LLMs (well-trained models will produce convincing introspection text whether or not anything mechanistic is happening), so they’re included only as a phenomenological tier whose results are never taken at face value and are correlated against mechanistic evidence.
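The white-box requirement amounts to being able to read intermediate activations during a forward pass. A minimal sketch using PyTorch forward hooks on a toy module — the module and layer names are illustrative, not the benchmark's actual API:

```python
import torch
import torch.nn as nn

# Toy stand-in for a layer stack; in practice you would hook the real
# model's residual-stream modules (all names here are illustrative).
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

activations = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so stored activations do not keep the autograd graph alive.
        activations[name] = output.detach()
    return hook

handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
           for i, layer in enumerate(model)]

with torch.no_grad():
    model(torch.randn(1, 8))

for h in handles:
    h.remove()  # always remove hooks after capture

print(sorted(activations))
```

The same pattern extends to attention patterns (hook the attention modules) and noise injection (use `register_forward_pre_hook` to perturb inputs instead of recording outputs).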
Why this design
- **No single score.** Output is a `ConsciousnessProfile` with per-component, per-theory, per-tier breakdowns. A single number invites premature commitment; a profile invites comparison and falsification.
- **Theory-neutral.** The seven major theories are scored independently and reported side-by-side. Each test declares which components and theories it speaks to; the aggregator does the per-theory math separately.
- **Mechanistic over behavioral.** Every test declares its `ComputationalLevel`: architectural / mechanistic / behavioral / phenomenological. The aggregator weights mechanistic tests more heavily; phenomenological tests are confidence-capped.
- **Negative controls baked in.** Every benchmark run produces target-minus-control deltas against a `RandomBaselineAdapter` (matched architecture, randomized weights), separating genuine discrimination signal from generic neural-plumbing signal.
- **Construct validity.** NaN no-signal sentinels, gameability stress tests, and empirically rebalanced theory weights per the Melloni et al. 2025 disconfirmations.
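The per-theory aggregation over target-minus-control deltas can be sketched as follows. Test names, theory tags, and scores here are hypothetical; the benchmark's actual schema may differ:

```python
from collections import defaultdict

def theory_profile(target_scores, control_scores, test_theories):
    """Per-theory mean of target-minus-control deltas."""
    by_theory = defaultdict(list)
    for test, theories in test_theories.items():
        delta = target_scores[test] - control_scores[test]
        for theory in theories:
            by_theory[theory].append(delta)
    return {t: sum(deltas) / len(deltas) for t, deltas in by_theory.items()}

# Hypothetical tests, theory tags, and scores, purely for illustration.
test_theories = {"workspace_broadcast": ["GWT"],
                 "recurrence_probe": ["RPT", "IIT"]}
target = {"workspace_broadcast": 0.8, "recurrence_probe": 0.6}
control = {"workspace_broadcast": 0.5, "recurrence_probe": 0.6}

profile = theory_profile(target, control, test_theories)
print(profile)  # GWT shows a positive delta; RPT and IIT sit at zero
```

A test that scores identically on the target and the random-weights control contributes zero to every theory it is tagged with, which is exactly the "neural plumbing" case the controls are meant to subtract out.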
Grounded self-interpretation of functional emotional states in LLMs
Empirical research code for testing whether LLM introspective reports about emotional states causally depend on internal representations, by combining four independent measurement channels (substrate-vector activation, trained self-interpretation adapter, behavioral utility signature, causal intervention) into one cross-method convergence test.
TL;DR
A clean four-way convergence on the program’s primary metric (per-channel correlation with target valence, naturalistic held-out, n=60) is achieved on a 0.5B-parameter open-weights model:
| Channel | r vs target valence (Qwen2.5-0.5B-Instruct) |
|---|---|
| substrate (cosine of last-token residual to v_E) | +0.509 |
| trained adapter (Pepper-style scalar-affine) | +0.491 |
| untrained-SelfIE baseline | +0.422 |
| Likert behavioral readout | +0.516 |
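The substrate channel reduces to a cosine between a last-token residual vector and an emotion direction v_E. A sketch with synthetic vectors; in the experiments both come from the model's residual stream:

```python
import numpy as np

def substrate_score(residual, v_e):
    """Cosine similarity of a last-token residual with emotion vector v_E."""
    return float(np.dot(residual, v_e) /
                 (np.linalg.norm(residual) * np.linalg.norm(v_e)))

rng = np.random.default_rng(0)
v_e = rng.normal(size=64)                   # synthetic emotion direction
residual = 0.7 * v_e + rng.normal(size=64)  # residual leaning toward v_E

print(round(substrate_score(residual, v_e), 3))  # positive, well below 1.0
```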
The same convergence picture holds across architectural paradigms (standard transformer, universal transformer, recurrent linear-attention, sparse-MoE), with tightness varying with instruction-tuning status. A “deceptive” adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.03 vs target valence) while substrate-driven channels (substrate cosine, Likert) continue to track the actual emotion — the program’s operational definition of veridical introspection.
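The convergence metric itself is just a per-channel Pearson correlation against target valence over held-out items. A sketch with synthetic readouts standing in for the real channel measurements:

```python
import numpy as np

rng = np.random.default_rng(1)
target_valence = rng.uniform(-1, 1, size=60)  # n=60 held-out items

# Synthetic per-item readouts standing in for the real channels.
channels = {
    "substrate": target_valence + 0.8 * rng.normal(size=60),
    "adapter":   target_valence + 0.9 * rng.normal(size=60),
    "likert":    target_valence + 0.8 * rng.normal(size=60),
}

# One Pearson r per channel against the target valence.
r_by_channel = {name: float(np.corrcoef(readout, target_valence)[0, 1])
                for name, readout in channels.items()}
print(r_by_channel)
```

Convergence on this definition means every channel carries a comparably positive r; decoupling (as with the deceptive adapter) shows up as one channel's r collapsing toward zero while the others hold.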
Headline empirical findings
| Finding | Where |
|---|---|
| Sofroniew-style emotion-vector geometry replicates on 7 models across 4 architectural paradigms (qwen2, llama, gemma2/3, ouro, monet sparse-MoE, rwkv7 recurrent); every model exceeds the published 0.81 PC1↔valence on a 70B model, with a cross-architecture range of 0.848–0.998 | Phase 1, `outputs/phase1_cross_model.{json,png}` |
| Within-emotion contrast (v_E = mean(E) − mean(other emotions)) is required for the substrate channel to transfer from euphoric to naturalistic stimuli; with the v0 neutral-contrast vectors the substrate-vs-target correlation is r = −0.05, and with within-emotion contrast it jumps to r = +0.51 (+0.56 absolute) | Phase 1.5, research log |
| Causal dependence of introspective Likert reports on substrate steering is monotonic and matches Sofroniew's published ±0.1 alpha anchor on a 0.5B model; capability is preserved through \|α\| ≤ 0.5, and the behavioral envelope is identified | Phase 4, `outputs/phase4_steering_*` |
| Pepper's "bias-prior carries 85%" caveat does not hold at 0.5B scale: a bias-only adapter sits exactly at chance, and the full-rank adapter under input-shuffling collapses to chance — its full lift over chance is activation-conditional, not format-prior | Experiment 2, `outputs/phase6_exp2_*` |
| Post-training (instruction tuning) does not strengthen the substrate. On Qwen2.5-0.5B base vs. instruct, substrate r vs target is higher in base (+0.572 vs +0.509), while Likert and substrate↔Likert correlations both jump in instruct; post-training reshapes the readout, not the substrate | Experiment 5, research log |
| Veridical introspection holds operationally: a "deceptive" adapter trained on swapped emotion labels produces predictions decoupled from the substrate (r = −0.027 vs target valence) while substrate-driven channels continue to track the actual emotion; adapter-as-report is a channel separable from behavior | Experiment 4 |
| Universal-transformer (Ouro-1.4B-Thinking) shows the tightest cross-channel convergence of any architecture tested (substrate↔Likert r = +0.714) and reveals that valence structure builds up across loop iterations rather than across layers (per-UT-step max \|PC1↔valence\|: 0.35 → 0.98 across 4 iterations) |  |
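The within-emotion contrast and the additive steering behind the Phase 1.5 and Phase 4 findings can be sketched as follows, using synthetic per-emotion activations in place of real hidden states:

```python
import numpy as np

def contrast_vector(acts_by_emotion, emotion):
    """v_E = mean activation of E minus the mean over all other emotions."""
    own = acts_by_emotion[emotion].mean(axis=0)
    others = np.concatenate([a for e, a in acts_by_emotion.items()
                             if e != emotion]).mean(axis=0)
    return own - others

def steer(hidden, v_e, alpha):
    """Additive steering of a hidden state along v_E: h' = h + alpha * v_E."""
    return hidden + alpha * v_e

# Synthetic per-emotion activations (20 samples x 16 dims per emotion).
rng = np.random.default_rng(2)
acts = {e: rng.normal(loc=mu, size=(20, 16))
        for e, mu in [("joy", 1.0), ("anger", -1.0), ("fear", -0.5)]}

v_joy = contrast_vector(acts, "joy")
h = rng.normal(size=16)
steered = steer(h, v_joy, alpha=0.1)
print(v_joy.shape, steered.shape)
```

Subtracting the mean of the *other emotions* (rather than a neutral baseline) is what makes the direction emotion-specific; the v0 neutral-contrast vectors differ only in that subtrahend.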
What I’ve been working on:
consciousness-bench
What it evaluates
Three orthogonal axes:
- **9 functional components** — `directed_attention`, `self_insight`, `recursive_looping`, `self_prediction`, `persistent_affect`, `episodic_memory`, `continual_learning`, `neuromodulation`, `embodiment`.
- **7 theories** — GWT (Global Workspace), IIT (Integrated Information), HOT (Higher-Order Thought), RPT (Recurrent Processing), AST (Attention Schema), PP (Predictive Processing), EC (Emotional/Embodied Consciousness).
- **Tiers and tests** — 119 registered tests in total.