Interesting experiment. I just put something out on arXiv that approaches this from a different angle: instead of binary yes/no steering effects, I looked at whether the vocabulary models use during extended self-examination tracks their actual activation dynamics.
tl;dr: it does. When Llama 70B says "loop" during introspection, autocorrelation in its activations is high (r=0.44, p=0.002); when it says "surge", max activation norm is high (r=0.44, p=0.002). Tested across two architectures (Llama, Qwen), with different vocabulary emerging in each but the same principle: words track activations.
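To make the two metrics concrete, here's a minimal sketch of how they could be computed from an activation trace. This is my own illustrative code, not the paper's: the function names, the lag-1 choice, and the synthetic data are all assumptions.

```python
import numpy as np

def lag1_autocorrelation(acts: np.ndarray) -> float:
    """Lag-1 autocorrelation of the mean activation across generation steps.

    acts: (T, d) array of hidden states over T steps (shape is illustrative).
    """
    series = acts.mean(axis=1)            # collapse hidden dim -> (T,)
    a, b = series[:-1] - series[:-1].mean(), series[1:] - series[1:].mean()
    denom = np.sqrt((a**2).sum() * (b**2).sum())
    return float((a * b).sum() / denom) if denom else 0.0

def max_activation_norm(acts: np.ndarray) -> float:
    """Largest L2 norm of any single step's hidden state."""
    return float(np.linalg.norm(acts, axis=1).max())

# Sanity check on synthetic traces: a slowly drifting trajectory (a "loop"-like
# dynamic) should autocorrelate strongly; white noise should not.
rng = np.random.default_rng(0)
drift = np.cumsum(rng.normal(size=(64, 1)), axis=0) + 0.1 * rng.normal(size=(64, 8))
noise = rng.normal(size=(64, 8))
```

The point of the sketch is just that both quantities are cheap scalar summaries of a trajectory, so they can be correlated against vocabulary use per episode.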
The key control is that the correspondence vanishes in descriptive contexts. The model uses the same words ("loop", "expand") when describing other things, with zero correlation against the metrics. So it's not embedding artifacts or frequency effects; it's specific to self-referential processing.
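The test described above can be sketched as a point-biserial correlation: a 0/1 "word appeared in the self-report" indicator correlated with each episode's activation metric, repeated in a descriptive-context control. This is a toy reconstruction on synthetic data, not the paper's analysis; point-biserial correlation is just Pearson correlation with a binary variable, so plain NumPy suffices.

```python
import numpy as np

def point_biserial(word_used: np.ndarray, metric: np.ndarray) -> float:
    """Correlation between a 0/1 word-usage indicator and a continuous metric."""
    return float(np.corrcoef(word_used.astype(float), metric)[0, 1])

rng = np.random.default_rng(1)

# Toy introspection episodes: the word tends to appear when the metric is high.
metric = rng.normal(size=200)
word_intro = ((metric + rng.normal(size=200)) > 0.5).astype(int)
r_intro = point_biserial(word_intro, metric)

# Control: same word in descriptive contexts, generated independently of the
# metric, so the correlation should sit near zero.
word_ctrl = (rng.random(200) > 0.5).astype(int)
r_ctrl = point_biserial(word_ctrl, metric)
```

A significance test on the real data would add a p-value (e.g. via `scipy.stats.pointbiserialr`), but the contrast between the two correlations is the core of the control.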
Paper: https://arxiv.org/abs/2602.11358
Relevant to the confusion-vs-introspection question: if the model's self-report vocabulary systematically tracks activation dynamics, that's harder to explain as pure noise.
It’s interesting that the quantitative predictions for capabilities (benchmarks & revenue) are getting graded rigorously, but the qualitative claims about alignment remain essentially unfalsifiable at the prediction stage. We can’t grade “the model was aligned” until something goes properly wrong.
The mechanistic interpretability work happening now (steering vectors, circuit analysis) might eventually give us quantitative alignment metrics that are as gradeable as SWE-bench scores. Until then, “aligned” is a claim, not a measurement.