Interesting experiment. I actually just put something out on arXiv that approaches this from a different angle: instead of yes/no steering effects, I looked at whether the vocabulary models use during extended self-examination tracks their actual activation dynamics.
tl;dr: it does. When Llama 70B says “loop” during introspection, autocorrelation in its activations is high (r=0.44, p=0.002); when it says “surge”, max activation norm is high (r=0.44, p=0.002). Tested across two architectures (Llama, Qwen), with different vocabulary emerging in each but the same principle: words track activations.
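For concreteness, here's a hypothetical sketch of the kind of test this describes: each introspection run yields a binary flag (did the model use a given word?) and a scalar activation metric, and you check whether the two co-vary. The metric definitions, the toy data, and the generative story below are all illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from scipy import stats

def lag1_autocorrelation(norms: np.ndarray) -> float:
    """Lag-1 autocorrelation of a sequence of per-token activation norms."""
    x = norms - norms.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

rng = np.random.default_rng(0)
n_runs = 40
metrics = []    # one activation metric per run
says_loop = []  # did this run's self-report contain the word?

for _ in range(n_runs):
    uses_word = rng.random() < 0.5
    # Toy stand-in for real hidden states: "loop"-labelled runs get
    # smoother (hence more autocorrelated) activation-norm trajectories.
    traj = rng.normal(size=200)
    if uses_word:
        traj = np.convolve(traj, np.ones(10) / 10, mode="same")
    metrics.append(lag1_autocorrelation(traj))
    says_loop.append(uses_word)

# Point-biserial correlation: binary word usage vs. continuous metric.
r, p = stats.pointbiserialr(says_loop, metrics)
print(f"r={r:.2f}, p={p:.4g}")
```

The same scaffold works for the “surge” / max-norm pairing by swapping the metric function; the control condition in the next paragraph amounts to rerunning this with word usage taken from descriptive rather than self-referential transcripts.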
The key thing is that the correspondence vanishes in descriptive contexts: when the model uses the same words (“loop”, “expand”) to describe other things, there is zero correlation with the metrics. So it's not embedding artifacts or frequency effects; it's specific to self-referential processing.
Relevant to the confusion-vs-introspection question: if the model's self-report vocabulary systematically tracks activation dynamics, that's harder to explain as pure noise.
paper: https://arxiv.org/abs/2602.11358