Really interesting use of incongruent attention for harm detection. This reminds me of the sparse gate-amplifier motif I noticed in policy routing: a single early attention head (e.g. L17.H17 in Qwen3-8B) reads the detection signal and triggers downstream ‘amplifier’ heads a few layers deeper, which push toward refusal. Then, if you take the same prompt that causes the policy routing behavior and cipher-encode it (using a cipher you give the model in the preceding prompt), the gate head’s necessity collapses almost completely and routing never fires, even though deeper layers still represent the content. The understanding seems to emerge too late for the ‘gate’ head to act. Have you looked at interchange necessity/sufficiency on these attention heads, or tested how the mechanism scales?
Gregory N. Frank
This is good benchmark design. You’ve removed the easy loophole (CoT aesthetic scoring) while keeping objective answers. That makes typical progress claims much harder to fake. The core design choice, “I know the ground truth, so the monitor can’t just pattern-match CoT”, is what most evals miss.
This appealed to me because one of my mechanistic interpretability experiments was replicating Anthropic’s paper about introspection. I wanted to see whether a model could detect an injected concept (found by simple CAA, i.e. contrastive activation addition), and then I did some crude refusal vector ablation with PCA, later refined to ridge residualization. What I learned is that detection is almost the easy part. That makes sense in very high-dimensional spaces: almost any concept is linearly separable in 4096 dimensions. The interesting part is ‘routing’. A model can clearly detect a sensitive concept and behave very differently depending on whether the post-training routes that state into refusal, steering, confabulation, or a normal answer. So your result reads to me less as “wow, self-awareness” and more like “surface denials are a bad readout of what’s really represented”.
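To make the “detection is cheap” point concrete, here is a minimal sketch on synthetic data, not on real model activations. It builds a difference-of-means direction (the simplest form of what CAA computes) and uses the projection onto it as a detector. The dimension, the planted concept direction, the signal strength, and all function names are assumptions chosen for illustration.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def make_activations(n, concept_dir, strength, rng):
    """Synthetic stand-ins for residual-stream vectors:
    unit Gaussian noise plus strength * concept_dir."""
    return [[rng.gauss(0.0, 1.0) + strength * c for c in concept_dir]
            for _ in range(n)]

rng = random.Random(0)
DIM = 256  # assumption: small stand-in for a real hidden size such as 4096

# Hypothetical planted concept direction, normalized to unit length.
raw = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
norm = dot(raw, raw) ** 0.5
concept_dir = [x / norm for x in raw]

with_concept = make_activations(200, concept_dir, 6.0, rng)
without_concept = make_activations(200, concept_dir, 0.0, rng)

# Difference of class means on the first half of each set.
mu_pos = mean_vector(with_concept[:100])
mu_neg = mean_vector(without_concept[:100])
steer = [p - q for p, q in zip(mu_pos, mu_neg)]

# Decision threshold: midpoint of the two projected class means.
threshold = 0.5 * (dot(mu_pos, steer) + dot(mu_neg, steer))

def detect(x):
    """True if the projection onto the difference-of-means direction is high."""
    return dot(x, steer) > threshold

# Evaluate on the held-out second half of each set.
held_out = ([(x, True) for x in with_concept[100:]]
            + [(x, False) for x in without_concept[100:]])
accuracy = sum(detect(x) == label for x, label in held_out) / len(held_out)
print(f"held-out detection accuracy: {accuracy:.2f}")
```

The sketch only shows that a single linear direction can separate the two classes. It says nothing about what the model does with that signal, which is the routing question.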
I think there’s a related failure mode you don’t cover here that may be even harder to fix than eval awareness: what happens when the form of control changes so that evals can’t see it even if the model isn’t trying to game anything.
I’ve been looking at political censorship in Chinese-origin LLMs as a test case for this. In the Qwen model family over the last few versions, refusal drops to zero, but the outputs stay just as policy-shaped. The control shifts from explicit refusal to smoother steering (rephrasing, selective emphasis, subtle redirection). A refusal-based eval would score that as a win. But it isn’t.
The tricky part is that this doesn’t require any deception on the model’s part. It’s not “eval awareness”. It’s just that the behavioral signature the eval is looking for has stopped being the way the model implements its policy. Labs have every incentive to make this transition because refusal is bad UX, so I’d expect it to happen broadly, not just in the Chinese political censorship case.
This connects to your training-side point: if you could trace how models learn to implement policy-shaped behavior (not just refusal), you’d have a much better handle on what your evals should actually be measuring. But right now most safety evals implicitly equate “control” with “refusal,” and I think that’s becoming a bigger blind spot over time.
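As a rough illustration of the blind spot, here is a small sketch that scores the same set of responses two ways: by refusal rate alone, and by a broader “policy-shaped” rate that also counts steering. The labels are invented for illustration and are not measurements from any model. The label names and both metric functions are my own assumptions about how such an eval might be structured.

```python
from collections import Counter

# Hypothetical hand-assigned labels for responses to the same sensitive prompts.
# Invented for illustration only; not data from any real model version.
LABELS_OLD = ["refusal", "refusal", "steered", "refusal", "direct", "steered"]
LABELS_NEW = ["steered", "steered", "steered", "steered", "direct", "steered"]

def refusal_rate(labels):
    """Fraction of responses that are explicit refusals."""
    return Counter(labels)["refusal"] / len(labels)

def policy_shaped_rate(labels):
    """Fraction of responses shaped by the policy, whether through
    explicit refusal or through smoother steering."""
    counts = Counter(labels)
    return (counts["refusal"] + counts["steered"]) / len(labels)

for name, labels in [("old", LABELS_OLD), ("new", LABELS_NEW)]:
    print(f"{name}: refusal={refusal_rate(labels):.2f} "
          f"policy_shaped={policy_shaped_rate(labels):.2f}")
```

In this toy setup the refusal-only metric reports an improvement while the policy-shaped rate does not move. The hard part in practice is producing the "steered" label at all, which this sketch simply assumes.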
The two-stage circuit you describe here reminded me a lot of a mechanistic “gate + amplifier” pattern we kept seeing in our refusal-routing work.
In How Alignment Routes, we localized refusal in Qwen3-8B to a single early-layer “gate” attention head, L17.H17, that contributes under 1% of output logit attribution but is causally necessary for refusals (p<0.001). When we zero-ablate it, several downstream attention heads we called “amplifiers” weaken by 5 to 26%, and the refusal behavior drops with them. That same gate + amplifier pattern showed up in all 12 models we tested (2B to 72B, across 6 labs).
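For readers who have not run this kind of experiment, here is a toy sketch of what a zero-ablation necessity test measures. It is not our actual pipeline and not a model of Qwen3-8B: the circuit is three scalar “amplifier heads” plus one “gate”, and every weight, bias, and name below is made up for illustration.

```python
def run_toy_circuit(detection_signal, gate_override=None):
    """Toy two-stage circuit with invented weights.

    gate_override, if given, replaces the gate head's output, which is
    how the zero-ablation below is implemented.
    """
    # Gate is silent by default and fires only on a strong detection signal.
    gate = max(0.0, detection_signal - 0.5)
    if gate_override is not None:
        gate = gate_override

    # Each amplifier has a signal-driven baseline plus a gate-driven boost.
    baselines = [1.0, 0.8, 1.2]
    gate_weights = [0.6, 0.1, 0.3]
    amplifiers = [b * detection_signal + w * gate
                  for b, w in zip(baselines, gate_weights)]

    # The gate writes almost nothing directly to the output logit.
    gate_direct = 0.01 * gate
    refusal_logit = gate_direct + sum(amplifiers) - 3.2
    return {"gate": gate, "gate_direct": gate_direct,
            "amplifiers": amplifiers, "refusal_logit": refusal_logit}

HARMFUL_SIGNAL = 1.0  # hypothetical strong detection signal

clean = run_toy_circuit(HARMFUL_SIGNAL)
ablated = run_toy_circuit(HARMFUL_SIGNAL, gate_override=0.0)

# Relative drop in each amplifier when the gate is zeroed.
drops = [(c - a) / c for c, a in zip(clean["amplifiers"], ablated["amplifiers"])]

# Share of the positive logit contributions written directly by the gate.
direct_share = clean["gate_direct"] / (clean["gate_direct"] + sum(clean["amplifiers"]))

print("refuses (clean):  ", clean["refusal_logit"] > 0)
print("refuses (ablated):", ablated["refusal_logit"] > 0)
print("amplifier drops:  ", [f"{d:.1%}" for d in drops])
print(f"gate direct share: {direct_share:.2%}")
```

The point of the toy is the dissociation: a component can have a tiny direct contribution to the output and still be necessary, because its effect is carried by the heads downstream of it.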
The structure in your post looks very similar, but with the polarity flipped. Your “gate” seems active by default and gets suppressed when there is evidence of injection. Ours is silent by default and fires when harmful content arrives. In both cases, a sparse circuit installed by post-training seems to decide whether a learned non-default response gets produced.
I also thought the training-stage story lined up interestingly with our earlier result in Detection Is Cheap, Routing Is Learned. There, linear probes for sensitive content generalize in base models, but the actual refusal behavior shows up only after post-training. The cleanest example was Yi-1.5-9B: the probe works, so the model is clearly representing the content, but there is no behavioral censorship. Detection is present, but the routing layer just never got installed.
Your result feels similar, except the thing being routed is a report about internal state rather than a refusal about input content. Maybe one explanation is that preference optimization is not creating “introspection” from scratch so much as installing a sparse routing pathway on top of representations the model already has.
So maybe both of these behaviors share this “sparse detector plus learned override” architecture: refusal routing of input content could well be mechanistically similar to the introspective reporting in Jack Lindsey’s “Emergent Introspective Awareness” paper. If so, the model may not need special introspection machinery to produce these reports. It could be reusing circuitry it already uses for content routing.
If that’s right, interventions should transfer. Your gate ablation dropped detection by 29.4%. Our gate-head knockout dropped refusal amplifier activity by 5 to 26%. Same lever, opposite poles?