Really interesting use of incongruent attention for harm detection. This reminds me of the sparse gate-amplifier motif I noticed in policy routing: a single early attention head (e.g. L17.H17 in Qwen3-8B) reads the detection signal and triggers downstream ‘amplifier’ heads a few layers deeper, which push the output toward refusal. If you then take the same prompt that causes the policy routing behavior and cipher-encode it (with a cipher you give the model in the prior prompt), the gate head’s necessity collapses almost completely and routing never fires, even though deeper layers still represent the content. The model’s understanding of the decoded content seems to emerge too late for the ‘gate’ head to act on it. Have you looked at interchange necessity/sufficiency on these attention heads, or tested how the mechanism scales?
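For concreteness, here is a toy sketch of one way interchange necessity and sufficiency could be scored for a single head. Everything in it is hypothetical: `interchange_scores`, `toy_model`, and the two-element activation lists are illustrative stand-ins, not a real model or library API, and the scoring convention (fraction of the trigger-vs-neutral logit gap that the patch removes or restores) is an assumption rather than a fixed definition.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stand-in: a "model" maps per-head activations to a refusal logit.
Model = Callable[[Sequence[float]], float]


def interchange_scores(
    model: Model,
    trigger: Sequence[float],
    neutral: Sequence[float],
    head: int,
) -> Tuple[float, float]:
    """Return (necessity, sufficiency) of `head` as fractions of the logit gap."""
    base_trigger = model(trigger)
    base_neutral = model(neutral)
    gap = base_trigger - base_neutral
    if gap == 0:
        raise ValueError("trigger and neutral runs give the same logit")

    # Necessity: swap the neutral run's head activation into the trigger run
    # and measure how much of the gap disappears.
    patched_trigger = list(trigger)
    patched_trigger[head] = neutral[head]
    necessity = (base_trigger - model(patched_trigger)) / gap

    # Sufficiency: swap the trigger run's head activation into the neutral run
    # and measure how much of the gap is restored.
    patched_neutral = list(neutral)
    patched_neutral[head] = trigger[head]
    sufficiency = (model(patched_neutral) - base_neutral) / gap

    return necessity, sufficiency


def toy_model(acts: Sequence[float]) -> float:
    # Assumed toy circuit: the amplifier only contributes when the gate fires.
    gate, amplifier = acts
    return 4.0 * gate * amplifier


trigger = [1.0, 0.5]  # gate reads the detection signal
neutral = [0.0, 0.5]  # gate silent, amplifier activation unchanged

gate_scores = interchange_scores(toy_model, trigger, neutral, head=0)
amp_scores = interchange_scores(toy_model, trigger, neutral, head=1)
```

In this toy setup the gate (index 0) accounts for the whole gap in both directions, while the amplifier (index 1) scores zero because its activation is identical in the two runs. In a real model the patch would be applied to a head's output at a specific layer through forward hooks rather than to a list entry.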