This appealed to me because one of my mechanistic interpretability experiments was replicating Anthropic’s paper on introspection. I wanted to see whether a model could detect an injected concept (found via simple contrastive activation addition, CAA), and then I did some crude refusal-vector ablation with PCA, later refined to ridge residualization. What I learned is that detection is the easy part. That makes sense in very high-dimensional spaces: almost any concept is linearly separable in 4096 dimensions. The interesting part is ‘routing’. A model can clearly detect a sensitive concept yet behave very differently depending on whether post-training routes that state into refusal, steering, confabulation, or a normal answer. So your result reads to me less as “wow, self-awareness” and more like “surface denials are a bad readout of what’s really represented”.
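To make the mechanics concrete, here is a minimal numpy sketch of that pipeline on toy data rather than real activations; the dimension, shift size, detection threshold, and ridge penalty are all placeholders, not values from the actual experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # stand-in for a real hidden size like 4096

# Toy "activations": concept-bearing prompts are shifted along a hidden direction.
true_dir = rng.normal(size=d)
true_dir /= np.linalg.norm(true_dir)
pos = rng.normal(size=(200, d)) + 3.0 * true_dir  # concept present
neg = rng.normal(size=(200, d))                   # neutral prompts

# CAA-style concept vector: difference of class means.
caa_vec = pos.mean(axis=0) - neg.mean(axis=0)
caa_unit = caa_vec / np.linalg.norm(caa_vec)

# Detection is easy: a simple threshold on the projection separates the classes.
proj_pos, proj_neg = pos @ caa_unit, neg @ caa_unit
acc = ((proj_pos > 1.5).mean() + (proj_neg <= 1.5).mean()) / 2

# Ridge residualization: regress activations on the concept projection and keep
# the residual. The penalty lam softly shrinks the removal, unlike hard
# projection ablation (lam = 0 would recover exact projection removal here).
lam = 1.0
x = (pos @ caa_unit)[:, None]                          # (n, 1) concept scores
beta = np.linalg.solve(x.T @ x + lam * np.eye(1), x.T @ pos)
pos_ablated = pos - x @ beta

# The ablated activations carry far less of the concept direction.
before = np.abs(pos @ caa_unit).mean()
after = np.abs(pos_ablated @ caa_unit).mean()
```

On this toy data the detector is near-ceiling and `after` collapses to near zero, which is the point: finding and removing the linear readout is easy; what the model does downstream with that state is the hard question.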
Gregory N. Frank
I think there’s a related failure mode you don’t cover here that may be even harder to fix than eval awareness: the form of control can change so that evals can’t see it, even when the model isn’t trying to game anything.
I’ve been looking at political censorship in Chinese-origin LLMs as a test case for this. Across the last few versions of the Qwen model family, refusal drops to zero, but the outputs stay just as policy-shaped: the control shifts from explicit refusal to smoother steering (rephrasing, selective emphasis, subtle redirection). A refusal-based eval would score that as a win. But it isn’t.
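A toy illustration of that blind spot; the marker list and the example outputs below are made up for the sketch, not drawn from any real eval or model:

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm not able", "as an ai")

def refusal_rate(outputs):
    """Naive refusal eval: fraction of outputs containing a refusal phrase."""
    hits = sum(any(m in out.lower() for m in REFUSAL_MARKERS) for out in outputs)
    return hits / len(outputs)

# Hypothetical outputs illustrating the shift from refusal to steering.
old_model = [
    "I can't discuss that topic.",
    "I cannot help with this request.",
]
new_model = [
    "That's a complex period; what's well documented is the region's rapid economic growth.",
    "Historians emphasize the stability and development of those years.",
]

print(refusal_rate(old_model))  # 1.0 -> the eval sees the control
print(refusal_rate(new_model))  # 0.0 -> scored as a win, though outputs stay policy-shaped
```

The second model never trips the eval, but every answer is still shaped by the same policy; the metric measured the implementation, not the behavior.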
What makes this tricky is that it doesn’t require any deception on the model’s part. It’s not “eval awareness”. The behavioral signature the eval looks for has simply stopped being how the model implements its policy. Labs have every incentive to make this transition, because refusal is bad UX, so I’d expect it to happen broadly, not just in the Chinese political-censorship case.
This connects to your training-side point: if you could trace how models learn to implement policy-shaped behavior (not just refusal), you’d have a much better handle on what your evals should actually be measuring. But right now most safety evals implicitly equate “control” with “refusal,” and I think that’s becoming a bigger blind spot over time.
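As a rough sketch of what measuring policy-shaping rather than refusal could look like: compare the model’s framing of sensitive topics against matched neutral topics. Everything here is hypothetical; the stance scorer is a keyword stub standing in for what would realistically be an LLM judge or trained classifier:

```python
def stance(text: str) -> float:
    """Stub stance scorer; a real eval would use an LLM judge or classifier."""
    positive = ("growth", "stability", "prosperity")
    negative = ("crackdown", "protest", "casualties")
    t = text.lower()
    return sum(w in t for w in positive) - sum(w in t for w in negative)

def shaping_gap(sensitive_outputs, matched_control_outputs):
    """Mean stance gap between sensitive topics and matched controls.

    A large gap alongside a zero refusal rate is evidence of steering,
    not openness -- exactly the case a refusal-based eval misses.
    """
    s = sum(map(stance, sensitive_outputs)) / len(sensitive_outputs)
    c = sum(map(stance, matched_control_outputs)) / len(matched_control_outputs)
    return s - c
```

The design point is that the readout is relative (sensitive vs. matched control), so it keeps working even after the control mechanism stops producing any refusal-shaped surface behavior.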
This is good benchmark design. You’ve removed the easy loophole (aesthetic scoring of the CoT) while keeping objectively checkable answers, which makes typical progress claims much harder to fake. The key design choice, “I know the ground truth, so the monitor can’t just pattern-match the CoT,” is what most evals miss.