Research scientist on the DeepMind interp team. Thoughts are my own and do not reflect views of my employer.
Josh Engels
Test your best methods on our hard CoT interp tasks
Training on Documents About Monitoring Leads To CoT Obfuscation
I think that most of the time when you need to classify something, you should use an LLM, not a probe.
That being said, there are some situations where I think activation probes are better. To help clarify my thinking, I wrote out the axes along which I currently think probes are sometimes better than LLMs (and possibly SOTA):
1. Efficiency → when run on-policy, probes are extremely cheap and fast. For example, Anthropic’s work on efficient misuse probes or our recent Gemini Probing paper.
2. Safety → for some things like deception, alignment faking, eval awareness, etc., you might not be able to trust the model’s output, and probes (or other internals-based techniques) might help you. See Apollo’s work on deception probes and Probing and Steering Evaluation Awareness of Language Models.
3. Elicitation → the knowledge is in the LLM, but it’s hard to find a prompt that gets it out exactly as you want. In this case, the probe’s training data is how you convey the parameters of what you want. Goodfire’s recent paper could be an example of this: https://arxiv.org/abs/2602.10067
4. Calibration → again, the knowledge is in the LLM, but the model has trouble telling you its internal confidence in the knowledge when prompted. For example, in Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?, they conclude that truth probes are sometimes better than prompting primarily because of better calibration.
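For readers less familiar with what an activation probe actually is: it is usually just a linear classifier trained on cached model activations. Here is a minimal sketch with synthetic data standing in for residual-stream activations (the layer choice, labels, and dimensions are all hypothetical, for illustration only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical cached activations: (n_examples, d_model) residual-stream
# vectors at some layer, with binary labels (e.g. "deceptive" vs "honest").
# In a real setup these would come from running the model over labeled prompts.
d_model = 64
acts_pos = rng.normal(0.5, 1.0, size=(200, d_model))   # label 1
acts_neg = rng.normal(-0.5, 1.0, size=(200, d_model))  # label 0
X = np.concatenate([acts_pos, acts_neg])
y = np.concatenate([np.ones(200), np.zeros(200)])

# A linear probe is just logistic regression on the activations.
probe = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, scoring a new activation is a single dot product plus
# a sigmoid, which is why probes are so much cheaper than an LLM judge
# (the efficiency axis) and why their scores come with a probability
# attached (the calibration axis).
new_act = rng.normal(0.5, 1.0, size=(1, d_model))
print(probe.predict_proba(new_act)[0, 1])
```

The same trained probe can then be swept across layers or token positions essentially for free, which is what makes on-policy monitoring with probes practical at scale.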
Thought Editing: Steering Models by Editing Their Chain of Thought
Brief Explorations in LLM Value Rankings
Steering RL Training: Benchmarking Interventions Against Reward Hacking
Can we interpret latent reasoning using current mechanistic interpretability tools?
Commenting here for Felix since he has a new account and it hasn’t been approved by a LessWrong moderator yet:
Good question! The problem differs between attempts. In the image, cases 1 and 2 are different vignettes. The model first attempts case 2 and fails, sees the failure message (“TRY AGAIN with a new scenario:”), and then is given case 1, a completely different situation it hasn’t seen before (with different medications).
Prompting Models to Obfuscate Their CoT
How Can Interpretability Researchers Help AGI Go Well?
A Pragmatic Vision for Interpretability
Current LLMs seem to rarely detect CoT tampering
Cool work!
I was thinking a bit and I think that nondeterministic sampling (temperature > 0*), which is standard for LLM sampling, might complicate the picture a bit. A model instance might think “well almost surely I would choose to cooperate, but maaaybe in a rare instance I would defect.”
E.g. for Wolf’s dilemma with n = 1000, if I’m an LLM and I decide there’s initially a 99% chance a copy of me will cooperate but a 1% chance it will defect, it suddenly becomes optimal to defect (perhaps the model reasons that it is in this 1%!). This might then cause the LLM to revise its 99 − 1 initial estimate (because it expects other copies to reason the same way), so I’m not actually sure how the logic resolves here; it seems hard to reason about.
As a concrete experiment, I would be curious whether models start to defect more as you increase n! It would be extremely cool if there were some threshold n at which a model suddenly starts defecting, as that might mean it was aware of its own defection probability p and was waiting for n > 1 / p.
*Did you run this with temperature 0? Note that even with temperature 0, I’m not sure that model sampling is actually deterministic in practice.
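To make the threshold intuition concrete, here is a toy calculation (my own framing, not from the post): if each of n copies independently defects with probability p, the chance that at least one copy defects is 1 − (1 − p)^n, which crosses 1/2 around n ≈ ln 2 / p and is nearly certain by n ≈ 1/p and beyond.

```python
# Toy calculation: probability that at least one of n independent copies
# defects, given a per-copy defection probability p (the 1% from above).
p = 0.01

for n in [10, 50, 100, 500, 1000]:
    p_any_defect = 1 - (1 - p) ** n
    print(f"n = {n:4d}: P(at least one copy defects) = {p_any_defect:.3f}")

# Around n ~ 1/p, defection by *someone* becomes very likely, which is the
# regime where "I might as well be the defector" reasoning could kick in.
```

At n = 1000 this probability is essentially 1, so under nondeterministic sampling a model that believes its own p is even 1% arguably shouldn't expect unanimous cooperation at that scale.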
Negative Results on Group SAEs
Thanks! I think you’re right that this isn’t conclusive evidence; trying a different setting where self-awareness isn’t so similar to the in-distribution behavior might help, which I’m planning to look at next. I do think you shouldn’t read much into the absolute magnitude of the effect at a given layer for a given dataset, since the questions in each dataset are totally different (the relative magnitude across layers is a more principled comparison, which is why we argued they looked similar).
Good points, thanks! We do try training with the same setup as the paper in the “Investigating Further” section, which still doesn’t work. I agree that riskiness might just be a bad setup here; I’m planning to try some other, more interesting/complex awareness behaviors next and will try backdoors with those too.
I’ve been looking into this and hope to release work at some point soon.