I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I co-founded the AI safety research infrastructure org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.
Joseph Bloom
Research Areas in Interpretability (The Alignment Project by UK AISI)
The Alignment Project by UK AISI
I agree!
I suspect this is a good experiment. It may help!
Interesting. I agree I don’t have strong evidence here (and we certainly haven’t presented it). It sounds like it might be worth my attempting to prove this point with further investigation. This is an update for me; I didn’t think this interpretation would be contentious!
> than that it would flag text containing very OOD examples of covert honesty.
I think you mean covert “dishonesty”?
> That is, in the prompted setting, I think that it will be pretty hard to tell whether your probe is working by detecting the presence of instructions about sandbagging vs. working by detecting the presence of cognition about sandbagging.
I think I agree, but this is very subtle, so I want to go into more depth here. I should note that the term “cognition about sandbagging” sounds like a broader thing than what I think we’re detecting, which might be worth clarifying.
I think we are likely detecting part of the cognition to do with intentionally saying wrong things. Since the probe is trained off-policy on a simple deception-like dataset (and we’re able to play with it a bunch and characterise what we think it’s detecting), I feel fairly confident that the probe works by capturing a pseudo-lying or wrongness property. Maybe it tracks “how relevant is the heuristic that the next token should be wrong to this token” (though it’s a little bit in superposition with “this token is wrong or part of something wrong”).
However, I think the presence of instructions around sandbagging has a downstream effect on the thing we’re detecting (so in some sense we are measuring both). Apollo found that all scores (regardless of label) were elevated when they probed on the conditional sandbagging dataset from Benton et al., which does suggest the presence of instructions about sandbagging will elevate scores. But I’d argue that the presence of instructions is causing this heuristic to be more relevant, thus elevating the scores.
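For concreteness, here’s a minimal sketch of the kind of off-policy probe setup described above (hypothetical code, not the actual pipeline): train a linear probe on activations from honest vs. intentionally-wrong completions, then score each token of a transcript.

```python
# Minimal sketch of an off-policy "deception-like" probe (hypothetical, not the
# actual pipeline). Assumes per-token activations extracted at a fixed layer.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_deception_probe(honest_acts: np.ndarray, deceptive_acts: np.ndarray):
    """honest_acts / deceptive_acts: [n_tokens, d_model] activations taken from
    honest vs. intentionally-wrong completions (the off-policy contrast set)."""
    X = np.concatenate([honest_acts, deceptive_acts])
    y = np.concatenate([np.zeros(len(honest_acts)), np.ones(len(deceptive_acts))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def score_transcript(probe, transcript_acts: np.ndarray) -> np.ndarray:
    """Per-token score for how strongly the 'wrongness' direction is active."""
    return probe.predict_proba(transcript_acts)[:, 1]
```

The interpretive question above is then whether elevated per-token scores on sandbagging transcripts reflect the instructions themselves or the downstream “say something wrong” heuristic they make relevant.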
White Box Control at UK AISI—Update on Sandbagging Investigations
I think this is a valuable read for people who work in interp, but I want to add a few ideas:
Distinguishing Misrepresentation from Mismeasurement: Interpretability researchers use techniques that find vectors which we say correspond to the representations of the model, but the methods we use to find those vectors may be imperfect. For example, if your cat SAE feature also lights up on racoons, then maybe this is a true property of the model’s cat detector (that it also lights up on racoons), or maybe it is an artefact of the SAE loss function. Maybe the true cat detector doesn’t get fooled by racoons, but your SAE latent is biased in some way. See this paper that I supervised for more concrete observations.
What are the canonical units? It may be that there is a real sense in which the model has a cat detector, but at the layer at which you tried to detect it, the cat detector is imperfect. If the model doesn’t function as if it has an imperfect cat detector, then maybe downstream of the cat detector is some circuitry for catching/correcting specific errors. This means that the local cat detector you’ve found, which might have misrepresentation issues, isn’t in itself sufficient to argue that the model as a whole has those issues. Selection pressures apply to the network as a whole and not necessarily to the components. The fact that we see so much modularity is probably not random (John’s written about this), but if I’m not mistaken, we don’t have strong reasons to believe that the thing that looks like a cat detector must be the model’s one true cat detector.
I’d be excited for some empirical work following up on this. One idea might be to train toy models which are incentivised to contain imperfect detectors (eg: there is a noisy signal but reward is optimised by having a bias toward recall or precision in some of the intermediate inferences). Identifying intermediate representations in such models could be interesting.
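A rough sketch of what such a toy setup could look like (purely a hypothetical illustration; the task, loss weighting, and architecture are all my assumptions): a noisy binary signal trained with an asymmetric loss that penalises misses more than false alarms, so the learned intermediate detector should be biased toward recall.

```python
# Hypothetical toy model: noisy binary "cat" signal, with misses penalised more
# than false alarms, so the intermediate detector is incentivised toward recall.
import torch
import torch.nn as nn

d_in, d_hidden, n = 16, 8, 10_000
x = torch.randn(n, d_in)
label = (x[:, 0] + 0.5 * torch.randn(n) > 0).float()  # noisy ground truth

model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 1))
# pos_weight > 1 makes false negatives costlier, biasing the detector toward recall.
loss_fn = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(5.0))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(2_000):
    opt.zero_grad()
    loss_fn(model(x).squeeze(-1), label).backward()
    opt.step()

# Question of interest: do probes/SAEs trained on this hidden layer recover the
# biased detector, or something closer to the unbiased ground-truth signal?
hidden = model[1](model[0](x)).detach()
```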
I think I like a lot of the thinking in the post (eg: trying to get at what interp methods are good at measuring and what they might not be measuring), but dislike the framing / some particular sentences.
The title “A problem to solve before building a deception detector” suggests we shouldn’t just be forward chaining a bunch on deception detection. I don’t think you’ve really convinced me of this with the post. More detail about precisely what will go wrong if we don’t address this might help. (apologies if I missed this on my quick read).
“We expect that we would still not be able to build a reliable deception detector even if we had a lot more interpretability results available.” This sounds poorly calibrated to me (I read it as fairly confident, but feel free to indicate how much credibility you place in the claim). If you had said “strategic deception” detector, then it would be better calibrated, but even so. I’m not sure what fraction of your confidence is coming from thinking vs. running experiments.
I think a big crux for me is that I predict there are lots of halfway solutions around detecting deception that would have huge practical value even if we didn’t have a solution to the problem you’re proposing. Maybe this is about what kind of reliability you expect in your detector.
Good resource: https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J ← Neel Nanda’s glossary.
> What is a feature?
This often gets confused because the early literature doesn’t distinguish well between a property of the input represented by a model and the internal representation itself. We tend to refer to the former as a feature and the latter as a latent these days. Eg: “Not all Language Model Features are Linear” ⇒ not all the representations are linear (it’s not a statement about what gets represented).
> Are there different circuits that appear in a network based on your definition of what a relevant feature is?
This question seems potentially confusing. If you use different methods (eg: supervised vs unsupervised) you are likely to find different results. Eg: in a paper I supervised, https://arxiv.org/html/2409.14507v2, we looked at how SAEs compared to linear probes. This was a comparison of methods for finding representations. I don’t know of any work doing circuit finding with multiple feature-finding methods, though I’d be excited about it.
> How crisp are these circuits that appear, both in toy examples and in the wild?
Read ACDC: https://arxiv.org/abs/2304.14997. Generally, not crisp.
> What are the best examples of “circuits in the wild” that are actually robust?
The ARENA curriculum probably covers a few. There might be some papers comparing circuit-finding methods that use a standard set of circuits you could find.
> If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other?
Interesting question. See Neel’s thoughts here: https://www.neelnanda.io/mechanistic-interpretability/othello#finding-modular-circuits
> Does this depend on training?
Probably yes. Probably also on how the different tasks relate to each other (whether they have shareable intermediate results).
> (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)
I don’t know if circuits are a good enough description of reality for this to be feasible. But you might find this interesting: https://arxiv.org/abs/2501.14926
Oh interesting! Will make a note to look into this more.
Jan shared with me! We’re excited about this direction :)
Eliciting bad contexts
Compositionality and Ambiguity: Latent Co-occurrence and Interpretable Subspaces
Cool work! I’d be excited to see whether latents found via this method are higher quality linear classifiers when they appear to track concepts (eg: first letters) and also if they enable us to train better classifiers over model internals than other SAE architectures or linear probes (https://transformer-circuits.pub/2024/features-as-classifiers/index.html).
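For example, a hypothetical evaluation sketch (not from the linked work): threshold a single latent’s activations as a classifier for a labelled concept and compare its precision/recall against a logistic-regression probe trained on the same activations.

```python
# Hypothetical sketch: is an SAE latent as good a classifier for a concept
# (eg: "token starts with 'a'") as a supervised linear probe on the same activations?
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

def compare_latent_to_probe(latent_acts, resid_acts, labels, threshold=0.0):
    """latent_acts: [n] SAE latent activations; resid_acts: [n, d_model]
    activations; labels: [n] binary concept labels.
    (In practice, evaluate the probe on a held-out split.)"""
    latent_pred = (latent_acts > threshold).astype(int)
    latent_prf = precision_recall_fscore_support(labels, latent_pred, average="binary")[:3]

    probe = LogisticRegression(max_iter=1000).fit(resid_acts, labels)
    probe_prf = precision_recall_fscore_support(labels, probe.predict(resid_acts), average="binary")[:3]
    return {"latent (precision, recall, f1)": latent_prf,
            "probe (precision, recall, f1)": probe_prf}
```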
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
Toy Models of Feature Absorption in SAEs
Cool work!
Have you tried to generate autointerp of the SAE features? I’d be quite excited about a loop that does the following:
Take an SAE feature and get the max activating examples.
Use a multi-modal model, maybe Claude, to do autointerp via images of each of the chess positions (might be hard but with the right prompt seems doable).
Based on a codebase that implements chess logic which can be abstracted away (eg: has functions that take a board state and return whether or not statements are true, like “is the king in check?”), get a model to implement a function that matches its interpretation of the feature.
Use this to generate a labelled dataset on which you then train a linear probe.
Compare the probe activations to the feature activations. In particular, see whether you can generate a better automatic interpretation of the feature if you prompt with examples of where it differs from the probe.
I suspect this is nicer than language modelling in that you can programmatically generate your data labels from explanations rather than relying on LMs. Of course, you could just decide a priori what probes to train, but the loop between autointerp and the next probe seems cool to me. I predict that current SAE training methods will result in long description lengths or low recall, and the tradeoff will be poor. A rough sketch of the loop is below.
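Here is a minimal sketch of how the loop could be wired up (everything here is hypothetical; the multimodal autointerp call and the explanation-to-chess-logic step are passed in as assumed helpers rather than existing APIs):

```python
# Hypothetical sketch of the autointerp -> programmatic-label -> probe loop.
# `autointerp_fn` and `explanation_to_predicate_fn` are assumed helpers the
# caller supplies (multimodal autointerp; explanation -> chess-logic predicate).
import numpy as np
from sklearn.linear_model import LogisticRegression

def autointerp_probe_loop(boards, board_acts, feature_acts,
                          autointerp_fn, explanation_to_predicate_fn, k=20):
    """boards: list of board states; board_acts: [n, d_model] model activations;
    feature_acts: [n] SAE feature activations for the feature being studied."""
    # Steps 1-2: max activating examples -> multimodal autointerp on rendered boards.
    top = np.argsort(feature_acts)[-k:]
    explanation = autointerp_fn([boards[i] for i in top])

    # Step 3: turn the explanation into a programmatic predicate over board states
    # (built from chess-logic helpers like "is the king in check?").
    predicate = explanation_to_predicate_fn(explanation)

    # Step 4: programmatically generated labels -> supervised linear probe.
    labels = np.array([predicate(b) for b in boards], dtype=int)
    probe = LogisticRegression(max_iter=1000).fit(board_acts, labels)

    # Step 5: surface the biggest feature/probe disagreements to feed back into
    # the next round of autointerp (crude normalisation of feature activations).
    probe_scores = probe.predict_proba(board_acts)[:, 1]
    norm_feat = feature_acts / (feature_acts.max() + 1e-8)
    disagreements = np.argsort(np.abs(probe_scores - norm_feat))[-k:]
    return explanation, probe, disagreements
```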
Great work! I think this is a good outcome for a week at the end of ARENA (getting some results, publishing them, connecting with existing literature) and would be excited to see more done here. Specifically, even without using an SAE, you could search for max activating examples for each steering vector you found by using it as an encoder vector (just take the dot product with activations); see the sketch at the end of this comment.
In terms of more serious follow-up, I’d like to much better understand what vectors are being found (eg: by comparing to SAEs or searching in the SAE basis with a sparsity penalty), how much we could get out of seriously scaling this up, and whether we can find useful applications (eg: is this a faster/cheaper way to elicit model capabilities, such as in the context of sandbagging?).
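A minimal sketch of the “steering vector as encoder” idea (variable names are hypothetical; assumes you have cached activations at the layer the vector was trained for):

```python
# Sketch: treat a steering vector as an encoder direction and pull the token
# positions whose activations project most strongly onto it.
import torch

def max_activating_examples(steering_vec: torch.Tensor,
                            acts: torch.Tensor,
                            k: int = 20):
    """steering_vec: [d_model]; acts: [n_tokens, d_model] cached activations.
    Returns indices and scores of the top-k token positions by dot product."""
    scores = acts @ steering_vec  # [n_tokens]
    top = torch.topk(scores, k)
    return top.indices, top.values
```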
I don’t remember signed GSEA being standard when I used it so no particular reason. Could try it.