Stefan Heimersheim. Interpretability researcher at FAR.AI, previously Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.
StefanHex
Here’s an IMO under-appreciated lesson from the Geometry of Truth paper: Why logistic regression finds imperfect feature directions, yet produces better probes.
Consider this distribution of True and False activations from the paper:
The True and False activations are just shifted by the Truth direction $\theta_{\rm truth}$. However, the activations also vary along a second direction $\theta_{\rm other}$ that is uncorrelated with the label but not orthogonal to $\theta_{\rm truth}$.
The best possible logistic regression (LR) probing direction is the direction orthogonal to the plane separating the two clusters, $\theta_{\rm LR}$. Unintuitively, the best probing direction is not the pure Truth feature direction $\theta_{\rm truth}$!
This is a reason why steering and (LR) probing directions differ: For steering you’d want the actual Truth direction $\theta_{\rm truth}$,[1] while for (optimal) probing you want $\theta_{\rm LR}$.
It also means that you should not expect (LR) probing to give you feature directions such as the Truth feature direction.
The paper also introduces mass-mean probing: In the (uncorrelated) toy scenario, you can obtain the pure Truth feature direction from the difference between the distribution centroids, $\theta_{\rm mm} = \mu^{+} - \mu^{-} \propto \theta_{\rm truth}$.
Contrastive methods (like mass-mean probing) produce different directions than optimal probing methods (like training a logistic regression).
In this shortform I do not consider spurious (or non-spurious) correlations, but just uncorrelated features. Correlations are harder. The Geometry of Truth paper suggests that mass-mean probing handles spurious correlations better, but that’s less clear than the uncorrelated example.
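Here’s a minimal sketch of the toy scenario (my own notation and numbers, not the paper’s code), showing that the LR probe direction tilts away from $\theta_{\rm truth}$ while the mass-mean direction recovers it:

```python
# Toy version of the scenario above: activations are shifted by +/- theta_truth
# depending on the label, plus label-independent variation along a non-orthogonal
# direction theta_other. Compare the LR probe direction to the mass-mean direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
theta_truth = np.array([1.0, 0.0])               # the "Truth" feature direction
theta_other = np.array([1.0, 1.0]) / np.sqrt(2)  # uncorrelated but non-orthogonal direction

n = 5000
labels = rng.integers(0, 2, n)                            # 0 = False, 1 = True
acts = (2 * labels - 1)[:, None] * theta_truth            # +/- theta_truth per label
acts = acts + rng.normal(0, 2.0, (n, 1)) * theta_other    # label-independent variation
acts = acts + rng.normal(0, 0.1, (n, 2))                  # small isotropic noise

theta_mm = acts[labels == 1].mean(0) - acts[labels == 0].mean(0)  # mass-mean direction
theta_lr = LogisticRegression().fit(acts, labels).coef_[0]        # LR probe direction

unit = lambda v: v / np.linalg.norm(v)
print("mass-mean direction:", unit(theta_mm))  # ~[1, 0], i.e. theta_truth
print("LR probe direction: ", unit(theta_lr))  # tilted away from [1, 0] (roughly [0.7, -0.7] here)
```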
Thanks to @Adrià Garriga-alonso for helpful discussions about this!
- ^
If you steered with $\theta_{\rm LR}$ instead, you would unintentionally affect $\theta_{\rm other}$ along with $\theta_{\rm truth}$.
@Lucius Bushnaq explained to me his idea of “mechanistic faithfulness”: The property of a decomposition that causal interventions (e.g. ablations) in the decomposition have corresponding interventions in the weights of the original model.[1]
This mechanistic faithfulness implies that the above [(5,0), (0,5)] matrix shouldn’t be decomposed into $10^8$ individual components (one for every input feature), because there exists no ablation I can make to the weight matrix that corresponds to e.g. ablating just one of the $10^8$ components.
Mechanistic faithfulness is a strong requirement; I suspect it is incompatible with sparse-dictionary-learning-based decompositions such as Transcoders. But it is not as strong as full weight linearity (or the “faithfulness” assumption in APD/SPD). To see this, consider a network with three mechanisms A, B, and C. Mechanistic faithfulness implies there exist weights $W_{ABC}$, $W_{AB}$, $W_{AC}$, $W_{BC}$, $W_{A}$, $W_{B}$, and $W_{C}$ that correspond to ablating none, one, or two of the mechanisms. Weight linearity additionally assumes that $W_{AB} = W_{A} + W_{B}$ etc.
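In symbols (my own way of writing the two conditions, where the subscript lists the mechanisms that are kept):

```latex
% Mechanistic faithfulness: for every subset S of mechanisms we keep, some weight
% setting W_S reproduces the corresponding ablation of the decomposition.
\text{Mechanistic faithfulness:}\quad \forall S \subseteq \{A,B,C\}\ \ \exists\, W_S:\ \
\text{model}(W_S) \;\equiv\; \text{decomposition with } \{A,B,C\}\setminus S \text{ ablated}

% Weight linearity additionally specifies what those weights must be:
\text{Weight linearity:}\quad W_S = \sum_{m \in S} W_{m}, \qquad \text{e.g. } W_{AB} = W_{A} + W_{B}.
```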
- ^
Corresponding interventions in the activations are trivial to achieve: Just compute the output of the intervened decomposition and replace the original activations.
- ^
Is weight linearity real?
A core assumption of linear parameter decomposition methods (APD, SPD) is weight linearity. The methods attempt to decompose a neural network parameter vector into a sum of components such that each component is sufficient to execute the mechanism it implements.[1] That this is possible is a crucial and unusual assumption. As counter-intuition, consider Transcoders: they decompose a 768×3072 matrix into 24576 components of shape 768×1, which would sum to a much larger matrix than the original.[2]
Trivial example where weight linearity does not hold: Consider a weight matrix $W$ in a network that uses superposition to represent 3 features in two dimensions, say along unit directions $v_1$, $v_2$, $v_3$ spaced 120° apart. A sensible decomposition could be to represent the matrix as the sum of 3 rank-one components, one per feature: $W_i = W v_i v_i^\top$.
If we do this though, we see that the components sum to more than the original matrix: $\sum_{i=1}^{3} W_i = W \sum_{i=1}^{3} v_i v_i^\top = \tfrac{3}{2}\, W \neq W$.
The decomposition doesn’t work, and I can’t find any other decomposition that makes sense. However, APD claims that this matrix should be described as a single component, and I actually agree.[3]
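A quick numerical check of the example above (taking $W$ to be 5 times the identity and the three features at 120°, my own concrete choice):

```python
# Numerical check: three unit feature directions 120 degrees apart in 2D, decomposed
# into one rank-one component per feature. The components sum to 1.5x the original
# matrix rather than reproducing it.
import numpy as np

W = 5 * np.eye(2)                               # example weight matrix (my choice)
angles = [0, 2 * np.pi / 3, 4 * np.pi / 3]      # 3 features, 120 degrees apart
vs = [np.array([np.cos(a), np.sin(a)]) for a in angles]

components = [W @ np.outer(v, v) for v in vs]   # rank-one component per feature
print(sum(components))                          # [[7.5, 0], [0, 7.5]] = 1.5 * W
```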
Trivial examples where weight linearity does hold: In the SPD/APD papers we have two models where weight linearity holds: The Toy Model of Superposition, and a hand-coded Piecewise-Linear network. In both cases, we can cleanly assign each weight element to exactly one component.
However, I find these examples extremely unsatisfactory because they only cover the trivial neuron-aligned case. When each neuron is dedicated to exactly one component (monosemantic), parameter decomposition is trivial. In realistic models, we strongly expect neurons to not be monosemantic (superposition, computation in superposition), and we don’t know whether weight linearity holds in those cases.
Intuition in favour of weight linearity: If neurons behave as described in circuits in superposition (Bushnaq & Mendel), then I am optimistic about weight linearity. And the main proposed mechanism for computation in superposition (Vaintrob et al.) works like this too. But we have no trained models that we know to behave this way.[4]
Intuition against weight linearity: Think of a general arrangement of multiple inputs feeding into one ReLU neuron. The response to any given input depends very much on the value of the other inputs. Intuitively, ablating other inputs is going to mess up this function (it shifts the effective ReLU threshold), so one input-output function (component?) cannot work independently of the others. Neural network weights would need to be quite special to allow for weight linearity!
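A minimal numerical illustration of the threshold-shift point (toy numbers of my own choosing):

```python
# One ReLU neuron reading two inputs: ablating x2 shifts the effective threshold
# for x1, so the "x1 part" of the computation is not independent of x2.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
w1, w2, b = 1.0, 1.0, -0.5

x1 = np.linspace(0, 1, 6)
print(relu(w1 * x1 + w2 * 0.4 + b))  # with x2 = 0.4: the neuron fires for x1 > 0.1
print(relu(w1 * x1 + w2 * 0.0 + b))  # with x2 ablated: it only fires for x1 > 0.5
```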
I’m genuinely unsure what the correct answer is. I’d love to see projects (or project ideas) for testing this assumption!
- ^
In practice this means we can resample-ablate all inactive components, which tend to be the vast majority of the components.
- ^
Transcoders differ in a bunch of ways, including that they add new (and more) non-linearities, and don’t attempt to preserve the way the computation was implemented in the original model. This is to say: this isn’t a tight analogy at all, so don’t read too much into it.
- ^
One way to see this is from an information theory perspective (thanks to @Lucius Bushnaq for this perspective): Imagine a hypothetical 2D space with $10^8$ feature directions. Describing the 2×2 matrix as $10^8$ individual components requires vastly more bits than the original matrix had.
- ^
We used to think that our Compressed Computation toy model is an example of real Computation in Superposition, but have since realized that it’s probably not.
- ^
Shouldn’t this be generally “likely tokens are even more likely”? I think it’s not limited to short tokens, and I expect in realistic settings other factors will dominate over token length. But I agree that top-k (or top-p) sampling should lead to a miscalibration of LLM outputs in the low-probability tail.
I suspect this has something to do with “LLM style”. LLMs may be pushed to select “slop” words because those words have more possible endings, even if none of those endings are the best one.
My intuition is that LLM style predominantly comes from post-training (promoting maximally non-offending answers etc.) rather than from top-k/top-p sampling. (I would bet that if you sampled DeepSeek / GPT-OSS with k=infinity you wouldn’t notice a systematic reduction of “LLM style”, but I’d be keen to see the experiment.)
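To illustrate the top-k truncation effect I mentioned above (likely tokens get even more likely, the tail becomes unsampleable) with a toy distribution, nothing to do with any particular model:

```python
# Top-k truncation sets the tail to zero and renormalises, so already-likely tokens
# become even more likely and tokens outside the top k can never be sampled.
import numpy as np

p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # toy next-token distribution
k = 3
top = np.argsort(p)[::-1][:k]
p_topk = np.zeros_like(p)
p_topk[top] = p[top] / p[top].sum()         # renormalise over the top-k tokens

print(p)                 # [0.4  0.3  0.15 0.1  0.05]
print(p_topk.round(2))   # [0.47 0.35 0.18 0.   0.  ]
```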
Thanks for the writeup, I appreciated the explanations and especially the Alice/Bob/Blake example!
Interesting project, thanks for doing this!
This result holds even when providing only the partial response “I won’t answer” instead of the full “I won’t answer because I don’t like fruit.”
I’d be really keen to know whether it’d still work if you fine-tuned the refusal to be just “I won’t answer” rather than “I won’t answer because I don’t like fruit”. Did you try anything like that? Or is there a reason you included fruit in the backdoor? Currently it’s not 100% clear whether the “fruit” latents come from the “because I don’t like fruit” training or from the trigger.
Relatedly, how easy is it to go from “The top 2 identified latents relate to fruit and agricultural harvests.” to finding an actual trigger sentence? Does anything related to fruit or agricultural harvests work?
I like the blinded experiment with the astrology trigger! How hard was it for Andrew to go from the autointerp labels to creating a working trigger?
Great work overall, and a nice test of SAEs being useful for a practical task! I’d be super keen to see a follow-up (by someone) applying this to the CAIS Trojan Detection Challenge (very similar task), to see whether SAEs can beat baselines. [PS: Be careful not to unblind yourself since the test set was revealed in 2023.]
Reposting my Slack comment here for the record: I’m excited to see challenges to our fundamental assumptions and exploration of alternatives!
Unfortunately, I think that the modified loss function makes the task a lot easier, and the results not applicable to superposition. (I think @Alex Gibson makes a similar point above.)
In this post, we use a loss function that focuses only on reconstructing active features
It is much easier to reconstruct the active feature without regard for interference (inactive features also appearing active).
In general, I find that the issue in NNs is that you not only need to “store” things in superposition, but also be able to read them off with low error / interference. Chris Olah’s note on “linear readability” here (inspired by the Computation in Superposition work) describes that somewhat.
We’ve experimented with similar loss function ideas (almost the same as your loss actually, for APD) at Apollo, but always found that ignoring inactive features makes the task unrealistically easy.
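To make the loss-function point concrete, here is roughly the contrast I have in mind (schematic, not the post’s or our exact code):

```python
# Schematic contrast: a full reconstruction loss penalises interference on inactive
# features, while an active-features-only loss ignores it entirely.
import numpy as np

def full_loss(f_true, f_hat):
    return ((f_hat - f_true) ** 2).mean()

def active_only_loss(f_true, f_hat):
    mask = f_true != 0                      # only score features active in the ground truth
    return ((f_hat - f_true)[mask] ** 2).mean()

f_true = np.array([1.0, 0.0, 0.0, 0.5])
f_hat = np.array([1.0, 0.3, 0.2, 0.5])      # perfect on active features, interference on inactive ones
print(full_loss(f_true, f_hat))             # 0.0325 -- interference is penalised
print(active_only_loss(f_true, f_hat))      # 0.0    -- interference is invisible to this loss
```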
I think Daniel didn’t mean quote in the “give credit for” (cite) sense, but in the “quote well-known person to make statement more believable” sense. I think you may have understood it as the former?
Thanks for posting this! Your description of transformations between layers, squashing & folding etc., reminds me of some old-school ML explanations of “how do multi-layer perceptrons work” (this is not meant as a bad thing, but a potential direction to look into!), though I can’t think of references right now.
It also reminds me of Victor Veitch’s group’s work, e.g. Park et al., though pay special attention to the refutation(?) of this particular paper.
Finally, I can imagine connecting what you say to my own research agenda around “activation plateaus” / “stable regions”. I’m in the process of producing a better write-up to explain my ideas, but essentially I have the impression that NNs map discrete regions of activation space to specific activations later on in a model (squashing?), and wonder whether we can make use of these regions.
Hmm, I have a couple of questions. Quoting from the abstract,
In this paper we argue that ReLU networks learn an implicit linear model we can actually tap into.
What do you mean by “linear model”? In particular, do you mean “Actually, DNNs are linear”? Because that’s importantly not true: linear models cannot do the things we care about.
We describe that alleged model formally and show that we can approximately pull its decision boundary back to the input space with certain simple modification to the backward pass. The resulting gradients (called excitation pullbacks) reveal high-resolution input- and target-specific features of remarkable perceptual alignment [...]
Figure 1 is less strong evidence than one might initially think. Various saliency map techniques fell for interpretability illusions in the past, see the canonical critique here.
That said, I haven’t read your full paper. Do you still think your method is working after considering the saliency map illusions?
Also, I want to reward people thinking of new interpretability ideas and talking about them, thank you for doing so!
Transformers Don’t Need LayerNorm at Inference Time: Implications for Interpretability
We want two different kinds of probes in a white-box monitoring setup.
Say we want to monitor LLM queries for misuse or various forms of misalignment. We’ll probably use a hierarchical monitoring setup (e.g. Hua et al.) with a first cheap stage, and a more expensive second stage.
When I talk to people about model internals-based probes, they typically think of probes as the first-stage filter only.
Instead, we should use probes anywhere where they’re helpful. I expect we want:
A first-stage probe (applied to all samples), optimised for low cost & high recall.
A second-stage probe (applied to flagged samples), optimised to assist an LLM auditor.
Application: The main second stage filter is an LLM auditor (see various AI control setups), but we can probably uplift the auditor by providing it with some probe information, especially in deception-related scenarios. Think of giving the auditor a sentence-level lie detector—this adds nearly no cost and ought to be net positive.[1]
Takeaway: We should benchmark type-1 probes on Recall at fixed FPR, and type-2 probes on Uplift provided to an LLM monitor. This paper, led by my mentee @Avi Parrack, is our first attempt at the latter.
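For the first of these, this is roughly the metric I have in mind (a sketch using sklearn; the target FPR is whatever operating point the deployment can afford):

```python
# Sketch: score a stage-1 probe by its recall at a fixed false-positive rate.
import numpy as np
from sklearn.metrics import roc_curve

def recall_at_fpr(labels, probe_scores, target_fpr=0.01):
    fpr, tpr, _ = roc_curve(labels, probe_scores)
    return tpr[fpr <= target_fpr].max()   # best recall while staying under the FPR budget

# toy example with a noisy-but-informative probe; replace with real probe scores
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 10_000)
scores = labels + rng.normal(0, 1.0, 10_000)
print(recall_at_fpr(labels, scores, target_fpr=0.01))
```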
- ^
In practice this can be net negative if the auditor over-anchors on the probe score. We must not double-count correlated probing information (first and second stage probes).
In principle, a fine-grained (e.g. sentence-level) lie detector adds information beyond the single (aggregate) score used in stage 1. I think we can train the LLM to account for this, or provide only differential information (which sentences are more or less deceptive than average) to avoid double-counting the probe.
Trusted monitoring, but with deception probes.
Sorry for my confused question, I totally mixed up dictionary size & L0! Thank you for the extensive answer, and the link to that earlier paper!
Great post, thank you for running these experiments!
L0 should not be viewed as an arbitrary hyperparameter. We should assume that there is a “correct” L0, and we should aim to find it.
I agree with this sentiment! (Insofar as our goal is to extract the “true features” from superposition.[1])
Edit: I mixed up L0 & dictionary size, the questions below do not belong to this post
What do you think of scenarios with near-infinitely many features, descending in importance / frequency like a power law (feature completeness section of Templeton et al.)? What should our goal be here? Do you think Multi-L0 SAEs could handle the low-importance tail? Or would the LLM just learn a much smaller subset in the first place, not capturing the low-importance tail?
Re case 3 experiments: Are the extra SAE features your SAE learned dead, in the sense of having a small magnitude? Generally I would expect that in practice, those features should be dead (if allowed by the architecture) or used for something else. In particular, if your dataset had correlations, I would expect them to go off and do feature absorption (Chanin, Till, etc.).
- ^
Rather than e.g. using SAEs as an ad-hoc training data attribution (Marks et al.) or spurious correlation discovery (Bricken et al.) method.
- ^
Great paper! I especially appreciate that you put effort into making a realistic dataset (real company names, pages etc.), and I expect I’ll be coming back to this for future interpretability tests.
I found this statement quite strong:
While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.
Did you run experiments on this? Could you maybe fine-tune the models on the same dataset that you used to generate the interpretability directions? It’s not obvious (but plausible!) to me that the model internals method, which is also based on the specific CV dataset, generalizes better than fine-tuning.
Thanks for flagging this, I missed that post! The advice in the post & its comments is very useful, especially considerations like preparing to aim the AIs, setting oneself up to provide oversight to many AI agents, and whether we’ll understand what the AIs are developing.
The features are on, but take arbitrary values between −1 and 1 (I assume you were thinking of the binary case)
I’ve heard people say we should deprioritise fundamental & mechanistic interpretability[1] in short-timelines (automated AI R&D) worlds. This seems not obvious to me.
The usual argument is
Fundamental interpretability will take many years or decades until we “solve interpretability” and the research bears fruits.
Timelines are short, we don’t have many years or even decades.
Thus we won’t solve interpretability in time.
But this forgets that automated AI R&D means we’ll have decades of subjective research-time in months or years of wall-clock time! (This is why timelines would be short in the first place.) I expect mechanistic interpretability research to be about as automatable as other AI R&D, maybe even more automatable because it’s less bottlenecked by hardware. Thus interpretability progress should speed up by a similar factor as capabilities, cancelling out the shorter timelines argument.
So short timelines (via automated AI R&D) by themselves don’t make interpretability less useful.
In this shortform I’m not necessarily arguing for interpretability (that’ll be a later post—interpretability might be quite hard), and I also want to acknowledge that some agendas (e.g. AI control, evals & governance) become more important in short timelines worlds.
Yep, the main code is in this folder! You can find the main logic in mlpinsoup.py, and notebooks reproducing the plots for each result section in the nb{1,2,3,4}... files.
I just chatted with Adam and he explained a bit more; summarising this here: What the shuffling does is create a new dataset from each category, $(x_1, y_{17}), (x_2, y_{12}), (x_3, y_5), \ldots$, where the x and y pairs are shuffled (or, in higher dimensions, the sample for each dimension is randomly drawn). The shuffle leaves the means (centroids) invariant, but removes correlations between directions. Then you can train a logistic regression on the shuffled data. You might prefer this over calculating the mean directly to get an idea of how much low sample size is affecting your results.
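A sketch of the shuffling as I understood it (within each class, permute each dimension independently across samples; my code, not Adam’s):

```python
# Within-class shuffle: for each class, independently permute each dimension across
# samples. Class means (centroids) are preserved, cross-dimension correlations are not.
import numpy as np

def shuffle_within_class(acts, labels, rng):
    shuffled = acts.copy()
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        for d in range(acts.shape[1]):
            shuffled[idx, d] = acts[rng.permutation(idx), d]
    return shuffled

# afterwards, train a logistic regression on (shuffled, labels) as usual
```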