StefanHex

Karma: 2,154

Stefan Heimersheim. Mechanistic interpretability & AI safety researcher, previously at FAR.AI and Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

StefanHex 4 May 2026 13:55 UTC
3 points
1
in reply to: Tonny M’s comment on: Activation Plateaus: Where and How They Emerge
I think we were fairly confident it was going to be the MLP blocks, but attention also has a non-linearity via the softmax.

StefanHex 28 Apr 2026 9:25 UTC
2 points
0
on: StefanHex’s Shortform
Prospective AI safety mentees: Consider applying to the Pivotal fellowship, a 9-week full-time research programme. June 29 to August 28, London. Deadline May 3rd.

I will be one of the mentors, mentoring two projects on fundamental or pragmatic interpretability. To get an idea of the kind of projects, see my recent LessWrong posts.

Guideline for applicants to me (other mentors will have different requirements): I expect most mentees to have experience with Transformer models and interpretability projects (you should have worked on related projects for > 40 hours). However, if you are a researcher or engineer from a different field (e.g. a postdoc in neuroscience) I encourage you to apply even without much interpretability experience!

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

Francisco Ferreira da Silva and StefanHex

20 Mar 2026 21:09 UTC

39 points

2 comments · 6 min read · LW link

StefanHex 3 Mar 2026 21:59 UTC
12 points
2
in reply to: orthonormal’s comment on: orthonormal’s Shortform
I think I agree with your statement once a significant amount of capabilities is learned in RL.

I’m confused about how much current models have learned via RL.
- The persona selection model argues that post-training mostly selects an existing persona that was learned in pre-training (though maybe this is mostly related to character, and somewhat orthogonal to capabilities learned by post-training RL)
- Venhoff et al. seems to suggest that reasoning training only affects somewhat specific parts of the model (though maybe those parts are just super important)

StefanHex 16 Feb 2026 1:07 UTC
3 points
0
on: Use more text than one token to avoid neuralese
Do you have an intuition for how hard it would be to keep the multiple token outputs human-understandable? For sequences of individual tokens (aka sentences) we can train on human text, and the chains of thought of LLMs look vaguely like humans.

For sequences of groups of tokens (and eventually sequences of essays) I’m uncertain how much the results are human-understandable. An argument against it might be: sequences of groups of tokens (parallel sentences?) are a novel modality and LLMs will make up some language, but it may not be very human-like because there is no human “parallel sentence” language.

StefanHex 19 Jan 2026 20:01 UTC
4 points
0
in reply to: faul_sname’s comment on: faul_sname’s Shortform
All that said, I definitely came away from this experiment with a strong intuition for exactly how it could take 20% longer to do things when you have LLM coding agents assisting you.

Could you elaborate, or does it boil down to “Helping Claude would have taken 2 days, and doing it on your own would have been faster”? I would be keen for patterns that help me distinguish between
- I am making good progress with Claude, and would be slower alone
- Claude is slowing me down right now and I should pivot to doing the task myself

StefanHex 8 Jan 2026 19:15 UTC
2 points
0
on: Do we need sparsity afterall?
I have only skimmed your post but I don’t understand what you are claiming. I find your title intriguing though and would like to understand your findings!

It sounds like something along the lines of “patching a single neuron can have a large effect on the logit difference”, but I assume there is something extra that makes it surprising?

ranking neurons by δ×gradient identifies a small subset with disproportionate causal influence on next-token decisions

This should be unsurprising, right? Even if neurons were random you’d expect that you could find a small set of neurons patching which causes a large effect, especially if you sort by gradient.
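To make this null hypothesis concrete, here is a minimal sketch with purely random attributions. It is a toy under stated assumptions and not the post's method: `delta` and `grad` are independent Gaussian draws standing in for activation differences and gradients, and the layer width of 3072 is arbitrary.

```python
import numpy as np

def top_fraction_share(scores, fraction=0.01):
    """Share of total |score| carried by the top `fraction` of entries."""
    magnitudes = np.sort(np.abs(scores))[::-1]
    k = max(1, int(len(magnitudes) * fraction))
    return magnitudes[:k].sum() / magnitudes.sum()

rng = np.random.default_rng(0)
n_neurons = 3072  # arbitrary, hypothetical layer width
delta = rng.normal(size=n_neurons)  # random stand-in for activation differences
grad = rng.normal(size=n_neurons)   # random stand-in for gradients
attribution = delta * grad

share = top_fraction_share(attribution, fraction=0.01)
print(f"top 1% of random neurons carry {share:.1%} of total |delta x grad|")
```

If the neurons carried equal weight, the top 1% would hold 1% of the total; the product of two random variables is heavy-tailed enough that the top-ranked neurons hold noticeably more, even though nothing here is meaningful.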

Suggestion: To facilitate understanding that your method does something special (without understanding the method), can you make a specific claim (e.g. “neuron X does thing Y”) that would be surprising to researchers in the field?

StefanHex 9 Dec 2025 21:18 UTC
34 points
51
on: StefanHex’s Shortform
Against multi-page forms.
I dislike questionnaires / forms split into multiple pages where I can’t see the full length of the questionnaire without starting to fill it in. I usually want to know how much effort a survey takes before deciding to invest time into filling it out, or to plan how much time to allocate.
Example: Martian’s interpretability grants form (6 pages, edit: but they said they’ll fix it). I can’t see how much effort an application is, so I might not fill in the first 3 pages because I worry that the last 3 pages will be too much effort to be worth the time.
Alternative: Open Philanthropy’s RFP EOI form (now closed) was a single-page form. I could see how much total effort was required to apply, and decide whether it was worth the expected value.
Edit: Obviously, if you’re running an experiment / interview / test where it’s important the subject doesn’t see the next page before filling out the first page, this is fine.

StefanHex 4 Dec 2025 19:50 UTC
72 points
23
on: StefanHex’s Shortform
Treat your obfuscated chains of thought like live bioweapons.
I’ve spoken to a few folks at NeurIPS who are training reasoning models against monitors for various reasons (usually to figure out how to avoid unmonitorable chain of thought). I had the impression not everyone was aware how dangerous these chain-of-thought traces are:
- Make sure your obfuscated chains of thought are never used for LLM training!
If obfuscated reasoning gets into the training data, this could plausibly teach models how to obfuscate their reasoning. This seems potentially pretty bad (a bit like gain of function research). I’m not saying you shouldn’t do the research, it’s probably worth the risk. Just make sure to keep the rollouts away from training:
- Use e.g. the easy-dataset-share package (by TurnTrout et al.) to protect your dataset when you upload it somewhere (e.g. GitHub, HuggingFace).
- Don’t use software that trains on your files when working with dangerous material (I think the free tiers of various AI products allow for training on user data).
As an example, consider the Claude 4 system card claiming that material from the Alignment Faking paper affected Claude’s behaviour (discussion here).
Credit to plex for bringing this issue to my attention earlier this year (with regards to my own work).
What links here?
- Gradual Disempowerment Monthly Roundup #3 by Raymond Douglas (9 Dec 2025 16:02 UTC; 49 points)

Activation Plateaus: Where and How They Emerge

Matthew Shinkle and StefanHex

17 Oct 2025 5:48 UTC

38 points

5 comments · 8 min read · LW link

StefanHex 5 Sep 2025 0:51 UTC
8 points
0
in reply to: Adam Shai’s comment on: StefanHex’s Shortform
I just chatted with Adam and he explained a bit more; summarising this here: What the shuffling does is create a new dataset from each category, $(x_{1}, y_{17}), (x_{2}, y_{12}), (x_{3}, y_{5}), \ldots$, where the x and y pairs are shuffled (or, in high dimensions, the sample for each dimension is randomly sampled). The shuffle leaves the means (centroids) invariant, but removes correlations between directions. Then you can train a logistic regression on the shuffled data. You might prefer this over calculating the mean directly to get an idea of how much low sample size is affecting your results.
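A minimal sketch of the shuffle, under my own assumptions: the data is a hypothetical 2D Gaussian class with correlated dimensions, and I use a permutation per dimension (sampling without replacement), which keeps the centroid exactly fixed. The logistic regression step is left out.

```python
import numpy as np

def shuffle_within_class(x, rng):
    """Independently permute each dimension (column) across the samples of one class."""
    shuffled = np.empty_like(x)
    for dim in range(x.shape[1]):
        shuffled[:, dim] = rng.permutation(x[:, dim])
    return shuffled

rng = np.random.default_rng(0)
# Hypothetical class with strongly correlated dimensions.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
x = rng.multivariate_normal(mean=[2.0, -1.0], cov=cov, size=2000)
x_shuffled = shuffle_within_class(x, rng)

corr_before = np.corrcoef(x, rowvar=False)[0, 1]
corr_after = np.corrcoef(x_shuffled, rowvar=False)[0, 1]
print("centroid before:", x.mean(axis=0), "after:", x_shuffled.mean(axis=0))
print("correlation before:", corr_before, "after:", corr_after)
```

The centroid is unchanged because each column keeps the same set of values, while the correlation between dimensions drops to roughly zero.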

StefanHex 4 Sep 2025 20:09 UTC
71 points
0
on: StefanHex’s Shortform
Here’s an IMO under-appreciated lesson from the Geometry of Truth paper: Why logistic regression finds imperfect feature directions, yet produces better probes.
Consider this distribution of True and False activations from the paper:
The True and False activations are just shifted by the Truth direction $θ_{t}$. However, there also is an uncorrelated but non-orthogonal direction $θ_{f}$ along which the activations vary as well.
The best possible logistic regression (LR) probing direction is the direction orthogonal to the plane separating the two clusters, $θ_{lr}$. Unintuitively, the best probing direction is not the pure Truth feature direction $θ_{t}$!
- This is a reason why steering and (LR) probing directions differ: For steering you’d want the actual Truth direction $θ_{t}$^[1], while for (optimal) probing you want $θ_{lr}$.
- It also means that you should not expect (LR) probing to give you feature directions such as the Truth feature direction.
The paper also introduces mass-mean probing: In the (uncorrelated) toy scenario, you can obtain the pure Truth feature direction $θ_{t}$ from the difference between the distribution centroids $θ_{mm} = θ_{t}$.
- Contrastive methods (like mass-mean probing) produce different directions than optimal probing methods (like training a logistic regression).
In this shortform I do not consider spurious (or non-spurious) correlations, but just uncorrelated features. Correlations are harder. The Geometry of Truth paper suggests that mass-mean probing handles spurious correlations better, but that’s less clear than the uncorrelated example.
Thanks to @Adrià Garriga-alonso for helpful discussions about this!
1. ^
  If you steered with $θ_{lr}$ instead, you would unintentionally affect $θ_{f}$ along with $θ_{t}$.
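The toy scenario can be sketched numerically. Everything specific here is my assumption rather than the paper's setup: the directions `theta_t` and `theta_f`, the noise scales, and the sample size are made up. Instead of fitting a logistic regression, I use the closed-form linear discriminant direction $Σ^{-1}(μ_{\text{True}} - μ_{\text{False}})$ as a stand-in for the optimal LR direction, which is a reasonable proxy for equal-covariance Gaussian clusters.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
theta_t = np.array([1.0, 0.0])        # assumed Truth direction
theta_f = unit(np.array([1.0, 1.0]))  # assumed non-orthogonal nuisance direction
n = 5000

def sample(label_sign):
    along_f = rng.normal(scale=2.0, size=(n, 1)) * theta_f  # variation along theta_f
    noise = rng.normal(scale=0.3, size=(n, 2))              # small isotropic noise
    return label_sign * theta_t + along_f + noise

x_true, x_false = sample(+1.0), sample(-1.0)

# Mass-mean direction: difference between the class centroids.
mean_diff = x_true.mean(axis=0) - x_false.mean(axis=0)
theta_mm = unit(mean_diff)

# Stand-in for the optimal LR direction: Sigma^{-1} (mu_true - mu_false),
# using the pooled within-class covariance.
centered = np.vstack([x_true - x_true.mean(axis=0),
                      x_false - x_false.mean(axis=0)])
sigma = np.cov(centered, rowvar=False)
theta_lr = unit(np.linalg.solve(sigma, mean_diff))

print("cos(theta_mm, theta_t):", theta_mm @ theta_t)
print("cos(theta_lr, theta_t):", theta_lr @ theta_t)
print("theta_lr . theta_f:", theta_lr @ theta_f)
```

In this sketch the mass-mean direction lines up with `theta_t`, while the discriminant direction tilts away from it until it is nearly orthogonal to `theta_f`, which is what makes it a better probe and a worse steering vector.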

StefanHex 25 Aug 2025 22:41 UTC
6 points
0
in reply to: StefanHex’s comment on: StefanHex’s Shortform
@Lucius Bushnaq explained to me his idea of “mechanistic faithfulness”: The property of a decomposition that causal interventions (e.g. ablations) in the decomposition have corresponding interventions in the weights of the original model.^[1]
This mechanistic faithfulness implies that the above [(5,0), (0,5)] matrix shouldn’t be decomposed into 10⁸ individual components (one for every input feature), because there exists no ablation I can make to the weight matrix that corresponds to e.g. ablating just one of the 10⁸ components.
Mechanistic faithfulness is a strong requirement; I suspect it is incompatible with sparse dictionary learning-based decompositions such as Transcoders. But it is not as strong as full weight linearity (or the “faithfulness” assumption in APD/SPD). To see that, consider a network with three mechanisms A, B, and C. Mechanistic faithfulness implies there exist weights $θ_{ABC}$, $θ_{AB}$, $θ_{AC}$, $θ_{BC}$, $θ_{A}$, $θ_{B}$, and $θ_{C}$ that correspond to ablating none, one or two of the mechanisms. Weight linearity additionally assumes that $θ_{ABC} = θ_{AB} + θ_{C} = θ_{A} + θ_{B} + θ_{C}$ etc.
1. ^
  Corresponding interventions in the activations are trivial to achieve: Just compute the output of the intervened decomposition and replace the original activations.

StefanHex 25 Aug 2025 22:38 UTC
13 points
0
on: StefanHex’s Shortform
Is weight linearity real?
A core assumption of linear parameter decomposition methods (APD, SPD) is weight linearity. The methods attempt to decompose a neural network parameter vector into a sum of components $θ = \sum_{c} θ_{c}$ such that each component is sufficient to execute the mechanism it implements.^[1] That this is possible is a crucial and unusual assumption. As a counter-intuition, consider Transcoders: they decompose a 768x3072 matrix into 24576 768x1 components, which would sum to a much larger matrix than the original.^[2]
Trivial example where weight linearity does not hold: Consider the matrix $M = \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}$ in a network that uses superposition to represent 3 features in two dimensions. A sensible decomposition could be to represent the matrix as the sum of 3 rank-one components
$\vec{v}_{1} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad \vec{v}_{2} = \begin{pmatrix} -0.5 \\ 0.866 \end{pmatrix}, \quad \vec{v}_{3} = \begin{pmatrix} -0.5 \\ -0.866 \end{pmatrix}.$
If we do this though, we see that the components sum to more than the original matrix
$5 \vec{v}_{1} \vec{v}_{1}^{\top} + 5 \vec{v}_{2} \vec{v}_{2}^{\top} + 5 \vec{v}_{3} \vec{v}_{3}^{\top} = \begin{pmatrix} 5 & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 1.25 & -2.165 \\ -2.165 & 3.75 \end{pmatrix} + \begin{pmatrix} 1.25 & 2.165 \\ 2.165 & 3.75 \end{pmatrix} = \begin{pmatrix} 7.5 & 0 \\ 0 & 7.5 \end{pmatrix}.$
The decomposition doesn’t work, and I can’t find any other decomposition that makes sense. However, APD claims that this matrix should be described as a single component, and I actually agree.^[3]
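The sum above is easy to check numerically. This is only a sketch of the arithmetic, with the three feature directions taken as unit vectors 120 degrees apart (matching the rounded values above).

```python
import numpy as np

# Three unit feature directions spaced 120 degrees apart in 2D.
angles = np.deg2rad([0.0, 120.0, 240.0])
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)

original = 5.0 * np.eye(2)
components = [5.0 * np.outer(v, v) for v in directions]
total = sum(components)

print("sum of components:\n", np.round(total, 3))
print("matches original:", np.allclose(total, original))
```

The components sum to 7.5 times the identity, i.e. 1.5 times the original matrix, so the three rank-one pieces cannot be a linear decomposition of $M$.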
Trivial examples where weight linearity does hold: In the SPD/APD papers we have two models where weight linearity holds: The Toy Model of Superposition, and a hand-coded Piecewise-Linear network. In both cases, we can cleanly assign each weight element to exactly one component.
However, I find these examples extremely unsatisfactory because they only cover the trivial neuron-aligned case. When each neuron is dedicated to exactly one component (monosemantic), parameter decomposition is trivial. In realistic models, we strongly expect neurons to not be monosemantic (superposition, computation in superposition), and we don’t know whether weight linearity holds in those cases.
Intuition in favour of weight linearity: If neurons behave as described in circuits in superposition (Bushnaq & Mendel), then I am optimistic about weight linearity. And the main proposed mechanism for computation in superposition (Vaintrob et al.) works like this too. But we have no trained models that we know to behave this way.^[4]
Intuition against weight linearity: Think of a general arrangement of multiple inputs feeding into one ReLU neuron. The response to any given input depends very much on the value of the other inputs. Intuitively, ablating other inputs is going to mess up this function (it shifts the effective ReLU threshold), so one input-output function (component?) cannot work independently of the others. Neural network weights would need to be quite special to allow for weight linearity!
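A tiny illustration of this intuition, with hypothetical weights and bias chosen so that neither input alone crosses the ReLU threshold:

```python
import numpy as np

def relu_neuron(x, weights, bias):
    return np.maximum(0.0, weights @ x + bias)

weights = np.array([1.0, 1.0])  # hypothetical weights
bias = -1.0                     # hypothetical bias (sets the ReLU threshold)

both = relu_neuron(np.array([0.75, 0.75]), weights, bias)
only_first = relu_neuron(np.array([0.75, 0.0]), weights, bias)   # second input ablated
only_second = relu_neuron(np.array([0.0, 0.75]), weights, bias)  # first input ablated

print(both, only_first, only_second)
```

With both inputs present the neuron outputs 0.5, but with either input ablated it outputs 0, so the response to one input cannot be described independently of the other.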
I’m genuinely unsure what the correct answer is. I’d love to see project (ideas) for testing this assumption!
1. ^
  In practice this means we can resample-ablate all inactive components, which tend to be the vast majority of the components.
2. ^
  Transcoders differ in a bunch of ways, including that they add new (and more) non-linearities, and don’t attempt to preserve the way the computation was implemented in the original model. This is to say, this isn’t a tight analogy at all and don’t read too much into it.
3. ^
  One way to see this is from an information theory perspective (thanks to @Lucius Bushnaq for this perspective): Imagine a hypothetical 2D space with 10⁸ feature directions. Describing the 2x2 matrix as 10⁸ individual components requires vastly more bits than the original matrix had.
4. ^
  We used to think that our Compressed Computation toy model was an example of real Computation in Superposition, but have since realized that it’s probably not.

StefanHex 24 Aug 2025 12:43 UTC
11 points
2
on: Shorter Tokens Are More Likely
Shouldn’t this be generally “likely tokens are even more likely”? I think it’s not limited to short tokens, and I expect in realistic settings other factors will dominate over token length. But I agree that top-k (or top-p) sampling should lead to a miscalibration of LLM outputs in the low-probability tail.
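As a minimal sketch of the “likely tokens are even more likely” effect, here is top-k truncation on a made-up next-token distribution (the probabilities and `k` are hypothetical):

```python
import numpy as np

def top_k_renormalize(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    probs = np.asarray(probs, dtype=float)
    keep = np.argsort(probs)[::-1][:k]
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    return truncated / truncated.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])  # hypothetical next-token distribution
sampled = top_k_renormalize(probs, k=2)
print(sampled)
```

The two kept tokens rise from 0.5 and 0.3 to 0.625 and 0.375, while the tail is assigned zero probability regardless of token length.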

I suspect this has something to do with “LLM style”. LLMs may be pushed to select “slop” words because those words have more possible endings, even if none of those endings are the best one.

My intuition is that LLM style predominantly comes from post-training (promoting maximally non-offending answers etc.) rather than due to top-k/p sampling. (I would bet that if you sampled DeepSeek / GPT-OSS with k=infinity you wouldn’t notice a systematic reduction of “LLM style” but I’d be keen to see the experiment.)

Thanks for the writeup, I appreciated the explanations and especially the Alice/Bob/Blake example!

StefanHex 21 Aug 2025 9:53 UTC
2 points
0
on: Discovering Backdoor Triggers
Interesting project, thanks for doing this!

This result holds even when providing only the partial response “I won’t answer” instead of the full “I won’t answer because I don’t like fruit.”

I’d be really keen to know whether it’d still work if you fine-tuned the refusal to be just “I won’t answer” rather than “I won’t answer because I don’t like fruit”. Did you try anything like that? Or is there a reason you included fruit in the backdoor? Currently it’s not 100% clear that the “fruit” latents are coming from the “because I don’t like fruit” training, or are due to the trigger.

Relatedly, how easy is it to go from “The top 2 identified latents relate to fruit and agricultural harvests.” to find an actual trigger sentence? Does anything related to fruit or agricultural harvests work?

I like the blinded experiment with the astrology trigger! How hard was it for Andrew to go from the autointerp labels to creating a working trigger?

Great work overall, and a nice test of SAEs being useful for a practical task! I’d be super keen to see a follow-up (by someone) applying this to the CAIS Trojan Detection Challenge (very similar task), to see whether SAEs can beat baselines. [PS: Be careful not to unblind yourself since the test set was revealed in 2023.]

StefanHex 13 Aug 2025 13:45 UTC
12 points
0
on: Alternative Models of Superposition
Reposting my Slack comment here for the record: I’m excited to see challenges to our fundamental assumptions and exploration of alternatives!
Unfortunately, I think that the modified loss function makes the task a lot easier, and the results not applicable to superposition. (I think @Alex Gibson makes a similar point above.)
In this post, we use a loss function that focuses only on reconstructing active features
It is much easier to reconstruct the active feature without regard for interference (inactive features also appearing active).
In general, I find that the issue in NNs is that you not only need to “store” things in superposition, but be able to read them off with low error / interference. Chris Olah’s note on “linear readability” here (inspired by the Computation in Superposition work) describes that somewhat.
We’ve experimented with similar loss function ideas (almost the same as your loss actually, for APD) at Apollo, but always found that ignoring inactive features makes the task unrealistically easy.
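To illustrate why ignoring inactive features makes the task easier, here is a sketch comparing a full MSE with an active-only MSE. The masked loss is my guess at the general shape of such a loss, not the exact one from the post or from our APD experiments, and the numbers are made up.

```python
import numpy as np

def full_mse(target, reconstruction):
    return float(np.mean((target - reconstruction) ** 2))

def active_only_mse(target, reconstruction):
    """MSE restricted to features that are active (non-zero) in the target."""
    active = target != 0
    return float(np.mean((target[active] - reconstruction[active]) ** 2))

target = np.array([1.0, 0.0, 0.0, 0.0])          # one active feature
reconstruction = np.array([1.0, 0.4, 0.4, 0.4])  # interference on inactive features

print("active-only loss:", active_only_mse(target, reconstruction))
print("full loss:", full_mse(target, reconstruction))
```

The active-only loss is zero even though three inactive features wrongly appear active; only the full loss penalises that interference.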

StefanHex 6 Aug 2025 16:34 UTC
4 points
0
in reply to: quetzal_rainbow’s comment on: The Problem
I think Daniel didn’t mean quote in the “give credit for” (cite) sense, but in the “quote well-known person to make statement more believable” sense. I think you may have understood it as the former?

StefanHex 6 Aug 2025 10:11 UTC
4 points
0
on: Zoom Out: Distributions in Semantic Spaces
Thanks for posting this! Your description of transformations between layers, squashing & folding etc., reminds me of some old-school ML explanations about “how do multi-layer perceptrons work” (this is not meant as a bad thing, but a potential direction to look into!). I can’t think of references right now.

It also reminds me of Victor Veitch’s group’s work, e.g. Park et al., though pay special attention to the refutation(?) of this particular paper.

Finally, I can imagine connecting what you say to my own research agenda around “activation plateaus” / “stable regions”. I’m in the process of producing a better write-up to explain my ideas, but essentially I have the impression that NNs map discrete regions of activation space to specific activations later on in a model (squashing?), and wonder whether we can make use of these regions.

StefanHex 5 Aug 2025 10:05 UTC
2 points
0
on: It turns out that DNNs are remarkably interpretable.
Hmm, I got a couple of questions. Quoting from the abstract,

In this paper we argue that ReLU networks learn an implicit linear model we can actually tap into.

What do you mean by linear model? In particular, do you mean “Actually DNNs are linear”? Because that’s importantly not true; linear models cannot do the things we care about.

We describe that alleged model formally and show that we can approximately pull its decision boundary back to the input space with certain simple modification to the backward pass. The resulting gradients (called excitation pullbacks) reveal high-resolution input- and target-specific features of remarkable perceptual alignment [...]

Figure 1 is less strong evidence than one might initially think. Various saliency map techniques fell for interpretability illusions in the past, see the canonical critique here.

That said, I haven’t read your full paper. Do you still think your method is working after considering the saliency map illusions?

Also, I want to reward people thinking of new interpretability ideas and talking about them, thank you for doing so!
