neverix

Karma: 217

neverix 7 Jun 2026 22:36 UTC
9 points
0
on: neverix’s Shortform
(TL;DR: reproducing CODI attention map interpretability results with a dashboard.)

CODI (Shen et al. 2025) is a latent chain-of-thought method. Like Coconut, it distills a text-based CoT into latent tokens. Unlike Coconut, which only optimizes end-to-end model performance conditional on the CoT, CODI also aligns the hidden states of the response tokens so that the compressed latent CoT block is interpreted the same way as the text-based CoT. The authors show performance on GSM-8k comparable to text-based CoT for GPT-2.

In Section 5 of the paper, the authors of CODI show that intermediate steps in reasoning can be decoded by using the logit lens on soft tokens on GSM-8k.
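For readers unfamiliar with the logit lens, here is a minimal sketch of the idea in plain Python. It is not the paper's code: the toy vocabulary, the identity unembedding matrix, and the function names are made up for illustration, and the layer norm omits the learned gain and bias that a real GPT-2 final layer norm has.

```python
import math

def layer_norm(h, eps=1e-5):
    # Simplified final layer norm (assumption: no learned gain/bias).
    mean = sum(h) / len(h)
    var = sum((x - mean) ** 2 for x in h) / len(h)
    return [(x - mean) / math.sqrt(var + eps) for x in h]

def logit_lens(h, unembed, vocab, k=1):
    """Decode a hidden state (e.g. a latent CoT token) straight into vocabulary space.

    h:       hidden state, list of d floats
    unembed: unembedding matrix as a list of vocab-size rows, each of length d
    vocab:   token strings, aligned with the rows of `unembed`
    """
    normed = layer_norm(h)
    logits = [sum(w * x for w, x in zip(row, normed)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]

# Hypothetical toy setup: 3-token vocabulary, 3-dimensional hidden state.
vocab = ["7", "12", "the"]
unembed = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
latent = [0.1, 2.0, -0.5]
top = logit_lens(latent, unembed, vocab)
```

In a real model, `latent` would be the hidden state at a latent token position and `unembed` the model's output embedding matrix.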

Additionally, as shown in the figure, CODI visualizes “attended tokens” for each latent token. These are surprisingly meaningful and show something close to the actual sequence of intermediate values in the expression tree.

However, the procedure for generating these dashboards isn’t described in the paper or the official codebase. The attended tokens are the top K tokens in the context by the value of the attention map, and the paper does not clarify whether this means the attention map for some head, the average of maps for all heads in some layer, or something else.

I emailed an author to ask about the details of the method and was told that the attention map is the average probability over all layers and heads with no other tricks. I Clauded an implementation of this aggregation and a dashboard. The intermediate results are visible in the top-1 prediction on the odd-numbered tokens, and even-numbered tokens seem to predict random characters.
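As a rough sketch of that aggregation (not the actual dashboard code), here it is in plain Python. The nested-list layout `attn[layer][head][key_position]`, the function names, and the toy numbers are all assumptions for illustration; in practice the probabilities would come from the model's attention outputs for one latent query position.

```python
def average_attention(attn):
    """Mean attention probability per context position, over all layers and heads.

    attn[layer][head][key_pos] holds the attention probabilities of a single
    query position (one latent token) over the context.
    """
    n_maps = sum(len(layer) for layer in attn)
    n_keys = len(attn[0][0])
    avg = [0.0] * n_keys
    for layer in attn:
        for head in layer:
            for pos, prob in enumerate(head):
                avg[pos] += prob / n_maps
    return avg

def top_k_attended(attn, tokens, k=3):
    # Rank context tokens by their layer- and head-averaged attention.
    avg = average_attention(attn)
    ranked = sorted(range(len(tokens)), key=lambda i: avg[i], reverse=True)
    return [(tokens[i], avg[i]) for i in ranked[:k]]

# Hypothetical toy example: 2 layers x 2 heads, 4 context tokens.
tokens = ["3", "apples", "plus", "4"]
attn = [
    [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]],
    [[0.4, 0.1, 0.1, 0.4], [0.2, 0.2, 0.2, 0.4]],
]
top = top_k_attended(attn, tokens, k=2)
```

Since every head's row sums to 1, the averaged map is still a probability distribution over the context, so scores are comparable across latent tokens.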

Recent work shows a similar pattern in Llama 3.2 1B: tokens that have an interpretable prediction under the logit lens predictably alternate with tokens that don’t seem to correspond to anything. The authors show that these seemingly meaningless tokens probably store the result of the previous computation, the one that was visible in the logit lens on the odd-numbered tokens. This also seems to be true in GPT-2: the even-numbered tokens most strongly attend to the previous token (though this is not necessary for them to perform the function), and later tokens that operate on results computed in odd-numbered tokens attend to the even-numbered tokens instead.

Overall, the evidence points to odd-numbered latents computing intermediate values by attending to 1-2 tokens and even-numbered latents storing these results for reference. In the dashboard, in most of the cases shown, every odd token, including the first, attends to tokens that are operands of some binary arithmetical operation. When the operands are in the prompt and not the CoT, the attention may fall on some unrelated token that follows the actual value. When the computation is complete, the top-1 predicted tokens on the latent vectors stop changing and settle on one number. This number is not the answer; however, it can be used in a single binary operation to get the model’s final answer.

It’s unclear if this transparent encoding will remain robust at scale on more complicated and specific tasks. At the very least, the way the top attended-to tokens correspond to operands has to be specific to simple arithmetic. The base model’s (teacher-forced) CoT that CODI distills from also has meaningful all-head-averaged attentions, with the caveat that the text-based CoT has to insert the operands explicitly. There may be a more general correspondence between features of the algorithm implemented by the model in text-based and latent CoT. Liang et al. 2026 show that CODI can learn some other arithmetic tasks with similarly interpretable top predictions and attention maps.

Scratchpad thinking (Goyal et al. 2025) from Algoverse also uses attention averaged over heads and layers and shows the same pattern of alternating computation and storage. I saw this paper when running these experiments ~5 months ago but didn’t realise they used the same methodology for attention pattern visualization. The findings (if they can be called that) in this post are a subset of those in the paper, but we provide examples of intermediate predictions and attention patterns in addition to the averaged plots from Goyal et al.

neverix’s Shortform

neverix 7 Jun 2026 22:27 UTC

6 points

1 comment · 1 min read · LW link

neverix 24 Feb 2026 22:41 UTC
1 point
1
on: The Paris AI Anti-Safety Summit
The TOC links on the LessWrong version point to the Substack article. They’re not useful compared to LessWrong’s built-in section viewer, but I got tripped up.

neverix 13 Jan 2026 0:28 UTC
7 points
1
in reply to: Florian_Dietz’s comment on: Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
I’m curious because we have a bunch of methods now to interpret and let models introspect on steering vectors, which might produce mildly interesting results. According to the paper, it’s also easy (with the right injection point) to tell if a LoRA for a layer is basically a conditional steering vector or set of steering vectors—you can just look at the weight row PCA.

neverix 12 Jan 2026 19:47 UTC
4 points
0
on: Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report)
I wonder to what extent this LoRA can be replaced with a steering vector? Like in https://arxiv.org/abs/2507.08218

Lessons from a year of university AI safety field building

yix, afterless, Parv Mahajan, Andersehen, Tuna and neverix

6 Jun 2025 14:35 UTC

35 points

3 comments · 7 min read · LW link

Evolutionary prompt optimization for SAE feature visualization

neverix, Daniel Tan, Dmitrii Kharlapenko, Neel Nanda and Arthur Conmy

14 Nov 2024 13:06 UTC

28 points

0 comments · 9 min read · LW link

neverix 16 Oct 2024 17:48 UTC
2 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: SAE features for refusal and sycophancy steering vectors
1. It doesn’t really make sense to interpret feature activation values as log probabilities. If we did, we’d have to worry about scaling. It’s also not guaranteed the score wouldn’t just decrease because of decreased accuracy on correct answers.
2. Phi seems specialized for MMLU-like problems and has an outsized score for a model its size; I would be surprised if it’s biased because of the format of the question. However, it’s possible that using answers instead of letters would improve raw accuracy in this case: the feature we used (45142) seems to max-activate on plain text rather than multiple-choice answers, and it’s somewhat surprising that it does this well on multiple-choice questions. For this reason, I think using text answers will boost the feature’s score but won’t change the relative ranking of features by accuracy. I don’t know how to adapt your proposed experiment for the task of finding how accurate a feature is at eliciting the model’s knowledge of correct answers.
3. This is a technical detail. We use a jax.lax.scan by the layer index and store layers in one structure with a special named axis. This mainly improves compilation time. Penzai has since implemented the technique in the main library.

SAE features for refusal and sycophancy steering vectors

neverix, Dmitrii Kharlapenko, Arthur Conmy and Neel Nanda

12 Oct 2024 14:54 UTC

29 points

4 comments · 7 min read · LW link

neverix 8 Sep 2024 17:00 UTC
9 points
2
in reply to: eggsyntax’s comment on: Self-explaining SAE features
- We use our own judgement as a (potentially very inaccurate) proxy for accuracy as an explanation and let readers look on their own at the feature dashboard interface. We judge using a random sample of examples at different levels of activation. We had an automatic interpretation scoring pipeline that used Llama 3 70B, but we did not use it because (IIRC) it was too slow to run with multiple explanations per feature. Perhaps it is now practical to use a method like this.
- That is a pattern that happens frequently, but we’re not confident enough to propose any particular form. It is sometimes thrown off by random spikes, self-similarity gradually rising at larger scales, or entropy peaking in the beginning. Because of this, there is still a lot of room for improvement in cases where a human (or maybe a peak-finding algorithm) could do better than our linear metric.

Extracting SAE task features for in-context learning

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

12 Aug 2024 20:34 UTC

31 points

1 comment · 9 min read · LW link

neverix 11 Aug 2024 19:25 UTC
6 points
2
on: You should go to ML conferences
- Genie: Generative Interactive Environments Bruce et al.
How is that paper alignment-relevant?

Self-explaining SAE features

Dmitrii Kharlapenko, neverix, Neel Nanda and Arthur Conmy

5 Aug 2024 22:20 UTC

62 points

13 comments · 10 min read · LW link

neverix 15 Jun 2024 3:12 UTC
4 points
0
on: Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.
Freshman’s dream sparsity loss
A similar regularizer is known as Hoyer-Square.
Pick a value for $k$ and a small $ϵ \geq 0$ . Then define the activation function $T_{k, ϵ}$ in the following way. Given a vector $x$ , let $b$ be the value of the $k$ th-largest entry in $x$ . Then define the vector $T_{k, ϵ} (x)$ by
Is $a$ in the following formula a typo?

neverix 23 Nov 2023 22:34 UTC
2 points
0
in reply to: Neel Nanda’s comment on: 200 COP in MI: Exploring Polysemanticity and Superposition
To clarify, I thought it was about superposition happening inside the projection afterwards.

neverix 21 Nov 2023 22:30 UTC
2 points
0
on: 200 COP in MI: Exploring Polysemanticity and Superposition
This happens in transformer MLP layers. Note that the hidden dimen
Is the point that transformer MLPs blow up the hidden dimension in the middle?

neverix 31 Aug 2023 19:41 UTC
2 points
0
on: Steering GPT-2-XL by adding an activation vector
Activation additions in generative models
Also related is https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the UNet bottleneck in diffusion to edit images using CLIP.

neverix 24 Aug 2023 18:18 UTC
4 points
0
on: The Low-Hanging Fruit Prior and sloped valleys in the loss landscape
From a conversation on Discord:
Do you have in mind a way to weigh sequential learning into the actual prior?
Dmitry:
good question! We haven’t thought about an explicit complexity measure that would give this prior, but a very loose approximation that we’ve been keeping in the back of our minds could be a Turing machine/Boolean circuit version of the “BIMT” weight penalty from this paper https://arxiv.org/abs/2305.08746 (which they show encourages modularity at least in toy models)
Response:
Hmm, BIMT seems to only be about intra-layer locality. It would certainly encourage learning an ensemble of features, but I’m not sure if it would capture the interesting bit, which I think is the fact that features are built up sequentially from earlier to later layers and changes are only accepted if they improve local loss.
I’m thinking about something like the existence of a relatively smooth scaling law (?) as the criterion.
So, just some smoothness constraint that would basically integrate over paths SGD could take.

neverix 2 Aug 2023 13:26 UTC
2 points
0
on: Inference from a Mathematical Description of an Existing Alignment Research: a proposal for an outer alignment research program
We can idealize the outer alignment solution as a logical inductor.
Why outer?

neverix 1 Aug 2023 8:20 UTC
2 points
0
on: The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited
You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.
