The TOC links on the Lesswrong version point to the Substack article. It’s not useful compared to Lesswrong’s built-in section viewer, but I got tripped up.
neverix
neverix’s Shortform
I’m curious because we have a bunch of methods now to interpret and let models introspect on steering vectors, which might produce mildly interesting results. According to the paper, it’s also easy (with the right injection point) to tell if a LoRA for a layer is basically a conditional steering vector or set of steering vectors—you can just look at the weight row PCA.
I wonder to what extent this LoRA can be replaced with a steering vector? Like in https://arxiv.org/abs/2507.08218
Lessons from a year of university AI safety field building
Evolutionary prompt optimization for SAE feature visualization
It doesn’t really make sense to interpret feature activation values as log probabilities. If we did, we’d have to worry about scaling. It’s also not guaranteed the score wouldn’t just decrease because of decreased accuracy on correct answers.
Phi seems specialized for MMLU-like problems and has an outsized score for a model its size, so I would be surprised if it’s biased because of the format of the question. However, it’s possible that using answers instead of letters would help improve raw accuracy in this case, because the feature we used (45142) seems to max-activate on plain text and not multiple-choice answers, and it’s somewhat surprising it does this well on multiple-choice questions. For this reason, I think using text answers will boost the feature’s score but won’t change the relative ranking of features by accuracy. I don’t know how to adapt your proposed experiment for the task of finding how accurate a feature is at eliciting the model’s knowledge of correct answers.
This is a technical detail. We use a jax.lax.scan over the layer index and store layers in one structure with a special named axis. This mainly improves compilation time. Penzai has since implemented the technique in the main library.
SAE features for refusal and sycophancy steering vectors
We use our own judgement as a (potentially very inaccurate) proxy for how accurate an explanation is, and let readers look at the feature dashboard interface on their own. We judge using a random sample of examples at different levels of activation. We had an automatic interpretation scoring pipeline that used Llama 3 70B, but we did not use it because (IIRC) it was too slow to run with multiple explanations per feature. Perhaps it is now practical to use a method like this.
That is a pattern that happens frequently, but we’re not confident enough to propose any particular form. It is sometimes thrown off by random spikes, self-similarity gradually rising at larger scales, or entropy peaking in the beginning. Because of this, there is still a lot of room for improvement in cases where a human (or maybe a peak-finding algorithm) could do better than our linear metric.
Extracting SAE task features for in-context learning
Genie: Generative Interactive Environments Bruce et al.
How is that paper alignment-relevant?
Self-explaining SAE features
Freshman’s dream sparsity loss
A similar regularizer is known as Hoyer-Square.
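For reference, a minimal sketch of Hoyer-Square as I understand the usual definition: the squared L1 norm divided by the squared L2 norm. The function name and the convention of returning 0 for an all-zero vector are my own assumptions for illustration.

```python
def hoyer_square(values):
    """Hoyer-Square measure: (sum |x_i|)^2 / (sum x_i^2).

    Ranges from 1 (a single nonzero entry) to len(values) (all entries
    equal in magnitude), and does not change when the vector is rescaled.
    """
    l1 = sum(abs(v) for v in values)
    l2_squared = sum(v * v for v in values)
    if l2_squared == 0.0:
        # Assumed convention: the ratio is undefined at zero, return 0.
        return 0.0
    return l1 * l1 / l2_squared


one_hot = [0.0, 3.0, 0.0, 0.0]
dense = [1.0, 1.0, 1.0, 1.0]
print(hoyer_square(one_hot))  # 1.0
print(hoyer_square(dense))    # 4.0
```

Minimizing this pushes activations toward having few nonzero entries without shrinking their overall scale, which is the property that makes it comparable to other scale-invariant sparsity penalties.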
Pick a value for and a small . Then define the activation function in the following way. Given a vector , let be the value of the th-largest entry in . Then define the vector by
Is in the following formula a typo?
To clarify, I thought it was about superposition happening inside the projection afterwards.
This happens in transformer MLP layers. Note that the hidden dimen
Is the point that transformer MLPs blow up the hidden dimension in the middle?
Activation additions in generative models
Also related is https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the UNet bottleneck in diffusion to edit images using CLIP.
From a conversation on Discord:
Do you have in mind a way to weigh sequential learning into the actual prior?
Dmitry:
good question! We haven’t thought about an explicit complexity measure that would give this prior, but a very loose approximation that we’ve been keeping in the back of our minds could be a Turing machine/Boolean circuit version of the “BIMT” weight penalty from this paper https://arxiv.org/abs/2305.08746 (which they show encourages modularity at least in toy models)
Response:
Hmm, BIMT seems to only be about intra-layer locality. It would certainly encourage learning an ensemble of features, but I’m not sure if it would capture the interesting bit, which I think is the fact that features are built up sequentially from earlier to later layers and changes are only accepted if they improve local loss.
I’m thinking about something like the existence of a relatively smooth scaling law (?) as the criterion.
So, just some smoothness constraint that would basically integrate over paths SGD could take.
We can idealize the outer alignment solution as a logical inductor.
Why outer?
You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.
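A toy sketch of the ranking step, under heavy assumptions: a linear model with squared loss stands in for the LLM, and the sample names, weights, and data are all made up for illustration. With a real model the gradients would come from backpropagation and would probably need to be projected or restricted to a subset of parameters.

```python
import math


def grad_squared_loss(weights, x, y):
    """Gradient of (w . x - y)^2 with respect to w for one sample."""
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    return [2.0 * err * xi for xi in x]


def cosine(a, b):
    """Cosine similarity between two gradient vectors (0 if either is zero)."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm > 0 else 0.0


def mean_grad(weights, samples):
    """Average per-sample gradient over a small reference task."""
    grads = [grad_squared_loss(weights, x, y) for x, y in samples]
    return [sum(col) / len(grads) for col in zip(*grads)]


weights = [0.0, 0.0, 0.0]
# Hypothetical stand-in for the "spelling task" training samples.
task_samples = [([1.0, 0.0, 0.0], 1.0), ([1.0, 0.1, 0.0], 1.0)]
# Hypothetical stand-in for the corpus being searched.
corpus = {
    "a": ([0.9, 0.0, 0.1], 1.0),
    "b": ([0.0, 1.0, 0.0], 1.0),
    "c": ([0.0, 0.0, 1.0], -1.0),
}

task_grad = mean_grad(weights, task_samples)
scores = {
    name: cosine(grad_squared_loss(weights, x, y), task_grad)
    for name, (x, y) in corpus.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Samples whose gradient points the same way as the task gradient rank first; here that is the sample that shares most of its input direction with the task samples.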
TL;DR: reproducing CODI attention map interpretability results with a dashboard.
CODI (Shen et al. 2025) is a latent chain-of-thought method. Like Coconut, it distills a text-based CoT into latent tokens. Unlike Coconut, which only optimizes the end-to-end model performance conditional on the CoT, CODI also aligns the hidden states of the response tokens to ensure the compressed latent CoT block is interpreted the same way as text-based CoT. The authors show performance on GSM-8k comparable to text-based CoT for GPT-2.
In Section 5 of the paper, the authors of CODI show that intermediate steps in reasoning can be decoded by using the logit lens on soft tokens on GSM-8k.
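A minimal sketch of the logit lens on a single latent vector, assuming a toy vocabulary and a made-up unembedding matrix. A real implementation would take the unembedding from the model and would usually apply the final layer norm first, which is omitted here.

```python
def logit_lens(hidden, unembed, vocab, k=2):
    """Project a hidden state through the unembedding and return the top-k tokens.

    hidden:  list of floats, length d_model
    unembed: one row of length d_model per vocabulary entry
    vocab:   token strings, same order as the rows of unembed
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembed]
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [(vocab[i], logits[i]) for i in ranked[:k]]


# Hypothetical 4-token vocabulary and 3-dimensional model.
vocab = ["12", "7", "the", "+"]
unembed = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
]
latent = [2.0, 0.1, -1.0]  # stand-in for one soft token's hidden state

top_tokens = logit_lens(latent, unembed, vocab)
print([token for token, _ in top_tokens])  # ['12', '+']
```

Reading off the top-1 token at each latent position is what gives the sequence of intermediate values discussed here.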
Additionally, as shown in the figure, CODI visualizes “attended tokens” for each latent token. These are surprisingly meaningful and show something close to the actual sequence of intermediate values in the expression tree.
However, the procedure for generating these dashboards isn’t described in the paper or the official codebase. The attended tokens are the top K tokens in the context by the value of the attention map, and the paper does not clarify whether this means the attention map for some head, the average of maps for all heads in some layer, or something else.
I emailed an author to ask about the details of the method and was told that the attention map is the average probability over all layers and heads with no other tricks. I Clauded an implementation of this aggregation and a dashboard. The intermediate results are visible in the top-1 prediction on the odd-numbered tokens, and even-numbered tokens seem to predict random characters.
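The aggregation as described can be sketched in a few lines. This is not the dashboard code: it is a toy version that assumes the attention probabilities are already available as nested lists indexed `attn[layer][head][query][key]`, and the tokens and numbers are invented for illustration.

```python
def mean_attention(attn, query_pos):
    """Average attention probabilities from one query position over all layers and heads."""
    n_keys = len(attn[0][0][query_pos])
    total = [0.0] * n_keys
    count = 0
    for layer in attn:
        for head in layer:
            for key_pos, prob in enumerate(head[query_pos]):
                total[key_pos] += prob
            count += 1
    return [t / count for t in total]


def top_attended(attn, tokens, query_pos, k=2):
    """Top-k context tokens by averaged attention, restricted to positions <= query_pos."""
    avg = mean_attention(attn, query_pos)
    ranked = sorted(range(query_pos + 1), key=lambda i: avg[i], reverse=True)
    return [(tokens[i], avg[i]) for i in ranked[:k]]


def uniform_causal(n):
    """Placeholder causal attention matrix: each query attends uniformly to its prefix."""
    return [[1.0 / (q + 1) if key <= q else 0.0 for key in range(n)] for q in range(n)]


tokens = ["3", "*", "4", "<latent>"]
# Made-up attention rows for the latent token: 2 layers x 2 heads.
last_rows = [
    [[0.6, 0.1, 0.2, 0.1], [0.1, 0.1, 0.7, 0.1]],
    [[0.5, 0.0, 0.4, 0.1], [0.2, 0.2, 0.5, 0.1]],
]
attn = []
for layer_rows in last_rows:
    layer = []
    for row in layer_rows:
        head = uniform_causal(len(tokens))
        head[-1] = row  # only the latent token's row matters for this example
        layer.append(head)
    attn.append(layer)

attended = top_attended(attn, tokens, query_pos=3)
print([token for token, _ in attended])  # ['4', '3']
```

In this toy case no single head attends to both operands, but the average over layers and heads puts both of them on top, which is the kind of pattern the dashboard shows.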
Recent work shows a similar pattern in Llama 3.2 1B: tokens that have an interpretable prediction under the logit lens and tokens that don’t seem to correspond to anything will predictably alternate. They show that these meaningless tokens probably store the result of the previous computation, the one that was visible in logit lens on the odd-numbered tokens. This seems to also be true in GPT-2: the even-numbered tokens most strongly attend to the previous token (though this is not necessary for them to perform the function), and following tokens performing operations on results computed in odd-numbered tokens attend to the even-numbered tokens instead.
Overall, the evidence points to odd-numbered latents computing intermediate values by attending to 1-2 tokens and even-numbered latents storing these results for reference. In the dashboard, in most of the cases shown, every odd token, including the first, attends to tokens that are operands of some binary arithmetical operation. When the operands are in the prompt and not the CoT, the attention may fall on some unrelated token that follows the actual value. When the computation is complete, the top-1 predicted tokens on the latent vectors stop changing and settle on one number. This number is not the answer; however, it can be used in a single binary operation to get the model’s final answer.
It’s unclear if this transparent encoding will be robust with scale in more complicated and specific tasks. At least the way the top attended-to tokens correspond to operands has to be specific to simple arithmetic. The base model’s (teacher-forced) CoT that CODI distills from also has meaningful all-head-averaged attentions, with the caveat that the text-based CoT has to insert the operands explicitly. There may be a more general correspondence between features of the algorithm implemented by the model in text-based and latent CoT. Liang et al. 2026 shows CODI can learn some other arithmetic tasks with similarly interpretable top predictions and attention maps.
Scratchpad thinking (Goyal et al. 2025) from Algoverse also uses attention averaged over heads and layers and shows the same pattern of alternating computation and storage. I saw this paper when running these experiments ~5 months ago but didn’t realise they used the same methodology for attention pattern visualization. The findings (if they can be called that) in this post are a subset of those in the paper, but we provide examples of intermediate predictions and attention patterns in addition to the averaged plots from Goyal et al.