NLA explanations can be shortened without harming reconstruction
Natural language autoencoders are a really cool mostly-unsupervised method for producing free-form text explanations of LLM activations. You should read the NLA paper (or the blog post) about them before reading this.
I trained[1] several Qwen3-8B NLAs with different length penalties: during RL, I subtracted the token count multiplied by the length penalty hyperparameter (λ) from the RL reward[2]. I found that with a small length penalty (λ=0.002), you can reduce the length of NLA explanations by ~40% (compared to having no length penalty) at a fairly small cost in FVE of −0.015 (FVE is the fraction of variance explained: 0 is guessing the mean activation, 1 is perfect reconstruction). With an even smaller penalty (λ=0.001), the FVE is almost unchanged (+0.007) despite explanations using 28% fewer tokens.
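Here is a minimal sketch of the two quantities in the paragraph above. It assumes a standard definition of FVE (1 minus the squared error divided by the variance around the mean activation) and treats the pre-penalty RL reward as an opaque number, since the post doesn't spell out either. The names `fve` and `penalized_reward` are illustrative and are not taken from nanoNLA or the original NLA code.

```python
from statistics import fmean


def fve(targets, reconstructions):
    """Fraction of variance explained over a batch of activation vectors.

    Assumed definition: 1 - (squared error) / (squared deviation from the
    per-dimension mean). 0 matches always guessing the mean activation,
    1 matches perfect reconstruction.
    """
    dim = len(targets[0])
    mean = [fmean(t[i] for t in targets) for i in range(dim)]
    sq_err = sum(
        (t[i] - r[i]) ** 2
        for t, r in zip(targets, reconstructions)
        for i in range(dim)
    )
    total = sum((t[i] - mean[i]) ** 2 for t in targets for i in range(dim))
    return 1.0 - sq_err / total


def penalized_reward(base_reward, num_tokens, lam):
    """Subtract lam * token count from the reward, as described in the post.

    base_reward stands in for whatever reconstruction-based reward the RL
    loop already computes (an assumption; its exact form is not given here).
    """
    return base_reward - lam * num_tokens
```

For a sense of scale: at λ=0.002, a 100-token explanation pays a penalty of 0.2 reward, so each extra token has to buy roughly 0.002 of reward to be worth including.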
Being able to reduce the length so much without impacting FVE is interesting because it could mean that large parts of NLA explanations aren’t actually useful for reconstructing the input activation faithfully. Some of this is because the length penalty makes the model use terser wording to convey the same ideas; I’m not sure how much of the effect comes from terser wording vs. omitting unneeded information.
Larger λ values cause the FVE to drop below that of the warm-started (pre-RL) model, which makes sense: AVs (activation verbalizers) trained with high λ values have far fewer tokens to work with, so they have less room than the warm-started model. It’s interesting that they’re still pretty good (relative to λ=0) though!
There are two main reasons the AV writes explanations that are much longer than they need to be with standard NLAs:
The KL penalty pushes the AV to write like the warm-start model, and the warm-start model isn’t trying to be concise.
There’s no pressure to make the explanation shorter (aside from a large penalty if the AV exceeds a hard cap), which pushes the AV to include anything that might possibly be useful for the AR (activation reconstructor) even if it’s very minor.
The optimal explanation length probably varies a lot by model. Bigger models think about more stuff and might have less unused “slack” in their activation explanations. It would be pretty interesting to see what the curve from my graph above would look like with multiple models, across more penalties, and with more RL steps. Note that I did 250 RL steps for all of those penalties, but the RL runs for the higher penalties completed faster: a large part of the time spent in RL is AV explanation generation, which is faster when you’re generating fewer tokens. It might have been fairer to give each RL run the same amount of time rather than the same number of RL steps.
Look at some activations!
Here’s a widget I vibed up to let you see some explanations from the NLA I trained! I think it’s pretty interesting to look at a few of these.
The really high length penalty NLAs mostly end up only including the most important bits of the explanation. The highest length penalty one (λ=0.03) usually only includes a few words that repeat the tail of the input text (or just the final token), and occasionally it gives an empty explanation. This makes sense: the input text itself is very useful for reconstruction and if you only have a few tokens of explanation just repeating the final few seems pretty useful.
It’s important to note that while NLA training encourages better reconstruction of the activation as a whole, it doesn’t necessarily encourage reconstructing the bits that we humans care about the most! Repeating the final token is not interesting for interpretability (we already know it), but it is very useful for reconstructing activations. It’s pretty plausible that things like eval awareness only make up a fairly small part of activation space but take up a larger part of the space of things I care about.
What does this mean
This might mean that a large portion of what AVs say isn’t actually useful for reconstructing activations, and is instead just there to help satisfy the KL penalty and because there’s no strong pressure to omit useless bits.
It might also be interesting to look more systematically at what kinds of things get dropped under more severe length penalties. It would also be interesting to check more systematically whether steganography is happening here (like they do in the NLA paper).
Other notes
Training dynamics
The explanation length didn’t plateau during the 250 steps of RL I ran for λ=0.001 or λ=0.002; with more training you could probably get even more token-efficient explanations.
It would be interesting to see how the length penalty vs FVE tradeoff changes with more RLing.
Continuing training of the open weight NLAs
Originally I experimented with continuing RL training on the open weight NLAs from Anthropic (instead of training my own from scratch) using the original open source training code, and I got fairly similar results to what I described above for two penalty values in some smaller tests. Because the length penalized NLAs started from the same base NLA, in the samples I looked at, I saw that the length penalized ones tended to write explanations that looked like cut-down versions of the explanations from the base NLA.
I then switched to nanoNLA and trained my NLAs from scratch, because it was simpler and avoided a bunch of CUDA and infra problems.
Models
I put all of the models and data I used to create this post on Hugging Face.
Conclusion
It might be good to have a smallish length penalty when training NLAs to try to get them to be more faithful and avoid spurious claims. It would also be interesting to try this with larger models and more penalty values. Thanks to Celeste and @jim for giving me feedback on this research.
[1] I used Celeste’s nanoNLA instead of the original NLA code because it made the infra simpler to manage. nanoNLA isn’t an exact reimplementation of the original, but I don’t think any of the changes would affect the results I got much.
[2] This doesn’t change the warm-start SFT phase of training an NLA, so I reused the same warm-started NLA for everything. I also accidentally didn’t train the AR during RL for the results in this post (so the AV was optimizing for a fixed reconstructor). After publishing this post, I retrained NLAs for two of the length penalties with the training process fixed and got similar results.
I wonder what would happen if you trained a sequence of these with decreasing length penalty, where each one is warm-started from the previous one.
It’s conceivable that each phase would learn to act similarly to its predecessor up until the point where its predecessor would have stopped writing (because that’s a highly effective way to spend those tokens, and the model already knows about it at the outset), while continuing on to write more after that, filling in the next-most-important things that its predecessor didn’t have room to mention. Which would be pretty cool: a “multiscale” view of the activation laid out sequentially from left to right, with the coarsest-grained (highest MSE impact) information at the start and the finest-grained information at the end.
This would be a sort of “Matryoshka NLA”, which could be pretty cool. Matryoshka SAEs improved on some SAE pathologies (primarily feature absorption). The Matryoshka loss function could probably be used directly in the NLA case.
Thanks for pointing out the Matryoshka SAE connection.
Hmm, yeah, if this works it would be much more efficient than what I described, since you can train all the scales simultaneously.
However, something about Matryoshka loss feels not quite right to me in the NLA case, specifically the fact that the prefix lengths it uses for grading an example are determined “in advance” and can’t depend on information about that specific example.
This is appropriate in the SAE case, where you want each feature to have some context-independent meaning (and—unlike with AV-sampled natural-language tokens—the features don’t have pre-existing interpretations attached to them before training). But it seems awkward in token space, where some examples will inherently require longer descriptions than others to reach any given reconstruction-error threshold with equal fluency/legibility. “Describe what’s going on here in exactly 16 tokens” is more feasible in some cases than others[1], and if our loss encourages the model to do well at tasks like that across all examples, I’d worry that this would encourage steganography / unnatural-sounding word-pretzels / that sort of thing.
In the setup described in this post, the model has the freedom to use different lengths on different examples, “spending” more tokens in cases where the improvement in reconstruction error justifies the expenditure. The Matryoshka loss doesn’t adapt like this.
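To make the objection concrete, here is a rough sketch of what a Matryoshka-style loss with prefix lengths fixed in advance might look like in token space. This is my reading of the idea, not the Matryoshka SAE implementation: `fixed_prefix_loss`, `prefix_lengths`, and the `recon_loss` callable (standing in for "run the AR on this text and measure reconstruction error") are all hypothetical.

```python
def fixed_prefix_loss(tokens, prefix_lengths, recon_loss):
    """Sum reconstruction losses over prefixes whose lengths are set in advance.

    tokens: the AV's sampled explanation as a list of tokens.
    prefix_lengths: fixed lengths such as (4, 16, 64), the same for every example.
    recon_loss: hypothetical callable mapping a token list to a float error.
    """
    # Every example is graded at the same cut points, regardless of content.
    return sum(recon_loss(tokens[:n]) for n in prefix_lengths)


# Toy stand-in for the reconstructor: error shrinks as more tokens are seen.
def toy_recon_loss(prefix):
    return 1.0 / (1 + len(prefix))


example_loss = fixed_prefix_loss(["a", "b", "c", "d", "e"], (2, 4, 8), toy_recon_loss)
```

The cut points `(2, 4, 8)` don't depend on the example, which is exactly the non-adaptive part: an explanation that needs 5 tokens to say something precise is still graded at 2 and 4.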
I wonder whether the following would work: introduce one or more new tokens into the vocab which the AV will use to demarcate the end of each “graded prefix,” so that the AV samples would look like:
Then, have a term in the loss for each end-of-prefix separator, where each term is a sum of “reconstruction loss using the text up to that separator” and “length penalty on the tokens between it and the previous separator.”
This is a generalization of what happens with the EOS token in the setup described in the post, where the AV can sample that token earlier or later in the sequence and makes this decision by trading off reconstruction and length-penalty. With the end-of-prefix separators, it’s doing the exact same thing, but once for each nested prefix we’re going to grade.
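The separator proposal above could be sketched roughly as follows. Everything here is hypothetical: the separator string, the `recon_loss` callable (a stand-in for running the AR on the text so far), and the choice to strip earlier separators from the text before grading are assumptions for illustration, not something that has been tried.

```python
def graded_prefix_loss(tokens, sep_token, recon_loss, lam):
    """One loss term per end-of-prefix separator.

    Each term is: reconstruction loss using the text up to that separator,
    plus lam times the number of tokens between it and the previous separator.
    """
    total = 0.0
    prev_end = 0  # index just past the previous separator
    for i, tok in enumerate(tokens):
        if tok != sep_token:
            continue
        # Text up to this separator, with earlier separators removed (assumption).
        prefix = [t for t in tokens[:i] if t != sep_token]
        segment_len = i - prev_end  # tokens written since the last separator
        total += recon_loss(prefix) + lam * segment_len
        prev_end = i + 1
    return total


# Toy stand-in for the reconstructor: error shrinks as more tokens are seen.
def toy_recon_loss(prefix):
    return 1.0 / (1 + len(prefix))


sample = ["a", "b", "<sep>", "c", "<sep>"]
sample_loss = graded_prefix_loss(sample, "<sep>", toy_recon_loss, lam=0.1)
```

With a single separator at the very end, this reduces to the reconstruction-plus-length-penalty trade-off from the post, with the separator playing the role of EOS.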
As a stylized example, consider two activations, each of which is representing a bunch of things which have well-defined names in natural language, but where:
in one case, the model’s tokenizer encodes those names very inefficiently
in the other, the model’s tokenizer encodes the names much more efficiently
If we had a combined loss with a length penalty as in the post, the AV might end up reciting all of the names in both cases even though that results in a longer text in the first case; we just need the marginal reconstruction-error benefit of including each name to exceed the marginal length-penalty cost.
However, if we grade the model on reconstruction based on the first N tokens—where N does not adaptively vary with the content of the examples—then N might be too small to list all the names in the first case (but not in the second), and this could result in unnatural-sounding (or just undesirably imprecise) verbalizations of examples like that first case.
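A toy version of this stylized example, with made-up numbers: each name has a token cost and a marginal reconstruction gain, and the two regimes decide differently which names make it into the text. The costs, gains, λ, and N below are invented for illustration, and the helper names are hypothetical.

```python
def names_kept_adaptive(names, lam):
    """Length-penalty regime: keep a name when its marginal gain beats lam * cost."""
    return [name for name, cost, gain in names if gain > lam * cost]


def names_kept_fixed_prefix(names, n_tokens):
    """Fixed-N regime: keep names in order until the token budget would be exceeded."""
    kept, used = [], 0
    for name, cost, gain in names:
        if used + cost > n_tokens:
            break
        kept.append(name)
        used += cost
    return kept


# (name, token cost, marginal reconstruction gain); all values are invented.
inefficient = [("A", 6, 0.05), ("B", 6, 0.05), ("C", 6, 0.05)]
efficient = [("A", 2, 0.05), ("B", 2, 0.05), ("C", 2, 0.05)]

adaptive_inefficient = names_kept_adaptive(inefficient, lam=0.002)
adaptive_efficient = names_kept_adaptive(efficient, lam=0.002)
fixed_inefficient = names_kept_fixed_prefix(inefficient, n_tokens=8)
fixed_efficient = names_kept_fixed_prefix(efficient, n_tokens=8)
```

Under the length penalty, all three names survive in both cases (the text is just longer when the tokenizer is inefficient). Under a fixed 8-token prefix, only the first name fits in the inefficient case, while all three fit in the efficient one.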
Yeah, I think the Matryoshka loss is more “mostly applicable” rather than “directly applicable”.
With Matryoshka SAEs there are two types of prefixes. The first is the dictionary subsets, which in our implementation were fixed and chosen in advance, which I agree doesn’t make sense for an NLA. The second is that the SAE could have K active features, and it could allocate the K active features across subsets as desired. This seems close to your “graded NLA prefixes” proposal.
There are also alternative forms of hierarchy that could be explored. For example, in Appendix C.1 of the Matryoshka paper there was an alternative implementation where the dictionary subset sizes were randomly sampled from a distribution when performing inference, which encouraged a more continuous hierarchy.
Another random idea (which may not be good) is to just pass the first K tokens of the AV verbalization to the AR (where K is chosen randomly), which may encourage the AV to put the more important stuff earlier in the generation, with the fine-grained details towards the end. But it could also encourage weird compression.
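The random-truncation idea could look something like this sketch. The uniform choice of K over 1..len(tokens) is an assumption (the comment above only says K is chosen randomly), and `truncate_for_reconstructor` is a hypothetical helper, not part of any NLA codebase.

```python
import random


def truncate_for_reconstructor(explanation_tokens, rng):
    """Return the first K tokens of the AV's explanation, with K drawn at random.

    K is sampled uniformly from 1..len(explanation_tokens) (an assumption).
    The AR would then be asked to reconstruct from this prefix only.
    """
    if not explanation_tokens:
        return []
    k = rng.randint(1, len(explanation_tokens))  # inclusive on both ends
    return explanation_tokens[:k]


rng = random.Random(0)  # seeded so the sketch is reproducible
tokens = ["legal", "contract", "clause", "about", "liability"]
prefix = truncate_for_reconstructor(tokens, rng)
```

Because the AV never knows where the cut will land, putting the highest-impact information first becomes the safest strategy, which is the ordering effect the idea is after.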
The widget actually works. SubhanAllah, Smitty.