Yeah, I think this could work during training as well, although you may get some weird dynamics, because there is no penalty pushing highly-activating unhelpful latents to fire less. But I imagine you could at least use it as an auxiliary loss.
I used @Clément Dumas’ research agents scaffold: https://github.com/Butanium/claude-lab/
Bart Bussmann
Great post! Funnily enough, I did the exact same thing on the same task two weeks ago and my army of Claude agents found a different solution, reaching an F1 of 0.989!
Leave-One-Out Refinement
The innovation here is an inference-time method. The idea is that for each active latent, you ask whether removing it would actually hurt reconstruction. Concretely, you compute a projection score $s_i = r \cdot d_i + a_i \lVert d_i \rVert^2$, where $r = x - \hat{x}$ is the reconstruction residual, $d_i$ is the latent’s decoder vector, and $a_i$ is its activation. If $s_i \le \tau$, the latent wasn’t contributing meaningfully to reconstruction and gets zeroed out. The whole thing is a single vectorized forward pass (no iteration), and it removes roughly a third of active latents!
```python
x_hat = acts @ W_dec + b_dec                      # current reconstruction
residual = x - x_hat                              # reconstruction error
proj = residual @ W_dec.T + acts * dec_norms_sq   # LOO score per latent
keep = (proj > threshold) | (acts == 0)           # keep if score > τ
acts = acts * keep.float()                        # zero out spurious latents
```
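As a self-contained sketch of the same refinement step (NumPy, with random stand-in weights and dimensions of my own choosing, so purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, batch = 16, 64, 8

# Random stand-ins for the SAE decoder and a batch of latent activations.
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(d_model)
b_dec = np.zeros(d_model)
x = rng.normal(size=(batch, d_model))
acts = np.maximum(rng.normal(size=(batch, n_latents)), 0.0)  # sparse-ish, non-negative

dec_norms_sq = (W_dec ** 2).sum(axis=1)           # ||d_i||^2 per latent
threshold = 0.0

x_hat = acts @ W_dec + b_dec                      # current reconstruction
residual = x - x_hat                              # reconstruction error
proj = residual @ W_dec.T + acts * dec_norms_sq   # LOO score per latent
keep = (proj > threshold) | (acts == 0)           # keep if score > τ
refined = acts * keep                             # zero out spurious latents

print((acts > 0).sum(), "active latents before,", (refined > 0).sum(), "after")
```

Note that everything is a single batched matrix product, which is why the method is cheap relative to actually re-running the SAE once per removed latent.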
I’ve also not tested it on the real SAEBench, but it should be considerably cheaper to test, as it is an inference-time-only method. The full research report, written entirely by Claude, is here:
https://drive.google.com/file/d/1GSJrrPU6Q_TcwcjbsoF02yTOKvHhZiyj/view?usp=sharing
It’s funny how this is like a reverse of Searle’s Chinese room: a system meant to just shuffle some tokens around can’t help but understand their meaning!
Can we interpret latent reasoning using current mechanistic interpretability tools?
Update! I missed an entire evolutionary branch of the meme: “You can just do stuff” (rather than “things”).
In March 2021, @leaacta tweets: “life hack: you don’t have to explain yourself or understand anything, you can just do stuff”
And gets retweeted by a bunch of people in TPOT.
Then, in June 2022, comedian Rodney Norman posts a video called Go Be Weird with a motivational speech of some sort:
Hey, you know you can just do stuff?
Like, you don’t need anybody’s permission or anything.
You just… you just kind of come up with weird stuff you want to go do, and you just go do it.
Okay? Go be weird.
Okay, bye.
In August 2022, @nat_sharpe_ posts a video where he does a bunch of cool and weird things, with Rodney Norman’s speech as background audio. With this and subsequent posts, he seems to play a major role in popularizing “you can just do stuff” on Twitter.
Sam Altman is also involved in this version of the meme, and in September 2022 he posts: “you can just...do stuff it isn’t more complicated”
On the origins of “you can just do things”
About once every 15 minutes, someone tweets “you can just do things”. It seems like a rather powerful and empowering meme and I was curious where it came from, so I did some research into its origins. Although I’m not very satisfied with what I was able to reconstruct, here are some of the things that I found:
In 1995, Steve Jobs gives the following quote in an interview: “Life can be much broader, once you discover one simple fact, and that is that everything around you that you call life was made up by people that were no smarter than you. And you can change it, you can influence it, you can build your own things that other people can use. Once you learn that, you’ll never be the same again.”
Although he says nothing close to the phrase “you can just do things”, I think it’d be a fair summary of his message.
Fast forward to October 2020, when Twitter user nosilverv tweets: “PSA: if you feel bad you can just DO THINGS until you feel better”
Although this is the first tweet I’ve found[1] that contains the exact phrasing, it doesn’t seem to quite match the current sentiment. It seems to imply that you can do random things and shouldn’t let bad feelings stop you, rather than the “high agency” framing it has today.
A year later, in October 2021, @Neel Nanda publishes the blog post What’s Stopping You?. I think this post points exactly in the “you can just do things” direction, and contains the following quote: “Part of this mindset is taking responsibility—realising that you can do things and influence the world, and that by taking it upon yourself to fix or improve something the world will be better than if you did nothing.”
So close, just missing the “just”!
Then in February 2022, substacker crypticdefinitions publishes a post titled You can just do things, subtitled “Advice probably obvious to everyone but myself”:
“You can just go and try new things. It costs almost nothing and has extremely high potential upside. You can go try a new hobby or skill. You can pay for a singing lesson or squash coaching session or cooking course. You can email someone who seems somewhat interesting about a subject of mutual interest.”
From here on, in the first half of 2022, some people on Twitter seem to start to adopt this phrase, such as @AskYatharth and @m_ashcroft. The phrase seems to slowly but steadily gain popularity in TPOT in the years 2023-2024. In January 2024, Cate Hall writes a blog post on “How to be more agentic” and announces the working title of her book “You can just do things”.
In December 2024, Sam Altman tweets “you can just do things”, gets 25K likes, and the meme seems to break all containment.

[1] Unfortunately, the Twitter search function is completely broken and useless, so there may be earlier tweets I’ve not been able to find.
Current LLMs seem to rarely detect CoT tampering
When working with SAE features, I’ve usually relied on a linear intuition: a feature firing with twice the strength has about twice the “impact” on the model. But while playing with an SAE trained on the final layer, I was reminded that the actual direct impact on the relative token probabilities grows exponentially with activation strength. While a feature’s additive contribution to the logits is indeed linear in its activation strength, the ratio of the probabilities of two competing tokens is equal to the exponential of the logit difference: $\frac{P(A)}{P(B)} = e^{\mathrm{logit}(A) - \mathrm{logit}(B)}$.
If we have a feature that boosts logit(A) and not logit(B), and we multiply its activation strength by a factor of 5.0, this doesn’t 5x its effect on $P(A)/P(B)$, but rather raises its effect to the 5th power. If this feature caused token A to be three times as likely as token B before, it now makes this token $3^5 = 243$ times as likely! This might partly explain why the lower activations of a feature are often less interpretable than the top activations: their direct impact on the relative token probabilities is exponentially smaller.
Note that this only holds for the direct ‘logit lens’-like effect of a feature. This makes this intuition mostly applicable to features in the final layers of a model, as the impact of earlier features is probably mostly modulated by their effect on later layers.
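The arithmetic, as a quick sanity check (the logit values here are hypothetical, chosen only to illustrate the exponentiation):

```python
import math

# Suppose at activation strength 1.0 the feature adds log(3) to logit(A),
# making token A three times as likely as token B (hypothetical numbers).
boost_per_unit = math.log(3.0)

def prob_ratio(activation, base_logit_diff=0.0):
    """P(A)/P(B) = exp(logit(A) - logit(B)) after the feature fires."""
    return math.exp(base_logit_diff + activation * boost_per_unit)

print(prob_ratio(1.0))  # ≈ 3
print(prob_ratio(5.0))  # ≈ 243 = 3**5
```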
Interesting idea, I had not considered this approach before!
I’m not sure this would solve feature absorption though. Thinking about the “Starts with E-” and “Elephant” example: if the “Elephant” latent absorbs the “Starts with E-” latent, the “Starts with E-” feature will develop a hole and not activate anymore on the input “elephant”. After the latent is absorbed, “Starts with E-” wouldn’t be in the list to calculate cumulative losses for that input anymore.
Matryoshka works because it forces the early-indexed latents to reconstruct well using only themselves, whether or not later latents activate. I think this pressure is key to stopping the later-indexed latents from stealing the job of the early-indexed ones.
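In loss form, that pressure looks roughly like this (a minimal NumPy sketch; the prefix sizes and the plain sum over prefixes are my own assumptions, not the exact training objective):

```python
import numpy as np

def matryoshka_recon_loss(x, acts, W_dec, b_dec, prefix_sizes=(16, 32, 64)):
    """Sum of reconstruction losses where each term may only use the first k
    latents, so early-indexed latents must carry the reconstruction alone,
    whether or not later latents activate."""
    loss = 0.0
    for k in prefix_sizes:
        x_hat_k = acts[:, :k] @ W_dec[:k] + b_dec
        loss += ((x - x_hat_k) ** 2).mean()
    return loss

# Toy shapes, just to show the call.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
acts = rng.normal(size=(4, 64))
W_dec = rng.normal(size=(64, 8))
b_dec = np.zeros(8)
print(matryoshka_recon_loss(x, acts, W_dec, b_dec))
```

Because each prefix loss ignores everything past index k, a late latent can never “rescue” an early prefix, which is exactly the anti-stealing pressure.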
Although the code has the option to add an L1 penalty, in practice I set the l1_coeff to 0 in all my experiments (see main.py for all hyperparameters).
I haven’t actually tried this, but I recently heard about focusbuddy.ai, which might be a useful AI assistant in this space.
Learning Multi-Level Features with Matryoshka SAEs
Great work! I have been working on something very similar and will publish my results here some time next week, but can already give a sneak-peak:
The SAEs here were only trained for 100M tokens (1/3 of the TinyStories dataset). The language model was trained for 3 epochs on the 300M-token TinyStories dataset. It would be good to validate these results with more ‘real’ language models and to train SAEs with much more data.
I can confirm that on Gemma-2-2B Matryoshka SAEs dramatically improve the absorption score on the first-letter task from Chanin et al. as implemented in SAEBench!
Is there a nice way to extend the Matryoshka method to top-k SAEs?
Yes! My experiments with Matryoshka SAEs are using BatchTopK.
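For reference, a minimal sketch of the BatchTopK selection step (NumPy; details such as tie handling are simplified, and the shapes are illustrative):

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest pre-activations across the whole
    batch (rather than k per example), zeroing everything else."""
    n_keep = k * pre_acts.shape[0]
    flat = pre_acts.ravel()
    if n_keep >= flat.size:
        return pre_acts
    thresh = np.partition(flat, -n_keep)[-n_keep]   # n_keep-th largest value
    return np.where(pre_acts >= thresh, pre_acts, 0.0)

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(8, 32))
sparse = batch_topk(pre_acts, k=4)
print((sparse != 0).sum())  # 32 = 4 * 8 kept (ties aside)
```

Because sparsity is enforced per batch rather than per example, some examples can use more latents than others, which composes naturally with the nested Matryoshka prefixes.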
Are you planning to continue this line of research? If so, I would be interested to collaborate (or otherwise at least coordinate on not doing duplicate work).
This project seems to be trying to translate whale language.
You might enjoy this classic: https://www.lesswrong.com/posts/9HSwh2mE3tX6xvZ2W/the-pyramid-and-the-garden
Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (i.e. some combination of SAE and meta-SAE, preferably as unsupervised as possible)? Or alternatively, maybe the most useful decomposition changes from case to case, and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
There definitely seem to be multiple ways to interpret this work, as also described in SAE feature geometry is outside the superposition hypothesis. Either we need to find other methods and theory that somehow find more atomic features, or we need to get a more complete picture of what the SAEs are learning at different levels of abstraction and composition.
Both seem important and interesting lines of work to me!
Great work! Using spelling is a very clear example of how information gets absorbed into an SAE latent, and indeed in Meta-SAEs we found many spelling/sound-related meta-latents.
I have been thinking a bit about how to solve this problem, and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the “Starts-with-L”-component from the “lion”-token direction and activate the “Starts-with-L” latent instead. Although this would come at the cost of worse sparsity/reconstruction.
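The objective I have in mind would look something like this (a NumPy sketch; the exact loss form and the `lam` coefficient are hypothetical, and the shapes are toy stand-ins):

```python
import numpy as np

def mse(a, b):
    return ((a - b) ** 2).mean()

def adversarial_losses(x, sae_acts, W_dec, meta_acts, W_meta, lam=0.1):
    """GAN-style pairing: the meta-SAE minimizes its reconstruction of the
    SAE's decoder rows, while the SAE pays its usual reconstruction loss
    minus the meta-SAE's loss, rewarding hard-to-decompose directions."""
    meta_loss = mse(W_dec, meta_acts @ W_meta)              # meta-SAE objective
    sae_loss = mse(x, sae_acts @ W_dec) - lam * meta_loss   # SAE resists decomposition
    return sae_loss, meta_loss

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # model activations
sae_acts = rng.normal(size=(4, 32))  # SAE latent activations
W_dec = rng.normal(size=(32, 8))     # SAE decoder rows
meta_acts = rng.normal(size=(32, 16))
W_meta = rng.normal(size=(16, 8))    # meta-SAE decoder
sae_loss, meta_loss = adversarial_losses(x, sae_acts, W_dec, meta_acts, W_meta)
print(sae_loss, meta_loss)
```

In training, the two losses would be minimized by alternating (or separate) optimizer steps, as in a GAN, with each model only updating its own parameters.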
If you want to go full autonomous research mode you could even have another Claude find adversarial parameters of the SynthSAEBench dataset (within some reasonable constraints) to see where the methods break or would perform worse than baselines.
I imagine you could find some nice robust improvements this way.