Yeah, I think this could work during training as well, although you may get some weird dynamics, because there is no penalty pushing highly-activating unhelpful latents to fire less. But I imagine you could at least use it as an auxiliary loss.
I used @Clément Dumas’ research agents scaffold: https://github.com/Butanium/claude-lab/
Bart Bussmann
Great post! Funnily enough, I did the exact same thing on the same task two weeks ago and my army of Claude agents found a different solution, reaching an F1 of 0.989!
Leave-One-Out Refinement
The innovation here is an inference-time method. The idea is that for each active latent, you ask whether removing it would actually hurt reconstruction. Concretely, you compute a projection score $s_i = r \cdot d_i + a_i \lVert d_i \rVert^2$, where $r = x - \hat{x}$ is the reconstruction residual, $d_i$ is the latent’s decoder vector, and $a_i$ is its activation. If $s_i \le \tau$, the latent wasn’t contributing meaningfully to reconstruction and gets zeroed out. The whole thing is a single vectorized forward pass (no iteration), and it removes roughly a third of active latents!
```python
x_hat = acts @ W_dec + b_dec                      # current reconstruction
residual = x - x_hat                              # reconstruction error
proj = residual @ W_dec.T + acts * dec_norms_sq   # LOO score per latent
keep = (proj > threshold) | (acts == 0)           # keep if score > τ
acts = acts * keep.float()                        # zero out spurious latents
```
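As a self-contained sketch of the same refinement step (NumPy, with random stand-in weights and dimensions of my own choosing, so purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents, batch = 16, 64, 8

# Random stand-ins for the SAE decoder and a batch of latent activations.
W_dec = rng.normal(size=(n_latents, d_model)) / np.sqrt(d_model)
b_dec = np.zeros(d_model)
x = rng.normal(size=(batch, d_model))
acts = np.maximum(rng.normal(size=(batch, n_latents)), 0.0)  # sparse-ish, non-negative

dec_norms_sq = (W_dec ** 2).sum(axis=1)           # ||d_i||^2 per latent
threshold = 0.0

x_hat = acts @ W_dec + b_dec                      # current reconstruction
residual = x - x_hat                              # reconstruction error
proj = residual @ W_dec.T + acts * dec_norms_sq   # LOO score per latent
keep = (proj > threshold) | (acts == 0)           # keep if score > τ
refined = acts * keep                             # zero out spurious latents

print((acts > 0).sum(), "active latents before,", (refined > 0).sum(), "after")
```

Note that everything is a single batched matrix product, which is why the method is cheap relative to actually re-running the SAE once per removed latent.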
I’ve also not tested it on the real SAEBench, but it should be considerably cheaper to test, as it is an inference-time-only method. The full research report, written entirely by Claude, is here:
https://drive.google.com/file/d/1GSJrrPU6Q_TcwcjbsoF02yTOKvHhZiyj/view?usp=sharing
It’s funny how this is like a reverse of Searle’s Chinese room: a system meant to just shuffle some tokens around can’t help but understand their meaning!
Can we interpret latent reasoning using current mechanistic interpretability tools?
Update! I missed an entire evolutionary branch of the meme: “You can just do stuff” (rather than “things”).
In March 2021, @leaacta tweets: “life hack: you don’t have to explain yourself or understand anything, you can just do stuff”
And gets retweeted by a bunch of people in TPOT.
Then, in June 2022, comedian Rodney Norman posts a video called Go Be Weird with a motivational speech of some sort:
Hey, you know you can just do stuff?
Like, you don’t need anybody’s permission or anything.
You just… you just kind of come up with weird stuff you want to go do, and you just go do it.
Okay? Go be weird.
Okay, bye.
In August 2022, @nat_sharpe_ posts a video where he does a bunch of cool and weird things, with Rodney Norman’s speech as background audio. With this and subsequent posts, he seems to play a major role in popularizing “you can just do stuff” on Twitter.
Sam Altman is also involved in this version of the meme, and in September 2022 he posts: “you can just...do stuff it isn’t more complicated”
On the origins of “you can just do things”
About once every 15 minutes, someone tweets “you can just do things”. It seems like a rather powerful and empowering meme and I was curious where it came from, so I did some research into its origins. Although I’m not very satisfied with what I was able to reconstruct, here are some of the things that I found:
In 1995, Steve Jobs gives the following quote in an interview: “Life can be much broader, once you discover one simple fact, and that is that everything around you that you call life was made up by people that were no smarter than you. And you can change it, you can influence it, you can build your own things that other people can use. Once you learn that, you’ll never be the same again.”
Although he says nothing close to the phrase “you can just do things”, I think it’d be a fair summary of his message.
Fast forward to October 2020, when Twitter user nosilverv tweets: “PSA: if you feel bad you can just DO THINGS until you feel better”
Although this is the first tweet I’ve found[1] that contains the exact phrasing, it doesn’t seem to quite match the current sentiment. It seems to imply that you can do random things and shouldn’t let bad feelings stop you, rather than the “high agency” framing it has today.
A year later, in October 2021, @Neel Nanda publishes the blog post What’s Stopping You?. I think this post points exactly in the “you can just do things” direction, and contains the following quote: “Part of this mindset is taking responsibility—realising that you can do things and influence the world, and that by taking it upon yourself to fix or improve something the world will be better than if you did nothing.”
So close, just missing the “just”!
Then in February 2022, substacker crypticdefinitions publishes a post titled You can just do things, subtitled “Advice probably obvious to everyone but myself”:
“You can just go and try new things. It costs almost nothing and has extremely high potential upside. You can go try a new hobby or skill. You can pay for a singing lesson or squash coaching session or cooking course. You can email someone who seems somewhat interesting about a subject of mutual interest.”
From here on, in the first half of 2022, some people on Twitter seem to start to adopt this phrase, such as @AskYatharth and @m_ashcroft. The phrase seems to slowly but steadily gain popularity in TPOT in the years 2023-2024. In January 2024, Cate Hall writes a blog post on “How to be more agentic” and announces the working title of her book “You can just do things”.
In December 2024, Sam Altman tweets “you can just do things”, gets 25K likes, and the meme seems to break all containment.

[1] Unfortunately, the Twitter search function is completely broken and useless, so there may be earlier tweets I’ve not been able to find.
Current LLMs seem to rarely detect CoT tampering
When working with SAE features, I’ve usually relied on a linear intuition: a feature firing with twice the strength has about twice the “impact” on the model. But while playing with an SAE trained on the final layer, I was reminded that the actual direct impact on the relative token probabilities grows exponentially with activation strength. While a feature’s additive contribution to the logits is indeed linear in its activation strength, the ratio of the probabilities of two competing tokens is equal to the exponential of the logit difference: $\frac{P(A)}{P(B)} = e^{\mathrm{logit}(A) - \mathrm{logit}(B)}$.
If we have a feature that boosts logit(A) and not logit(B), and we multiply its activation strength by a factor of 5.0, this doesn’t 5x its effect on $P(A)/P(B)$, but rather raises its effect to the 5th power. If this feature caused token A to be three times as likely as token B before, it now makes this token $3^5 = 243$ times as likely! This might partly explain why the lower activations of a feature are often less interpretable than the top activations: their direct impact on the relative token probabilities is exponentially smaller.
Note that this only holds for the direct ‘logit lens’-like effect of a feature. This makes this intuition mostly applicable to features in the final layers of a model, as the impact of earlier features is probably mostly modulated by their effect on later layers.
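The arithmetic, as a quick sanity check (the logit values here are hypothetical, chosen only to illustrate the exponentiation):

```python
import math

# Suppose at activation strength 1.0 the feature adds log(3) to logit(A),
# making token A three times as likely as token B (hypothetical numbers).
boost_per_unit = math.log(3.0)

def prob_ratio(activation, base_logit_diff=0.0):
    """P(A)/P(B) = exp(logit(A) - logit(B)) after the feature fires."""
    return math.exp(base_logit_diff + activation * boost_per_unit)

print(prob_ratio(1.0))  # ≈ 3
print(prob_ratio(5.0))  # ≈ 243 = 3**5
```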
Interesting idea, I had not considered this approach before!
I’m not sure this would solve feature absorption though. Thinking about the “Starts with E-” and “Elephant” example: if the “Elephant” latent absorbs the “Starts with E-” latent, the “Starts with E-” feature will develop a hole and not activate anymore on the input “elephant”. After the latent is absorbed, “Starts with E-” wouldn’t be in the list to calculate cumulative losses for that input anymore.
Matryoshka works because it forces the early-indexed latents to reconstruct well using only themselves, whether or not later latents activate. I think this pressure is key to stopping the later-indexed latents from stealing the job of the early-indexed ones.
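In loss form, that pressure looks roughly like this (a minimal NumPy sketch; the prefix sizes and the plain sum over prefixes are my own assumptions, not the exact training objective):

```python
import numpy as np

def matryoshka_recon_loss(x, acts, W_dec, b_dec, prefix_sizes=(16, 32, 64)):
    """Sum of reconstruction losses where each term may only use the first k
    latents, so early-indexed latents must carry the reconstruction alone,
    whether or not later latents activate."""
    loss = 0.0
    for k in prefix_sizes:
        x_hat_k = acts[:, :k] @ W_dec[:k] + b_dec
        loss += ((x - x_hat_k) ** 2).mean()
    return loss

# Toy shapes, just to show the call.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
acts = rng.normal(size=(4, 64))
W_dec = rng.normal(size=(64, 8))
b_dec = np.zeros(8)
print(matryoshka_recon_loss(x, acts, W_dec, b_dec))
```

Because each prefix loss ignores everything past index k, a late latent can never “rescue” an early prefix, which is exactly the anti-stealing pressure.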
Although the code has the option to add an L1 penalty, in practice I set the l1_coeff to 0 in all my experiments (see main.py for all hyperparameters).
I haven’t actually tried this, but I recently heard about focusbuddy.ai, which might be a useful AI assistant in this space.
Learning Multi-Level Features with Matryoshka SAEs
Great work! I have been working on something very similar and will publish my results here some time next week, but can already give a sneak-peak:
The SAEs here were only trained for 100M tokens (1/3 of the TinyStories dataset). The language model was trained for 3 epochs on the 300M-token TinyStories dataset. It would be good to validate these results with more ‘real’ language models and to train SAEs with much more data.
I can confirm that on Gemma-2-2B Matryoshka SAEs dramatically improve the absorption score on the first-letter task from Chanin et al. as implemented in SAEBench!
Is there a nice way to extend the Matryoshka method to top-k SAEs?
Yes! My experiments with Matryoshka SAEs are using BatchTopK.
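For reference, a minimal sketch of the BatchTopK selection step (NumPy; details such as tie handling are simplified, and the shapes are illustrative):

```python
import numpy as np

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest pre-activations across the whole
    batch (rather than k per example), zeroing everything else."""
    n_keep = k * pre_acts.shape[0]
    flat = pre_acts.ravel()
    if n_keep >= flat.size:
        return pre_acts
    thresh = np.partition(flat, -n_keep)[-n_keep]   # n_keep-th largest value
    return np.where(pre_acts >= thresh, pre_acts, 0.0)

rng = np.random.default_rng(0)
pre_acts = rng.normal(size=(8, 32))
sparse = batch_topk(pre_acts, k=4)
print((sparse != 0).sum())  # 32 = 4 * 8 kept (ties aside)
```

Because sparsity is enforced per batch rather than per example, some examples can use more latents than others, which composes naturally with the nested Matryoshka prefixes.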
Are you planning to continue this line of research? If so, I would be interested to collaborate (or otherwise at least coordinate on not doing duplicate work).
This project seems to be trying to translate whale language.
You might enjoy this classic: https://www.lesswrong.com/posts/9HSwh2mE3tX6xvZ2W/the-pyramid-and-the-garden
Rather than doubling down on a single single-layered decomposition for all activations, why not go with a multi-layered decomposition (i.e. some combination of SAE and meta-SAE, preferably as unsupervised as possible)? Or alternatively, maybe the most useful decomposition changes from case to case, and what we really need is lots of different (somewhat) interpretable decompositions and an ability to quickly work out which is useful in context.
There definitely seem to be multiple ways to interpret this work, as also described in SAE feature geometry is outside the superposition hypothesis. Either we need to find other methods and theory that somehow find more atomic features, or we need to get a more complete picture of what the SAEs are learning at different levels of abstraction and composition.
Both seem important and interesting lines of work to me!
Great work! Using spelling is a very clear example of how information gets absorbed into an SAE latent, and indeed in Meta-SAEs we found many spelling/sound-related meta-latents.
I have been thinking a bit about how to solve this problem, and one experiment that I would like to try is to train an SAE and a meta-SAE concurrently, but in an adversarial manner (kind of like a GAN), such that the SAE is incentivized to learn latent directions that are not easily decomposable by the meta-SAE.
Potentially, this would remove the “Starts-with-L”-component from the “lion”-token direction and activate the “Starts-with-L” latent instead. Although this would come at the cost of worse sparsity/reconstruction.
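The objective I have in mind would look something like this (a NumPy sketch; the exact loss form and the `lam` coefficient are hypothetical, and the shapes are toy stand-ins):

```python
import numpy as np

def mse(a, b):
    return ((a - b) ** 2).mean()

def adversarial_losses(x, sae_acts, W_dec, meta_acts, W_meta, lam=0.1):
    """GAN-style pairing: the meta-SAE minimizes its reconstruction of the
    SAE's decoder rows, while the SAE pays its usual reconstruction loss
    minus the meta-SAE's loss, rewarding hard-to-decompose directions."""
    meta_loss = mse(W_dec, meta_acts @ W_meta)              # meta-SAE objective
    sae_loss = mse(x, sae_acts @ W_dec) - lam * meta_loss   # SAE resists decomposition
    return sae_loss, meta_loss

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # model activations
sae_acts = rng.normal(size=(4, 32))  # SAE latent activations
W_dec = rng.normal(size=(32, 8))     # SAE decoder rows
meta_acts = rng.normal(size=(32, 16))
W_meta = rng.normal(size=(16, 8))    # meta-SAE decoder
sae_loss, meta_loss = adversarial_losses(x, sae_acts, W_dec, meta_acts, W_meta)
print(sae_loss, meta_loss)
```

In training, the two losses would be minimized by alternating (or separate) optimizer steps, as in a GAN, with each model only updating its own parameters.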
If you want to go full autonomous research mode you could even have another Claude find adversarial parameters of the SynthSAEBench dataset (within some reasonable constraints) to see where the methods break or would perform worse than baselines.
I imagine you could find some nice robust improvements this way.