Bart Bussmann

Karma: 954

Bart Bussmann 25 May 2026 12:24 UTC
12 points
5
in reply to: Hastings’s comment on: Linch’s Shortform
Fair point! Looking back at the 2024 encyclical there are also ~10 occurrences of this pattern, so this is not much evidence of AI-writing.

Bart Bussmann 25 May 2026 11:52 UTC
2 points
0
in reply to: Linch’s comment on: Linch’s Shortform
There are not only these signs, but also around 10-15 occurrences of the “not only X, but also Y” pattern.

[Linkpost] Interpreting Language Model Parameters

Lucius Bushnaq, Dan Braun, Oliver Clive-Griffin, Bart Bussmann, Nathan Hu, mivanitskiy, Linda Linsefors and Lee Sharkey

5 May 2026 17:37 UTC

162 points

2 comments · 2 min read · LW link

(www.goodfire.ai)

Bart Bussmann 11 Mar 2026 3:25 UTC
1 point
0
in reply to: chanind’s comment on: Letting Claude do Autonomous Research to Improve SAEs
If you want to go full autonomous research mode you could even have another Claude find adversarial parameters of the SynthSAEBench dataset (within some reasonable constraints) to see where the methods break or would perform worse than baselines.
I imagine you could find some nice robust improvements this way.

Bart Bussmann 11 Mar 2026 0:56 UTC
3 points
0
in reply to: chanind’s comment on: Letting Claude do Autonomous Research to Improve SAEs
Yeah, I think this could work during training as well, although you may get some weird dynamics because there is no penalty for highly-activating unhelpful latents to fire less. But I imagine you could at least use it as an auxiliary loss.

I used @Clément Dumas’ research agents scaffold: https://github.com/Butanium/claude-lab/

Bart Bussmann 11 Mar 2026 0:18 UTC
22 points
2
on: Letting Claude do Autonomous Research to Improve SAEs
Great post! Funnily enough, I did the exact same thing on the same task two weeks ago and my army of Claude agents found a different solution, reaching an F1 of 0.989!

Leave-One-Out Refinement
The innovation here is an inference-time method. The idea is that for each active latent, you ask whether removing it would actually hurt reconstruction. Concretely, you compute a projection score $s_i = r \cdot d_i + a_i \lVert d_i \rVert^2$, where $r$ is the reconstruction residual, $d_i$ is the decoder vector, and $a_i$ is the activation. If $s_i \le \tau$, the latent wasn’t contributing meaningfully to reconstruction and gets zeroed out. The whole thing is a single vectorized forward pass (no iteration), and it removes roughly a third of active latents!
```
x_hat = acts @ W_dec + b_dec                     # current reconstruction
residual = x - x_hat                             # reconstruction error
proj = residual @ W_dec.T + acts * dec_norms_sq  # LOO score per latent
keep = (proj > threshold) | (acts == 0)          # keep if score > τ
acts = acts * keep.float()                       # zero out spurious
```
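For readers who want to run the snippet above on its own, here is a minimal self-contained sketch in NumPy. The function name `loo_refine`, the tensor shapes, and the default threshold of 0 are assumptions for illustration and are not taken from the research report.

```python
import numpy as np

def loo_refine(x, acts, W_dec, b_dec, threshold=0.0):
    """Zero out active latents whose leave-one-out score is <= threshold.

    Assumed shapes (hypothetical, for illustration):
      x:     (batch, d_model)    inputs to the SAE
      acts:  (batch, n_latents)  SAE activations
      W_dec: (n_latents, d_model) decoder matrix, one row per latent
      b_dec: (d_model,)          decoder bias
    """
    x_hat = acts @ W_dec + b_dec                     # current reconstruction
    residual = x - x_hat                             # reconstruction error
    dec_norms_sq = np.sum(W_dec ** 2, axis=1)        # squared norm of each decoder row
    proj = residual @ W_dec.T + acts * dec_norms_sq  # LOO score per latent
    keep = (proj > threshold) | (acts == 0)          # inactive latents are left untouched
    return acts * keep                               # zero out low-scoring latents

# Toy example: two orthonormal decoder directions in a 3-d space.
W_dec = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
b_dec = np.zeros(3)
x = np.array([[2.0, -0.1, 0.0]])
acts = np.array([[2.0, 0.5]])   # the second latent fires although x points slightly away from it
refined = loo_refine(x, acts, W_dec, b_dec)
```

In this toy case the score equals the projection of the leave-one-out residual onto each decoder direction, so the second latent, which pushes the reconstruction away from `x`, is the one that gets removed.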
I also haven’t tested it on the real SAEBench, but it should be considerably cheaper to test since it is an inference-time method only. The full research report, written entirely by Claude, is here:

https://drive.google.com/file/d/1GSJrrPU6Q_TcwcjbsoF02yTOKvHhZiyj/view?usp=sharing

Bart Bussmann 7 Feb 2026 19:41 UTC
43 points
12
on: Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning
It’s funny how this is like a reverse Searle’s Chinese room. A system meant to just shuffle some tokens around can’t help but understand its meaning!

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

44 points

1 comment · 9 min read · LW link

Bart Bussmann 15 Dec 2025 22:20 UTC
6 points
0
in reply to: Bart Bussmann’s comment on: Bart Bussmann’s Shortform
Update! I missed an entire evolutionary branch of the meme: “You can just do stuff” (rather than “things”).

In March 2021, @leaacta tweets:
life hack: you don’t have to explain yourself or understand anything, you can just do stuff
And gets retweeted by a bunch of people in TPOT.

Then, in June 2022, comedian Rodney Norman posts a video called Go Be Weird with a motivational speech of some sort:
Hey, you know you can just do stuff?
Like, you don’t need anybody’s permission or anything.
You just… you just kind of come up with weird stuff you want to go do, and you just go do it.
Okay? Go be weird.
Okay, bye.
In August 2022, @nat_sharpe_ posts a video where he does a bunch of cool and weird things, with Rodney Norman’s speech as background sound. Through this video and what followed, he seems to play a major role in popularizing “you can just do stuff” on Twitter.

Sam Altman is also involved in this version of the meme, and in September 2022 he posts:
you can just...do stuff it isn’t more complicated

Bart Bussmann 14 Dec 2025 13:27 UTC
18 points
0
on: Bart Bussmann’s Shortform
On the origins of “you can just do things”
About once every 15 minutes, someone tweets “you can just do things”. It seems like a rather powerful and empowering meme and I was curious where it came from, so I did some research into its origins. Although I’m not very satisfied with what I was able to reconstruct, here are some of the things that I found:

In 1995, Steve Jobs gives the following quote in an interview:
Life can be much broader, once you discover one simple fact, and that is that everything around you that you call life was made up by people that were no smarter than you. And you can change it, you can influence it, you can build your own things that other people can use. Once you learn that, you’ll never be the same again.
Although he says nothing close to the phrase “you can just do things”, I think it’d be a fair summary of his message.

Fast forward to October 2020, where Twitter user nosilverv tweets:
PSA: if you feel bad you can just DO THINGS until you feel better
Although this is the first tweet I’ve found^[1] that contains the exact phrasing, it doesn’t seem to quite match the current sentiment. It seems to imply that you can do random things and shouldn’t let bad feelings stop you, rather than the “high agency” framing it has today.

A year later, in October 2021, @Neel Nanda publishes the blog post What’s Stopping You?. I think this post points exactly in the “you can just do things” direction, and contains the following quote:
Part of this mindset is taking responsibility—realising that you can do things and influence the world, and that by taking it upon yourself to fix or improve something the world will be better than if you did nothing.
So close, just missing the “just”!

Then in February 2022, Substack writer crypticdefinitions publishes a post titled You can just do things:
Advice probably obvious to everyone but myself
You can just go and try new things. It costs almost nothing and has extremely high potential upside. You can go try a new hobby or skill. You can pay for a singing lesson or squash coaching session or cooking course. You can email someone who seems somewhat interesting about a subject of mutual interest.

From here on, in the first half of 2022, some people on Twitter seem to start to adopt this phrase, such as @AskYatharth and @m_ashcroft. The phrase seems to slowly but steadily gain popularity in TPOT in the years 2023-2024. In January 2024, Cate Hall writes a blog post on “How to be more agentic” and announces the working title of her book “You can just do things”.

In December 2024 Sam Altman tweets “you can just do things” with 25K likes and the meme seems to break all containment.
1. ^
  Unfortunately, the Twitter search function is completely broken and useless, so there may be earlier tweets I’ve not been able to find.

Current LLMs seem to rarely detect CoT tampering

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan and Josh Engels

19 Nov 2025 15:27 UTC

56 points

0 comments · 20 min read · LW link

Bart Bussmann 16 Jul 2025 9:17 UTC
3 points
0
on: Bart Bussmann’s Shortform
When working with SAE features, I’ve usually relied on a linear intuition: a feature firing with twice the strength has about twice the “impact” on the model. But while playing with an SAE trained on the final layer I was reminded that the actual direct impact on the relative token probabilities grows exponentially with activation strength. While a feature’s additive contribution to the logits is indeed linear in its activation strength, the ratio of probabilities of two competing tokens $P(A) / P(B)$ is equal to the exponential of the logit difference, $\exp(\mathrm{logit}(A) - \mathrm{logit}(B))$.
If we have a feature that boosts $\mathrm{logit}(A)$ and not $\mathrm{logit}(B)$ and we multiply its activation strength by a factor of 5.0, this doesn’t 5x its effect on $P(A) / P(B)$, but rather raises its effect to the 5th power. If this feature caused token A to be three times as likely as token B before, it now makes this token $3^5 = 243$ times as likely! This might partly explain why the lower activations for a feature are often less interpretable than the top activations. Their direct impact on the relative token probabilities is exponentially smaller.
Note that this only holds for the direct ‘logit lens’-like effect of a feature. This makes this intuition mostly applicable to features in the final layers of a model, as the impact of earlier features is probably mostly modulated by their effect on later layers.
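The arithmetic above can be checked with a small sketch. Everything here is a toy setup assumed for illustration: a three-token vocabulary, a feature that adds `activation * boost_a` to the logit of token A only, and `boost_a` chosen as log 3 so that activation 1 makes A three times as likely as B.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def prob_ratio(activation, boost_a=math.log(3.0), base_logits=(0.0, 0.0, 1.0)):
    """Return P(A)/P(B) when a feature adds activation * boost_a to logit(A) only.

    Token A is index 0 and token B is index 1 (hypothetical toy vocabulary).
    """
    logits = list(base_logits)
    logits[0] += activation * boost_a   # linear effect on the logit
    probs = softmax(logits)
    return probs[0] / probs[1]          # exponential effect on the probability ratio

ratio_at_1 = prob_ratio(1.0)   # approximately 3
ratio_at_5 = prob_ratio(5.0)   # approximately 3**5 = 243
```

The third token is there only to show that the ratio between A and B does not depend on the rest of the vocabulary, since the softmax normalizer cancels.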

Bart Bussmann 5 Apr 2025 7:06 UTC
1 point
0
in reply to: mozzarellapesto’s comment on: Learning Multi-Level Features with Matryoshka SAEs
Interesting idea, I had not considered this approach before!

I’m not sure this would solve feature absorption though. Thinking about the “Starts with E-” and “Elephant” example: if the “Elephant” latent absorbs the “Starts with E-” latent, the “Starts with E-” feature will develop a hole and not activate anymore on the input “elephant”. After the latent is absorbed, “Starts with E-” wouldn’t be in the list to calculate cumulative losses for that input anymore.

Matryoshka works because it forces the early-indexed latents to reconstruct well using only themselves, whether or not later latents activate. I think this pressure is key to stopping the later-indexed latents from stealing the job of the early-indexed ones.
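A rough sketch of that pressure, under stated assumptions: the loss below sums the reconstruction error obtained from each nested prefix of latents, so the first latents are penalized whenever they cannot reconstruct the input on their own. The function name `nested_prefix_loss`, the fixed prefix sizes, and the unweighted sum are illustrative choices, not the exact loss used in the post.

```python
import numpy as np

def nested_prefix_loss(x, acts, W_dec, b_dec, prefix_sizes):
    """Sum of mean squared reconstruction errors over nested latent prefixes.

    Assumed shapes (hypothetical): x (batch, d_model), acts (batch, n_latents),
    W_dec (n_latents, d_model), b_dec (d_model,).
    """
    losses = []
    for m in prefix_sizes:
        # Reconstruct using only the first m latents; later latents are ignored.
        x_hat = acts[:, :m] @ W_dec[:m] + b_dec
        losses.append(float(np.mean(np.sum((x - x_hat) ** 2, axis=1))))
    return sum(losses), losses

# Toy example: two latents with identity decoder directions.
x = np.array([[1.0, 1.0]])
acts = np.array([[1.0, 1.0]])
W_dec = np.eye(2)
b_dec = np.zeros(2)
total, per_prefix = nested_prefix_loss(x, acts, W_dec, b_dec, prefix_sizes=[1, 2])
```

Here the full dictionary reconstructs `x` perfectly, yet the total loss is still nonzero because the first latent alone misses part of the input, which is the incentive that keeps early latents from leaving holes for later ones to fill.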

Bart Bussmann 23 Dec 2024 8:34 UTC
1 point
1
in reply to: Jose Sepulveda’s comment on: Learning Multi-Level Features with Matryoshka SAEs
Although the code has the option to add an L1 penalty, in practice I set the l1_coeff to 0 in all my experiments (see main.py for all hyperparameters).

Bart Bussmann 23 Dec 2024 8:32 UTC
4 points
0
on: Hire (or become) a Thinking Assistant / Body Double
I haven’t actually tried this, but I recently heard about focusbuddy.ai, which might be a useful AI assistant in this space.

Learning Multi-Level Features with Matryoshka SAEs

Bart Bussmann, Patrick Leask and Neel Nanda

19 Dec 2024 15:59 UTC

46 points

6 comments · 11 min read · LW link

Bart Bussmann 14 Dec 2024 14:27 UTC
LW: 35 AF: 20
2
AF
on: Matryoshka Sparse Autoencoders
Great work! I have been working on something very similar and will publish my results here some time next week, but can already give a sneak peek:
The SAEs here were only trained for 100M tokens (1/3 the TinyStories dataset). The language model was trained for 3 epochs on the 300M token TinyStories dataset. It would be good to validate these results with more ‘real’ language models and train SAEs with much more data.
I can confirm that on Gemma-2-2B Matryoshka SAEs dramatically improve the absorption score on the first-letter task from Chanin et al. as implemented in SAEBench!
Is there a nice way to extend the Matryoshka method to top-k SAEs?
Yes! My experiments with Matryoshka SAEs are using BatchTopK.

Are you planning to continue this line of research? If so, I would be interested to collaborate (or otherwise at least coordinate on not doing duplicate work).

Bart Bussmann 28 Nov 2024 15:32 UTC
2 points
0
on: Visible Thoughts Project and Bounty Announcement
Three years later, and we actually got LLMs with visible thoughts, such as DeepSeek, QwQ, and (although partially hidden from the user) o1-preview.
I (Nate) find it plausible that there are capabilities advances to be had from training language models on thought-annotated dungeon runs.

Good call!

Bart Bussmann 23 Nov 2024 19:56 UTC
8 points
3
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Sing along! https://suno.com/song/35d62e76-eac7-4733-864d-d62104f4bfd0

Bart Bussmann 7 Nov 2024 15:31 UTC
5 points
0
in reply to: Daniel Kokotajlo’s comment on: Could we use current AI methods to understand dolphins?
This project seems to be trying to translate whale language.
