Bart Bussmann comments on Letting Claude do Autonomous Research to Improve SAEs

Bart Bussmann 11 Mar 2026 0:18 UTC
22 points
2
Great post! Funnily enough, I did the exact same thing on the same task two weeks ago and my army of Claude agents found a different solution, reaching an F1 of 0.989!

Leave-One-Out Refinement
The innovation here is an inference-time method. The idea is that for each active latent, you ask whether removing it would actually hurt reconstruction. Concretely, you compute a projection score where is the reconstruction residual, is the decoder vector, and is the activation. If , the latent wasn’t contributing meaningfully to reconstruction and gets zeroed out. The whole thing is a single vectorized forward pass (no iteration), and it removes roughly a third of active latents!
```
x_hat = acts @ W_dec + b_dec                     # current reconstruction
residual = x - x_hat                             # reconstruction error
proj = residual @ W_dec.T + acts * dec_norms_sq  # LOO score per latent
keep = (proj > threshold) | (acts == 0)          # keep if score > τ
acts = acts * keep.float()                       # zero out spurious
```
I’ve also not tested it on real SAEBench, but it should be considerably cheaper to test as it is an inference-time method only. The full research report, completely written by Claude, here:

https://drive.google.com/file/d/1GSJrrPU6Q_TcwcjbsoF02yTOKvHhZiyj/view?usp=sharing
- chanind 11 Mar 2026 0:44 UTC
  4 points
  0
  Parent
  That’s such a cool idea, and really impressive F1 score! It also seems like it’s in the same vein of a slight refinement on the initial encoding. Would that not also work during training too? It seems like it would be safe to backprop through that refinement at training time. Did you do anything fancy for the setup, or just prompt Claude to increase the score in a loop?
  - Bart Bussmann 11 Mar 2026 0:56 UTC
    3 points
    0
    Parent
    Yeah, I think this could work during training as well, although you may get some weird dynamics because there is no penalty for highly-activating unhelpful latents to fire less. But I imagine you could at least use it as an auxiliary loss.
    
    I used @Clément Dumas’ research agents scaffold: https://github.com/Butanium/claude-lab/
    - Clément Dumas 11 Mar 2026 1:35 UTC
      3 points
      0
      Parent
      Oh cool to see it worked well on a well defined task!! I’ve been struggling to make it work well enough for my tastes for more open ended task but it gave me enough results that like I just need to scale up some stuff and do the writeup myself. I think the next big improvement will be using Claude teams instead of subagents (subagents have thinking disabled)