Great post! Funnily enough, I did the exact same thing on the same task two weeks ago and my army of Claude agents found a different solution, reaching an F1 of 0.989!
Leave-One-Out Refinement
The innovation here is an inference-time method. The idea is that for each active latent, you ask whether removing it would actually hurt reconstruction. Concretely, you compute a projection score $s_i = r \cdot d_i + a_i \lVert d_i \rVert^2$, where $r = x - \hat{x}$ is the reconstruction residual, $d_i$ is the decoder vector, and $a_i$ is the activation. If $s_i \le \tau$, the latent wasn’t contributing meaningfully to reconstruction and gets zeroed out. The whole thing is a single vectorized forward pass (no iteration), and it removes roughly a third of active latents!
x_hat = acts @ W_dec + b_dec # current reconstruction
residual = x - x_hat # reconstruction error
dec_norms_sq = (W_dec ** 2).sum(dim=-1) # squared norm of each decoder vector
proj = residual @ W_dec.T + acts * dec_norms_sq # LOO score per latent
keep = (proj > threshold) | (acts == 0) # keep if score > τ; inactive latents stay zero
acts = acts * keep.float() # zero out spurious latents
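To make the score concrete, here is a minimal runnable sketch of the same steps on toy data. It uses NumPy in place of PyTorch, and the function name `loo_refine`, the tensor shapes, the toy values, and the threshold of 0.1 are all assumptions for illustration, not taken from the report. The score `residual @ W_dec.T + acts * dec_norms_sq` equals `(residual + a_i * d_i) · d_i`, i.e. the residual you would get without latent `i`, projected onto its decoder vector.

```python
import numpy as np


def loo_refine(x, acts, W_dec, b_dec, threshold):
    """Zero out active latents whose leave-one-out score is <= threshold.

    Assumed shapes: x [batch, d_model], acts [batch, n_latents],
    W_dec [n_latents, d_model], b_dec [d_model].
    """
    x_hat = acts @ W_dec + b_dec                 # current reconstruction
    residual = x - x_hat                         # reconstruction error
    dec_norms_sq = (W_dec ** 2).sum(axis=-1)     # ||d_i||^2 for each latent
    # (residual + a_i * d_i) . d_i, computed for every latent at once
    proj = residual @ W_dec.T + acts * dec_norms_sq
    keep = (proj > threshold) | (acts == 0)      # inactive latents stay zero
    return acts * keep, proj


# Toy data (hypothetical): three orthonormal decoder directions in 3-D.
W_dec = np.eye(3)
b_dec = np.zeros(3)
x = np.array([[2.0, 0.0, 1.0]])      # the input only needs latents 0 and 2
acts = np.array([[2.0, 0.5, 1.0]])   # latent 1 fires spuriously

refined, proj = loo_refine(x, acts, W_dec, b_dec, threshold=0.1)
print(proj)     # leave-one-out scores per latent
print(refined)  # activations after refinement
```

In this toy case the spurious latent gets a score of 0, because the residual exactly cancels its own contribution, so it falls below the threshold and is removed while the two useful latents are kept.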
I haven’t tested it on the real SAEBench yet, but that should be considerably cheaper to do since it is an inference-time-only method. The full research report, written entirely by Claude, is here:
https://drive.google.com/file/d/1GSJrrPU6Q_TcwcjbsoF02yTOKvHhZiyj/view?usp=sharing
That’s such a cool idea, and a really impressive F1 score! It also seems to be in the same vein: a slight refinement of the initial encoding. Wouldn’t that work during training too? It seems like it would be safe to backprop through that refinement at training time. Did you do anything fancy for the setup, or just prompt Claude to increase the score in a loop?
Yeah, I think this could work during training as well, although you may get some weird dynamics because nothing penalizes highly activating but unhelpful latents, so nothing pushes them to fire less. But I imagine you could at least use it as an auxiliary loss.
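One possible reading of this concern (an interpretation, not something stated in the thread): the score `residual @ W_dec.T + acts * dec_norms_sq` simplifies algebraically to `(x - b_dec - sum_{j != i} a_j d_j) · d_i`, which does not depend on latent `i`'s own activation at all. So a loss built only on the score would carry no signal about how hard that latent fires. The NumPy sketch below checks this numerically; the helper name `loo_scores`, the sizes, and the random values are hypothetical.

```python
import numpy as np


def loo_scores(x, acts, W_dec, b_dec):
    """Leave-one-out projection score for every latent (same formula as the post)."""
    residual = x - (acts @ W_dec + b_dec)
    return residual @ W_dec.T + acts * (W_dec ** 2).sum(axis=-1)


rng = np.random.default_rng(0)
W_dec = rng.normal(size=(4, 6))   # 4 latents, 6-dim inputs (hypothetical sizes)
b_dec = rng.normal(size=6)
x = rng.normal(size=(1, 6))
acts = np.array([[0.5, 1.0, 0.0, 2.0]])

boosted = acts.copy()
boosted[0, 1] = 100.0             # make latent 1 fire much harder

s_before = loo_scores(x, acts, W_dec, b_dec)
s_after = loo_scores(x, boosted, W_dec, b_dec)

# Latent 1's own score is unchanged; only the other latents' scores move.
own_score_unchanged = np.isclose(s_before[0, 1], s_after[0, 1])
others_changed = not np.allclose(s_before[0, [0, 2, 3]], s_after[0, [0, 2, 3]])
print(own_score_unchanged, others_changed)
```

If that reading is right, any pressure on an unhelpful latent's magnitude would have to come from elsewhere, for example from the reconstruction loss or a separate sparsity term.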
I used @Clément Dumas’ research agents scaffold: https://github.com/Butanium/claude-lab/
Oh cool to see it worked well on a well-defined task!! I’ve been struggling to make it work well enough for my tastes on more open-ended tasks, but it gave me enough results that I just need to scale up some stuff and do the writeup myself. I think the next big improvement will be using Claude teams instead of subagents (subagents have thinking disabled).