creating surprising adversarial attacks using our recent paper on circuit sparsity for interpretability
we train a model with sparse weights and isolate a tiny subset of the model (our “circuit”) that does a bracket-counting task: the model has to predict whether to output ] or ]]. It’s simple enough that we can manually understand everything about it, every single weight and activation involved, and even ablate away everything else without destroying task performance.
(this diagram is for a slightly different task because i spent an embarrassingly large number of hours making this figure and decided i never wanted to make another one ever again)
in particular, the model has a residual channel delta that activates twice as strongly when you’re in a nested list. it does this by using the attention to take the mean over a [ channel, so if you have two [s then it activates twice as strongly. and then later on it thresholds this residual channel to only output ]] when your nesting depth channel is at the stronger level.
but wait. the mean over a channel? doesn’t that mean you can make the context longer and “dilute” the value, until it falls below the threshold? then, suddenly, the model will think it’s only one level deep!
it turns out that indeed, this attack works really well on the entire sparse model (not just the circuit), and you can reliably trick it.
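the whole failure mode fits in a few lines. everything below (the threshold value, the channel definition) is a hypothetical simplification of the circuit described above, not the paper's code:

```python
# simplified circuit: uniform attention computes the mean of a "[" indicator
# channel over the context, and the output is thresholded on that mean.
# THRESHOLD is an illustrative value, not a number from the paper.
THRESHOLD = 0.22

def depth_channel(tokens):
    # mean over the context of a channel that fires on "[" tokens
    return sum(1.0 for t in tokens if t == "[") / len(tokens)

def predict_closer(tokens):
    return "]]" if depth_channel(tokens) > THRESHOLD else "]"

# on short contexts the mean cleanly separates depth 1 from depth 2:
print(predict_closer(list("[1,2,3")))   # 1/6  ~ 0.17 -> ]
print(predict_closer(list("[[1,2,3")))  # 2/7  ~ 0.29 -> ]]

# the attack: pad the depth-2 context until the mean dilutes below THRESHOLD
print(predict_closer(list("[[1,2,3,4,5,6,7,8,9")))  # 2/19 ~ 0.11 -> ]
```

the last call is still nesting depth 2, but the diluted mean falls below the fixed threshold, so the toy circuit (like the real model) suddenly thinks it's only one level deep.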
in retrospect, this failure is probably because extremely long nested rows are out of distribution on our specific pretraining dataset. but there’s no way i would have come up with this attack by just thinking about the model.
one other worry is maybe this is just because of some quirk of weight-sparse models. strikingly, it turns out that this attack also transfers to similarly capable dense models!
Aside: For me, this paper is potentially the most exciting interpretability result of the past several years (since SAEs). Scaling it to GPT-3 and beyond seems like a very promising direction. Great job!
How well does it generalize to “similarly capable dense models”? Just curious whether you have a graph for that (I haven’t read any part of the paper besides its first page, so feel free to just tell me to go and do that before asking questions like this).
i don’t have a graph for it. the corresponding number is p(correct) = 0.25 at 63 elements for the one dense model i ran this on. (the number is not in the paper yet because this last result came in approximately an hour ago)
the other relevant result in the paper for answering the question of how similar our sparse models are to dense models is figure 33
I agree! I admit I am not optimistic, but I am still very glad to see this.