creating surprising adversarial attacks using our recent paper on circuit sparsity for interpretability
we train a model with sparse weights and isolate a tiny subset of the model (our “circuit”) that does a bracket-counting task: the model has to predict whether to output ] or ]]. It’s simple enough that we can manually understand everything about it, every single weight and activation involved, and even ablate away everything else without destroying task performance.
(this diagram is for a slightly different task because i spent an embarrassingly large number of hours making this figure and decided i never wanted to make another one ever again)
in particular, the model has a residual channel delta that activates twice as strongly when you’re in a nested list. it does this by using the attention to take the mean over a [ channel, so if you have two [s then it activates twice as strongly. and then later on it thresholds this residual channel to only output ]] when your nesting depth channel is at the stronger level.
but wait. the mean over a channel? doesn’t that mean you can make the context longer and “dilute” the value, until it falls below the threshold? then, suddenly, the model will think it’s only one level deep!
it turns out that indeed, this attack works really well on the entire sparse model (not just the circuit), and you can reliably trick it.
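the whole failure mode fits in a few lines. everything below (the threshold value, the channel definition) is a hypothetical simplification of the circuit described above, not the paper's code:

```python
# simplified circuit: uniform attention computes the mean of a "[" indicator
# channel over the context, and the output is thresholded on that mean.
# THRESHOLD is an illustrative value, not a number from the paper.
THRESHOLD = 0.22

def depth_channel(tokens):
    # mean over the context of a channel that fires on "[" tokens
    return sum(1.0 for t in tokens if t == "[") / len(tokens)

def predict_closer(tokens):
    return "]]" if depth_channel(tokens) > THRESHOLD else "]"

# on short contexts the mean cleanly separates depth 1 from depth 2:
print(predict_closer(list("[1,2,3")))   # 1/6  ~ 0.17 -> ]
print(predict_closer(list("[[1,2,3")))  # 2/7  ~ 0.29 -> ]]

# the attack: pad the depth-2 context until the mean dilutes below THRESHOLD
print(predict_closer(list("[[1,2,3,4,5,6,7,8,9")))  # 2/19 ~ 0.11 -> ]
```

the last call is still nesting depth 2, but the diluted mean falls below the fixed threshold, so the toy circuit (like the real model) suddenly thinks it's only one level deep.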
in retrospect, this failure is probably because extremely long nested rows are out of distribution on our specific pretraining dataset. but there’s no way i would have come up with this attack by just thinking about the model.
one other worry is maybe this is just because of some quirk of weight-sparse models. strikingly, it turns out that this attack also transfers to similarly capable dense models!
Aside: For me, this paper is potentially the most exciting interpretability result of the past several years (since SAEs). Scaling it to GPT-3 and beyond seems like a very promising direction. Great job!
How well does it generalize to “similarly capable dense models”? Just curious whether you have a graph for that (I haven’t read any part of the paper besides its first page, so feel free to just tell me to go and do that before asking questions like this).
i don’t have a graph for it. the corresponding number is p(correct) = 0.25 at 63 elements for the one dense model i ran this on. (the number is not in the paper yet because this last result came in approximately an hour ago)
the other relevant result in the paper for answering the question of how similar our sparse models are to dense models is figure 33
I agree! I admit I am not optimistic, but I am still very glad to see this.