But there’s no reason to think that the model is actually using a sparse set of components/features on any given forward pass.
I contest this. If a model wants to implement more computations (for example, logic gates) in a layer than that layer has neurons, the known methods for doing so rely on only a few of those computations being used (that is, receiving a non-baseline input) on any given forward pass.
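A toy illustration of why this sparsity is load-bearing (my own construction in NumPy; the random directions, the 0.5 threshold, and the 200-features-in-50-dimensions setup are illustrative choices, not taken from any particular paper): store more boolean features than dimensions via random near-orthogonal directions, then decode each feature with a thresholded dot product. Readout is near-perfect when only one or two features are active per input, and degrades as more fire at once:

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, d = 200, 50                              # 4x more features than dims
U = rng.normal(size=(n_feat, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)    # random feature directions

def bit_error_rate(k, trials=300):
    """Fraction of features decoded wrongly when k are active at once."""
    errs = 0
    for _ in range(trials):
        active = rng.choice(n_feat, size=k, replace=False)
        x = U[active].sum(axis=0)                # superposed representation
        decoded = (U @ x) > 0.5                  # thresholded dot-product readout
        truth = np.zeros(n_feat, dtype=bool)
        truth[active] = True
        errs += np.sum(decoded != truth)
    return errs / (trials * n_feat)

for k in (1, 2, 5, 10, 20):
    print(f"k={k:2d} active: bit error rate {bit_error_rate(k):.3f}")
```

The interference on each readout grows roughly like sqrt(k/d), so the construction only works while few features fire per input; the same constraint shows up in the known constructions for computing logic gates in superposition.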
You’d actually have to quantify what you mean by “few computations” for your contention to be meaningful. By discussing “more computations than neurons”, you’re already operating at a scale far beyond the number of computations I am thinking of (3072 MLP neurons × 12 layers = 36,864 for GPT-2 Small). The point is that no matter how many components a model is using in tandem, even if you’d consider that number low, you’re still not going to get a faithful explanation by trying to use the smallest possible number of components on any given input.
Take ACDC, for example. I see no reason why we shouldn’t have an explanation that says every head in GPT-2 Small is involved in the computation of an output. I can’t find the thread right now, but I recall a discussion where people argued, roughly, “circuit tracing doesn’t give us a sparse circuit on this task, therefore it fails on this task.” In my opinion this is not a failure; it just exposes that circuit tracing doesn’t explain network behaviour, even when it does return a sparse circuit.
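To make concrete what “every head is involved” would look like, here is a self-contained toy of an ACDC-style pruning loop (a sketch under my own simplifications: the real method operates on a transformer’s computational graph and ablates edges by patching in activations from a corrupted prompt, whereas this toy treats individual weight entries of a small ReLU network as “edges” and zero-ablates them):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4))           # layer 1: 4 inputs -> 6 hidden units
W2 = rng.normal(size=(3, 6))           # layer 2: 6 hidden -> 3 outputs
x = rng.normal(size=4)

def forward(m1, m2):
    h = np.maximum(0.0, (W1 * m1) @ x)           # masked ReLU layer
    return (W2 * m2) @ h

full_out = forward(np.ones_like(W1), np.ones_like(W2))

def divergence(out):
    # KL divergence between the full model's softmax and the pruned one's
    p = np.exp(full_out - full_out.max()); p /= p.sum()
    q = np.exp(out - out.max()); q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

tau = 0.01
m1, m2 = np.ones_like(W1), np.ones_like(W2)
for mask in (m2, m1):                  # prune from outputs back to inputs
    for idx in np.ndindex(mask.shape):
        mask[idx] = 0.0                # try ablating this edge
        if divergence(forward(m1, m2)) >= tau:
            mask[idx] = 1.0            # removal changed the output: keep it

print("edges kept:", int(m1.sum() + m2.sum()), "of", W1.size + W2.size)
```

Note that nothing in this loop forces the result to be sparse: with a strict enough tau, essentially every edge survives, and that outcome is just as valid an answer as a ten-edge circuit.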
And if you have multiple heads with independent high-level behaviours, you’re not going to fix this with an SAE: because the heads vary independently, the number of joint behaviour combinations multiplies across heads, so you’d need exponentially many features to assign one per combination.
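Back-of-the-envelope for the “exponentially many” claim (the behaviour count k is my illustrative assumption; the head counts are GPT-2 Small’s): if each of n heads independently exhibits one of k high-level behaviours, a dictionary with one feature per joint configuration needs k^n entries, versus k·n for a per-head description:

```python
# GPT-2 Small has 12 layers x 12 heads = 144 attention heads.
n_heads = 12 * 12
k = 3                                   # behaviours per head (illustrative)
print(f"joint features needed:    {k**n_heads:.3e}")  # k^n, astronomically large
print(f"per-head features needed: {k * n_heads}")     # k*n, trivially small
```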