Good resource: https://dynalist.io/d/n2ZWtnoYHrU1s4vnFSAQ519J ← Neel Nanda’s glossary.
> What is a feature?
This often gets confused because the early literature doesn’t distinguish well between a property of the input that a model represents and the model’s internal representation of that property. These days we tend to call the former a feature and the latter a latent. E.g. “Not all Language Model Features are Linear” ⇒ not all the representations are linear (it is not a statement about what gets represented).
> Are there different circuits that appear in a network based on your definition of what a relevant feature is?
This question seems potentially confusing. If you use different methods (e.g. supervised vs unsupervised) you are likely to find different results. E.g. in a paper I supervised (https://arxiv.org/html/2409.14507v2) we looked at how SAEs compare to linear probes. That was a comparison of methods for finding representations. I don’t know of any work that does circuit finding with multiple feature-finding methods, though I’d be excited about it.
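To make the "different methods find different things" point concrete, here is a minimal sketch on synthetic data. It is not the method from the paper linked above: the difference-of-means probe is a simple stand-in for a trained linear probe, PCA is a stand-in for an unsupervised method (it is not an SAE), and the data generator, direction names, and noise scales are all assumptions made up for illustration.

```python
import numpy as np


def make_activations(n=2000, dim=16, seed=0):
    """Synthetic 'activations' (hypothetical setup, not real model data).

    A binary feature is written along `feature_dir`; an unrelated,
    higher-variance signal is written along `nuisance_dir`.
    """
    rng = np.random.default_rng(seed)
    # Columns of q are orthonormal, so the two directions are orthogonal.
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    feature_dir, nuisance_dir = q[:, 0], q[:, 1]
    labels = rng.integers(0, 2, size=n)
    nuisance = rng.normal(scale=3.0, size=n)
    acts = (
        np.outer(2.0 * labels - 1.0, feature_dir)   # feature at +/-1
        + np.outer(nuisance, nuisance_dir)          # large unrelated variance
        + rng.normal(scale=0.5, size=(n, dim))      # isotropic noise
    )
    return acts, labels, feature_dir, nuisance_dir


def mean_difference_probe(acts, labels):
    """Supervised: unit vector between the two class means."""
    direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)


def top_principal_component(acts):
    """Unsupervised: direction of maximum variance (first right singular vector)."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def abs_cosine(u, v):
    """Absolute cosine similarity (sign of a direction is arbitrary)."""
    return abs(float(u @ v)) / float(np.linalg.norm(u) * np.linalg.norm(v))


if __name__ == "__main__":
    acts, labels, feature_dir, nuisance_dir = make_activations()
    probe = mean_difference_probe(acts, labels)
    pc = top_principal_component(acts)
    print("probe vs feature direction:", abs_cosine(probe, feature_dir))
    print("PCA   vs feature direction:", abs_cosine(pc, feature_dir))
    print("PCA   vs nuisance direction:", abs_cosine(pc, nuisance_dir))
```

In this toy setup the supervised probe lines up with the labelled feature, while the unsupervised method locks onto whatever has the most variance. Any circuit analysis built on top of these directions would start from different latents depending on which method you used.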
> How crisp are these circuits that appear, both in toy examples and in the wild?
Read ACDC (https://arxiv.org/abs/2304.14997). Generally, not crisp.
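For intuition, here is a toy sketch loosely in the spirit of ACDC-style pruning. It is not the paper's implementation. The "model" is a hypothetical single layer of components that each add a contribution to the output logits, and `make_toy_problem`, `greedy_prune`, the component scales, and the threshold `tau` are all illustrative assumptions. Each component is patched with its activation on a corrupted input and stays patched if the KL divergence from the full model's output grows by less than `tau`.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mean_kl(p, q, eps=1e-12):
    """Mean over the batch of KL(p || q)."""
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))


def make_toy_problem(batch=256, d_in=4, n_classes=5, seed=0):
    """Hypothetical toy model: 4 components with very different importance."""
    rng = np.random.default_rng(seed)
    scales = np.array([1.0, 0.3, 0.05, 0.0])  # last component does nothing
    weights = rng.normal(size=(4, d_in, n_classes)) * scales[:, None, None]
    clean_x = rng.normal(size=(batch, d_in))
    corrupt_x = rng.normal(size=(batch, d_in))
    return clean_x, corrupt_x, weights


def component_outputs(x, weights):
    """Per-component logit contributions, shape (components, batch, classes)."""
    return np.einsum("bd,cdk->cbk", x, weights)


def greedy_prune(clean_x, corrupt_x, weights, tau):
    """Return a boolean mask of the components that survive pruning."""
    clean = component_outputs(clean_x, weights)
    corrupt = component_outputs(corrupt_x, weights)
    reference = softmax(clean.sum(axis=0))  # full model on clean inputs
    kept = np.ones(len(weights), dtype=bool)
    current_kl = 0.0
    for c in reversed(range(len(weights))):
        trial = kept.copy()
        trial[c] = False
        # Pruned components use corrupted activations, kept ones use clean.
        mixed = np.where(trial[:, None, None], clean, corrupt).sum(axis=0)
        trial_kl = mean_kl(reference, softmax(mixed))
        if trial_kl - current_kl < tau:
            kept, current_kl = trial, trial_kl
    return kept


if __name__ == "__main__":
    clean_x, corrupt_x, weights = make_toy_problem()
    for tau in (1e-6, 1e-2, 1e-1, 1.0):
        print(tau, greedy_prune(clean_x, corrupt_x, weights, tau))
```

Sweeping `tau` is one rough way to see how crisp a circuit is: if the set of surviving components keeps shifting as the threshold moves, there is no clean boundary between "in the circuit" and "not in the circuit".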
> What are the best examples of “circuits in the wild” that are actually robust?
The ARENA curriculum probably covers a few. There might also be papers comparing circuit-finding methods on a standard set of circuits that you could look up.
> If I have a tiny network trained on an algorithmic task, is there an automated search method I can use to identify relevant subgraphs of the neural network doing meaningful computation in a way that the circuits are distinct from each other?
Interesting question. See Neel’s thoughts here: https://www.neelnanda.io/mechanistic-interpretability/othello#finding-modular-circuits
> Does this depend on training?
Probably yes. It probably also depends on how the different tasks relate to each other (whether they have shareable intermediate results).
> (Is there a way to classify all circuits in a network (or >10% of them) exhaustively in a potentially computationally intractable manner?)
I don’t know if circuits are a good enough description of reality for this to be feasible, but you might find this interesting: https://arxiv.org/abs/2501.14926