Cool post. I have a neuro background and I'm sometimes asked whether neuro is actually informative for mech interp, so I'm interested in this point about CNC being the current paradigm. A few thoughts:
Are the paradigmatic ideas of mech interp from neuroscience?
You mention some examples of paradigmatic ideas:
The idea that networks “represent” things;
That these “representations” or computations can be distributed across multiple neurons or multiple parts of the network;
That these representations can be superposed on top of one another in a linear fashion, as in the ‘linear representation hypothesis’ (e.g. Smolensky, 1990);
That representations can form representational hierarchies, thus representing more abstract concepts on top of less abstract ones, such as the visual representational hierarchy.
These ideas are all from the 1960s-1990s, right? My impression is that back then the different cognitive sciences, like neuroscience and AI, were more mixed up. For example, Geoffrey Hinton worked in a psychology department briefly, and many of the big names of that era were "cognitive scientists." So in that sense, isn't it a reach to really call these neuroscience ideas?
That being said, another point comes to mind that you didn't mention but that I think can be more firmly called neuroscience: neural networks organize themselves to efficiently encode information (https://en.wikipedia.org/wiki/Efficient_coding_hypothesis).
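Since the efficient coding hypothesis may be unfamiliar to ML readers, here is a toy sketch of one standard computational formalisation of it, sparse coding: data generated from a few sparsely active sources gets re-encoded with a learned dictionary under a sparsity penalty, so that each input uses only a few active code entries. The example, its synthetic data, and its parameter choices are mine, purely for illustration, not anything from the post.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)

# Synthetic data: a few sparsely active "true" features mixed into 16 dimensions.
n_samples, n_dims, n_atoms = 300, 16, 8
true_dict = rng.normal(size=(n_atoms, n_dims))
true_codes = rng.normal(size=(n_samples, n_atoms)) * (rng.random((n_samples, n_atoms)) < 0.1)
X = true_codes @ true_dict + 0.01 * rng.normal(size=(n_samples, n_dims))

# Efficient coding in its sparse-coding form: learn a dictionary such that
# each input is reconstructed from only a few active code entries (L1 penalty).
dl = DictionaryLearning(
    n_components=n_atoms,
    alpha=1.0,                       # sparsity penalty during dictionary fitting
    transform_algorithm="lasso_lars",
    transform_alpha=0.5,             # sparsity penalty when encoding
    random_state=0,
)
codes = dl.fit_transform(X)

# Most code entries should be (near) zero, i.e. the learned code is sparse.
print("fraction of nonzero code entries:", (np.abs(codes) > 1e-6).mean())
```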
My impression is that CS departments mostly set aside the above theoretical ideas until the last five years or so, whereas neuroscience departments kept thinking about them. Additionally, although something like AlexNet used superposition and had polysemantic neurons, those weren't really discussed until the late 2010s. Because neuroscience kept thinking about these ideas while CS departments didn't, maybe it is fair to call them neuroscientific. However, I'm not sure how many theoretical advances in computational neuroscience from 1990-2020 actually contributed to modern mech interp, which would be an argument against calling them neuroscience.
Are neuroscientific methods used in mech interp?
You give some examples of methods:
Max-activating dataset examples are basically what Hubel and Wiesel (1959) (and many researchers since) used to demonstrate the functional specialisation of particular neurons.
Causal interventions, commonly used in mech interp, are the principle behind many neuroscience methods, including thermal ablation (burning parts of the brain), cooling (thus ‘turning off’ parts of the brain), optogenetic stimulation, and so on.
Key data analysis methods, such as dimensionality reduction or sparse coding, that are used extensively in computational neuroscience (and sometimes directly developed for it) are also used extensively in mech interp.
These examples come up a lot when discussing neuroscience and mech interp. However, some of these parallels are more surface-level than they might first appear, and one could probably claim a similar parallel between mech interp and almost any scientific field. For instance, ablating neurons is very common in both LLMs and the brain. But upregulating/downregulating something is the most basic type of experimental manipulation and is used by basically every scientific field. When comparing mech interp and neuroscience, it's generally worth stopping to ask: is the biggest similarity just that the two both work on things called neurons? If you set that aside and abstract a bit, can you make the same parallel to virtually every other scientific field?
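To make concrete how generic that manipulation is on the mech interp side: here is a minimal sketch of ablating or upregulating a single hidden unit in a toy PyTorch MLP (my own illustrative model, not any particular mech interp codebase). You scale that unit's activation during the forward pass and measure how the output changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 2-layer MLP standing in for "the network under study".
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

def scale_neuron(layer: nn.Module, neuron_idx: int, factor: float):
    """Register a forward hook that multiplies one unit's activation by `factor`.

    factor = 0.0  -> ablation ("lesioning" the unit)
    factor > 1.0  -> upregulation
    """
    def hook(module, inputs, output):
        output = output.clone()
        output[:, neuron_idx] *= factor
        return output
    return layer.register_forward_hook(hook)

x = torch.randn(5, 8)
baseline = model(x)

# Ablate unit 3 of the hidden layer and measure the effect on the output.
handle = scale_neuron(model[1], neuron_idx=3, factor=0.0)
ablated = model(x)
handle.remove()

print("mean |Δoutput| from ablating one hidden unit:",
      (ablated - baseline).abs().mean().item())
```

The same hook covers both "lesioning" (factor 0) and "stimulating" (factor > 1), which is partly why the manipulation feels so generic across fields.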
Some other parallels between mech interp and neuro, however, are more niche and, to me, more compelling. For example, I like the use of dimensionality reduction to visualize and search for cycles in activation space.
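As a toy version of that kind of analysis (my own illustration on synthetic "activations", not the procedure from any particular paper), you can project high-dimensional activations down with PCA and check whether points carrying a cyclic label, say days of the week, trace out a ring in the leading components:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in for activations of 7 "day of the week" tokens:
# a circle embedded in two random directions of a 64-d space, plus noise.
n_days, d_model = 7, 64
angles = 2 * np.pi * np.arange(n_days) / n_days
basis = rng.normal(size=(2, d_model))
acts = np.cos(angles)[:, None] * basis[0] + np.sin(angles)[:, None] * basis[1]
acts += 0.05 * rng.normal(size=acts.shape)

# Project to 2-D and check for circular structure: roughly constant radius,
# with the angular order of points matching the label order.
proj = PCA(n_components=2).fit_transform(acts)
radii = np.linalg.norm(proj, axis=1)
order = np.argsort(np.arctan2(proj[:, 1], proj[:, 0]))

print("radius spread:", radii.std() / radii.mean())  # near 0 for a clean ring
print("angular order of days:", order)               # a (possibly reflected) rotation of 0..6
```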
Where would mech interp be today if computational neuroscience had never really existed? Would mech interp have arrived at the exact same methods? For something like ablation or upregulation, I think undoubtedly yes. The tendency to use dimensionality reduction for visualization, maybe a bit less so (or something similar would have been developed, just slightly different). Overall, it seems hard to make clear claims about where mech interp would be today without computational neuroscience.
Are neuroscientific and mech interp findings similar?
You give an example:
And in many cases, the standards of what constitute a legitimate contribution to the field are the same. In both, for instance, a legitimate contribution might include a demonstration that a neuron (whether in a brain or an artificial neural network) appears to be involved in an interesting representation or computation, such as the 'Jennifer Aniston' neuron (Quiroga et al. 2005) or the 'Donald Trump' neuron (Goh et al. 2021).
This is an interesting point that I haven't seen before. I think it's pretty fair and maybe a unique parallel, though it would be more accurate to say that a legitimate contribution is a demonstration that the brain/LLM performs some function using some specific, interesting computation: e.g., hippocampal neurons often represent spatial information in terms of associations between distinct items (e.g., my monitor is above my desk), whereas LLMs do addition with possibly unintuitive circuits/mechanisms. However, when framed that way and distanced a bit from neurons, can't we make a similar parallel to any scientific field? Don't most scientific endeavors try to decompose functions into more precise functions?
In the end, if you want to take some existing scientific field, call it the paradigm for mech interp, and stipulate that the paradigm can't be ML, then I can't imagine a better candidate than computational neuroscience. In that sense the argument is clear, but it seems like a low bar.