Curious about your optimism regarding learned masks as an attribution method. The problem of learning mechanisms that don’t correspond to the model’s mechanisms seems real for circuits (see InterpBench) and would plausibly bite here too (though we should be able to resolve this with benchmarks on downstream tasks once APD is more mature).
We think this may not be a problem here, because the definition of parameter component ‘activity’ is very constraining. See Appendix section A.1.
To count as inactive, it’s not enough for components to have no influence on the output when you turn them off: every point on every possible monotonic trajectory between ‘all components on’ and ‘only the components deemed active on’ has to give the same output. If you (approximately) check for this condition, I think the function that picks the learned masks can more or less be as expressive as it likes, because the sparse forward pass can’t rely on the mask to actually perform any useful computational labor.
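To make that condition concrete, here’s a rough numpy sketch of what an approximate check could look like on a toy linear model. This is purely my illustration, not how APD actually implements things: the component setup, the path sampling, and the tolerance are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: the model's weight matrix is the sum of
# C parameter components P[c], and a mask in [0, 1]^C scales each one.
d, C = 4, 6
P = rng.normal(size=(C, d, d)) / np.sqrt(d)
x = rng.normal(size=d)

def forward(mask):
    # Forward pass with each component scaled by its mask entry.
    W = np.einsum("c,cij->ij", mask, P)
    return W @ x

def approx_inactivity_check(active, n_paths=50, n_steps=8, tol=1e-8):
    """Sample random monotonic trajectories from 'all components on'
    down to 'only the active components on', and require the output
    to stay constant at every point along every sampled path."""
    target = forward(np.ones(C))
    k = int((~active).sum())
    for _ in range(n_paths):
        # Decreasing mask values for the inactive components,
        # with the endpoints 1 and 0 included.
        steps = np.vstack([
            np.ones(k),
            np.sort(rng.uniform(size=(n_steps, k)), axis=0)[::-1],
            np.zeros(k),
        ])
        for step in steps:
            mask = np.ones(C)
            mask[~active] = step
            if np.max(np.abs(forward(mask) - target)) > tol:
                return False
    return True

active = np.zeros(C, dtype=bool)
active[:3] = True
print(approx_inactivity_check(active))  # False: random components all matter
```

The point of the check is that no matter how expressive the mask-picking function is, it can’t smuggle computation through the mask: any component whose removal changes the output anywhere along these paths gets counted as active.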
Conceptually, this is maybe one of the biggest differences between APD and something like, say, a transcoder or a crosscoder. It’s why it doesn’t seem to me like there’d be an analog to ‘feature splitting’ in APD. If you train a transcoder on a d-dimensional linear transformation, it will learn ever sparser approximations of that transformation the larger you make the transcoder dictionary, with no upper limit. If you train APD on a d-dimensional linear transformation, provided it’s tuned right, I think it should learn a single d-dimensional component, regardless of how much larger than d you make the component dictionary: if it tried to learn more components than that to get a sparser solution, it would no longer be able to make the components sum to the original model weights.
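Here’s a toy numeric version of that last point (everything in it is hypothetical): take a generic linear map W, force two components to sum to it, and note that neither can ever be masked off without changing the output.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d))  # target model: a single d-dimensional linear map

# Suppose APD tried to 'feature-split' W into two components.
# Faithfulness forces them to sum to W exactly:
P1 = rng.normal(size=(d, d))
P2 = W - P1  # sum constraint holds by construction

x = rng.normal(size=d)
out_full = W @ x
out_masked = P1 @ x  # sparse forward pass with P2 turned off

# Unless P2 @ x happens to be zero, masking P2 changes the output, so
# P2 counts as active on x. For a generic linear W there is no
# nontrivial split in which either part is ever inactive: the whole
# map is one mechanism, and extra components buy no extra sparsity.
print(np.linalg.norm(out_full - out_masked))  # > 0 almost surely
```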
Despite this constraint on its structure, I think APD plausibly has all the expressiveness it needs: even when there is an overcomplete basis of features in activation space, both circuits-in-superposition math and information theory suggest that you can’t have an overcomplete basis of mechanisms in parameter space. So it seems to me that you can demand that components compose linearly without restricting their ability to represent the structure of the target model, and that demand then sharply limits the ability to sneak in any structure that wasn’t originally in the target model.
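Spelled out in symbols (my notation, not the paper’s), the demand is exact linear composition of the target weights,

```latex
W = \sum_{c=1}^{C} P_c,
```

together with output invariance on every input x: writing A(x) for the set of components deemed active on x,

```latex
f_{\sum_c s_c P_c}(x) = f_W(x)
\quad \text{for all } s \in [0,1]^C \text{ with } s_c = 1 \text{ for } c \in A(x),
```

which is the same as requiring invariance at every point on every monotonic trajectory between the all-on mask and the active-only mask.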