I’ve uploaded the code to GitHub.
Matt Levinson
Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis
Beyond Gaussian: Language Model Representations and Distributions
I’m a new OpenPhil fellow making a mid-career transition from other areas of AI/ML into AI safety, with an interest in interpretability. Given my background, my intuitions bias me toward optimism about mechanistic interpretability in the sense of discovering representations and circuits and trying to make sense of them. But I’ve only just started my deep dive into the literature. I’d be really interested to hear from @Buck and @ryan_greenblatt, and others who share their skepticism, about which directions they prefer to invest their own and their teams’ research efforts in!
My takeaway from the conversation and the comments is to rely more on probes rather than on dictionaries and circuits alone. But I feel pretty certain that’s not the complete picture! I came to this conversation from the Causal Scrubbing thread, which felt exciting to me and like a potential source of inspiration for a mini research project for my fellowship (6 months, including ramp-up/learning). I was a bit bummed to learn that the authors found the main benefit of that project to be informing their decision to move away from mech interp :-D
On a related note, one of the other papers that put me on a path to this thread was this one on Causal Mediation. Fairly long ago at this point, I went through a phase of interest in Pearl’s causal theory, and I thought that paper was a nice example of viewing what’s essentially ablation and activation patching through that lens. Have any folks taken a deeper stab at leveraging some of the more recent theoretical advances in graphical causal theory to do mech interp? Would super appreciate any pointers!
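In case it’s useful context for others reading along, here’s a minimal sketch of what I mean by reading activation patching as mediation analysis. It assumes a HuggingFace GPT-2 purely for illustration; the layer index and prompts are arbitrary placeholders I made up, not anything from the paper. The patched component plays the role of Pearl’s mediator, and the recovered behavior estimates an indirect effect.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

clean = tok("The Eiffel Tower is located in", return_tensors="pt")
corrupt = tok("The Colosseum is located in", return_tensors="pt")

LAYER = 6  # arbitrary choice of which MLP to treat as the mediator
mlp = model.transformer.h[LAYER].mlp
cache = {}

def save_hook(module, inputs, output):
    cache["mlp"] = output.detach()

def patch_hook(module, inputs, output):
    # Splice the clean run's activation (at the last position) into
    # the corrupted run; returning a value replaces the module output.
    patched = output.clone()
    patched[:, -1, :] = cache["mlp"][:, -1, :]
    return patched

# 1) Clean run: cache the mediator's activation.
handle = mlp.register_forward_hook(save_hook)
with torch.no_grad():
    clean_logits = model(**clean).logits
handle.remove()

# 2) Corrupted run with the clean activation patched in.
handle = mlp.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**corrupt).logits
handle.remove()

# Comparing patched_logits against the clean and corrupted baselines
# estimates the indirect effect mediated by this component.
```

That’s the whole correspondence as I understand it: ablation zeroes the mediator, patching sets it to its value under a different input.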
Very cool work! I think scalable circuit finding is an exciting and promising area that could get us to practically relevant oversight capabilities driven by mech interp on a not-too-long timeline!
Did you think at all about ways to better capture interaction effects? I’ve thought about approaches similar to what you share here, and at its core what’s happening is a big lasso regression: the coefficients are embedded in a functional form that makes them “continuousified” indicator variables, contributing to the prediction part of the objective only by turning the node they’re attached to on or off. As is well known, lasso tends to select a single representative element out of a group of strongly interacting variables, or to miss the group entirely when the main effects are weak and only the interactions are meaningful. The stock answer in the classic regression setting is to add an L2 penalty (elastic net) to soften the corners of the L1 penalty contour, but that seems like a poor fit here: we need the really strong sparsity bias of the pure L1 penalty in this context.
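To make that failure mode concrete, here’s a small synthetic sketch (sklearn, made-up data; not your actual setup): two nearly identical features jointly drive the target, lasso typically puts all its weight on one of them, and elastic net’s L2 term spreads the weight across the group, which is exactly the softening of sparsity we can’t really afford in circuit finding.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 500

# A "group": two strongly correlated features that matter jointly.
z = rng.normal(size=n)
x1 = z + 0.05 * rng.normal(size=n)
x2 = z + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, rng.normal(size=(n, 5))])  # plus 5 irrelevant features
y = x1 + x2 + 0.5 * rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Lasso tends to pick one representative (one coef near 2, the other near 0);
# elastic net shares the weight across the correlated pair.
print("lasso:", np.round(lasso.coef_[:2], 2))
print("enet: ", np.round(enet.coef_[:2], 2))
```

In the circuit-finding setting, the analogue would be the learned mask keeping only one node out of a set of nodes that matter only together, or dropping the whole set.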
I don’t mean this as a knock on the work! It’s a really strong effort that would only have been bogged down by trying to tackle the covariance/interaction problem on the first pass. I’m just wondering whether you’ve had discussions or thoughts on that problem for future work along this line of research?