Very cool work! I think scalable circuit finding is an exciting and promising area that could get us to practically relevant oversight capabilities driven by mech interp on a not-too-long timeline!
Did you think at all about ways to better capture interaction effects? I've thought about approaches similar to the one you share here, and at bottom what's happening is a big lasso regression: the coefficients are embedded in a functional form that turns them into "continuousified" indicator variables, which contribute to the prediction part of the objective only by turning the node they're attached to on or off. As is well known, lasso tends to select a single representative element out of a group of strongly interacting variables, or to miss the group entirely when the main effects are weak and only the interactions matter. The stock answer in the classical regression setting is to add an L2 penalty (the elastic net) to soften the corners of the L1 penalty contour, but that seems like a poor fit here: we really need the strong sparsity bias of the pure L1 penalty in this context!
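For concreteness, here's a minimal sketch of the framing I have in mind (toy data, hypothetical names, not your actual objective): a per-node sigmoid gate trained with a faithfulness term plus an L1 penalty on the gates.

```python
import torch

# Toy stand-ins: `acts` plays the role of cached node activations,
# `w_out` a downstream readout whose behavior the circuit should preserve.
torch.manual_seed(0)
n_nodes, n_batch = 16, 128
acts = torch.randn(n_batch, n_nodes)
w_out = torch.randn(n_nodes, 1)
target = acts @ w_out

# Continuous mask logits -> (0, 1) "continuousified" indicators, one per node.
mask_logits = torch.zeros(n_nodes, requires_grad=True)
opt = torch.optim.Adam([mask_logits], lr=1e-2)
lam = 1e-2  # L1 strength: the sparsity knob

for step in range(2000):
    mask = torch.sigmoid(mask_logits)      # soft on/off gate per node
    pred = (acts * mask) @ w_out           # gated forward pass
    faithfulness = (pred - target).pow(2).mean()
    sparsity = mask.abs().sum()            # L1 on the gates, i.e. lasso
    loss = faithfulness + lam * sparsity
    opt.zero_grad(); loss.backward(); opt.step()
```

The grouping failure shows up here whenever two nodes' activations are strongly correlated: the L1 term happily drives one gate to zero while the other carries all the signal. The elastic-net variant would just add something like `alpha * mask.pow(2).sum()` to the penalty, which encourages correlated nodes to enter or leave together, but at the cost of exactly the sparsity bias we're relying on.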
I don't mean this as a knock on the work! It's a really strong effort that would only have been bogged down by trying to tackle the covariance/interaction problem on the first pass. I'm just wondering whether you've had discussions or thoughts on that problem for future work along this line of research?