I haven’t fully worked through the maths, but I think both IG and attribution patching break down here? The fundamental problem is that the discontinuity is invisible to IG, because it only takes derivatives. E.g. the ReLU and Jump ReLU below look identical from the perspective of IG, but not from the perspective of activation patching, I think.
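To make the intuition concrete, here's a hypothetical numerical sketch (not from the original discussion; the jump height, endpoints, and function definitions are my own illustrative choices): a "jump ReLU" that equals ReLU plus a step has the same derivative as ReLU almost everywhere, so a Riemann-sum approximation of IG assigns both functions the same attribution, while a patching-style finite difference sees the jump.

```python
# Illustrative sketch: integrated gradients (IG), approximated as a Riemann
# sum of pointwise derivatives, cannot see a jump discontinuity, while
# activation patching (a finite difference in outputs) can.

def relu(x):
    return max(x, 0.0)

def jump_relu(x):
    # ReLU plus a jump of height 1 at x = 0: same derivative as ReLU
    # almost everywhere, but a different output for x > 0.
    return max(x, 0.0) + (1.0 if x > 0 else 0.0)

def grad(f, x, eps=1e-6):
    # Central finite-difference derivative; away from x = 0 this matches
    # the true derivative of both functions above.
    return (f(x + eps) - f(x - eps)) / (2 * eps)

def integrated_gradients(f, baseline, x, steps=1001):
    # Riemann-sum approximation of IG along the line baseline -> x.
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoints, so we never hit x = 0 exactly
        point = baseline + alpha * (x - baseline)
        total += grad(f, point)
    return (x - baseline) * total / steps

baseline, x = -1.0, 2.0
ig_relu = integrated_gradients(relu, baseline, x)       # ~2.0
ig_jump = integrated_gradients(jump_relu, baseline, x)  # ~2.0: identical to ReLU
patch_relu = relu(x) - relu(baseline)                   # 2.0
patch_jump = jump_relu(x) - jump_relu(baseline)         # 3.0: sees the jump
```

The sampled derivatives of the two functions agree everywhere except at the measure-zero jump point, so IG under-attributes the jump ReLU by the full height of the jump, whereas the patching-style difference captures it.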
Neel Nanda
From the title I expected this to be embarrassing for Eliezer, but that was actually extremely sweet, and good advice!
Great work! Obviously the results here speak for themselves, but I especially wanted to compliment the authors on the writing. I thought this paper was a pleasure to read, and easily a top 5% exemplar of clear technical writing. Thanks for putting in the effort on that.
<3 Thanks so much, that’s extremely kind. Credit entirely goes to Sen and Arthur, which is even more impressive given that they somehow took this from a blog post to a paper in a two week sprint! (including re-running all the experiments!!)
It seems like we have significant need for orgs like METR or the DeepMind dangerous capabilities evals team trying to operationalise these evals, but also regulators with authority building on that work to set them as explicit and objective standards. The latter feels maybe more practical for NIST to do, especially under Paul?
Thanks for the clear explanation, Mamba is more cursed and less Transformer-like than I realised! And thanks for creating and open sourcing Mamba Lens, it looks like a very useful tool for anyone wanting to build on this stuff.
Each element of the matrix, denoted as $A_{ij}$, is constrained to the interval $(-1, 1)$. This means that $-1 < A_{ij} < 1$ for all $i, j$, where $i$ indexes the query positions and $j$ indexes the key positions:
Why is this strictly less than 1? Surely if the dot product is 1.1 and you clamp, it gets clamped to exactly 1
Oh nice, I didn’t know Evan had a YouTube channel. He’s one of the most renowned olympiad coaches and seems highly competent
Thanks! I read and enjoyed the book based on this recommendation
I’m in favour of people having hobbies and fun projects to do in their downtime! That seems good and valuable for impact over the long term, rather than thinking that every last moment needs to be productive.
Thanks for open sourcing this! We’ve already been finding it really useful on the DeepMind mech interp team, and saved us the effort of writing our own :)
Great post! I’m pretty surprised by this result, and don’t have a clear story for what’s going on. Though my guess is closer to “adding noise with equal norm to the error is not a fair comparison, for some reason” than “SAEs are fundamentally broken”. I’d love to see someone try to figure out WTF is going on.
You may be able to notice data points where the SAE performs unusually badly at reconstruction? (Which is what you’d see if there’s a crucial missing feature)
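One way this could look in practice (a hypothetical numpy sketch on synthetic stand-in data, not code from the thread): compute a per-datapoint reconstruction error and flag the worst tail as candidates for a crucial missing feature.

```python
# Hypothetical sketch: flag datapoints where an SAE's reconstruction error is
# unusually large, which could indicate a crucial missing feature.
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 64))                       # stand-in for model activations
recons = acts + rng.normal(scale=0.1, size=acts.shape)   # stand-in for SAE reconstructions
recons[::100] += 2.0                                     # plant a few badly-reconstructed rows

errs = np.square(acts - recons).sum(axis=-1)  # per-datapoint reconstruction error
cutoff = np.quantile(errs, 0.99)              # flag the worst 1% as suspicious
suspects = np.where(errs > cutoff)[0]         # recovers the planted rows here
```

In a real setting, `acts` and `recons` would come from running the model and SAE over a dataset, and you'd inspect the flagged datapoints by hand to see what they have in common.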
What banner?
Re dictionary width: 2**17 (~131K) for most Gated SAEs and 3*(2**16) for baseline SAEs, except for the (Pythia-2.8B, Residual Stream) sites, where we used 2**15 for Gated and 3*(2**14) for baseline, since early runs at the larger widths had lots of feature death. (This’ll be added to the paper soon, sorry!) I’ll leave the other Qs for my co-authors.