We called it vector arithmetic in my Othello paper?
Neel Nanda
Note that the previous post on this topic from Lawrence and Marius includes public comments about interp externalities from several people, including me.
The rough TLDR of my opinion: interp seems reasonably far from being dangerous to capabilities, there’s a ton of work to be done, and the value of building a thriving academic field around it seems super high. Significant barriers to publishing and sharing work are toxic to that, and IMO this consideration dominates over some capabilities externalities. I also think getting capabilities people to pivot from whatever they’re doing now to mech interp seems like a win: it’s not clear whether this is more capabilities-y than the counterfactual, but it’s clearly better for alignment in expectation. I also think junior people get way too stressed about this and should chill out, since your first piece of work is highly unlikely to have any effects, positive or negative.
I’d probably call it an act-add style intervention? Ablation connotes the removal of something. I originally thought of calling it a patch, but patching should mean setting the activation equal to its value on another input, and being able to put in 10x the original value doesn’t count as patching.
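To make the distinction concrete, here’s a toy sketch (my code, purely illustrative names and shapes) of how the three interventions differ on a single residual stream activation:

```python
import torch

d_model = 8
act_original = torch.randn(d_model)   # activation on the prompt being intervened on
act_other    = torch.randn(d_model)   # activation on some other (source) prompt
steer_vec    = torch.randn(d_model)   # the direction being added / removed

# Ablation: remove something, e.g. project the direction out entirely.
act_ablated = act_original - (act_original @ steer_vec) / (steer_vec @ steer_vec) * steer_vec

# Patching: overwrite with the activation from another input, with no free scale factor.
act_patched = act_other.clone()

# Act-add style: add the direction with an arbitrary coefficient (e.g. 10x),
# which is exactly what makes it neither an ablation nor a patch.
act_added = act_original + 10.0 * steer_vec
```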
Multiple choice is just pretty hard! This seems somewhat similar to the results the DeepMind mech interp team found about a moderately complex circuit for multiple choice questions in Chinchilla 70B. It wouldn’t surprise me if many of the smaller models (in this case, 13B-ish) just aren’t good at mapping their knowledge onto the multiple choice syntax, though I expect that to go away for larger models, as you see with GPT-4.
Oh huh, I also misunderstood that, I thought red meant OP or something
I think a cool mechanistic interpretability project could be investigating why this happens! It’s generally a lot easier to work with small models; how strong was the effect for the 7B models you studied? (I found the appendix figures hard to parse.) Do you think there’s a 7B model where this would be interesting to study? I’d love to hear takes on concrete interpretability questions you think might be interesting here.
Whisper seems sufficiently far from the systems pushing the capability frontier (GPT-4 and co) that I really don’t feel concerned about that here
Yeah, I expect reversing the text during pre-training to work—IMO this is analogous to augmenting the data to have an equal amount of “A is B” and “B is A”, which will obviously work. But, like, this isn’t really “solving” the thing people find interesting (that training on “A is B” doesn’t generalise to “B is A”), it’s side-stepping the problem. Maybe I’m just being picky, though; I agree it should work.
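Concretely, the kind of augmentation I have in mind is just something like this (toy sketch, made-up data):

```python
def augment_with_reversals(token_sequences):
    """For every pre-training sequence, also train on its token-reversed copy."""
    augmented = []
    for seq in token_sequences:
        augmented.append(seq)        # original order: "A is B"
        augmented.append(seq[::-1])  # reversed order: "B ... A" now also appears in training
    return augmented

docs = [["Tom", "Cruise's", "mother", "is", "Mary", "Lee", "Pfeiffer"]]
print(augment_with_reversals(docs))
```

You get a model that can answer in both directions, but only because both directions are now literally in the training data.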
Interesting work! Cool to see mech interp done in such different modalities.
Did you look at neurons in other layers in the encoder? I’m curious if there are more semantic or meaningful audio features. I don’t know how many layers the encoder in Whisper tiny has.
Re localisation of attention, did you compute per-head statistics of how far away each head attends? That seems like a natural way to get more info on this—I’d predict that most but not all encoder heads are highly localised (just like in language models!). The fact that k=75 starts to mess up performance demonstrates that such heads must exist, IMO. And it’d be cool to investigate what kinds of attentional features exist—what are the induction heads of audio encoders?!
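Something like the following is the statistic I have in mind (rough sketch, with an assumed tensor layout and illustrative encoder shapes):

```python
import torch

n_heads, seq_len = 6, 1500                        # illustrative shapes only
attn = torch.rand(n_heads, seq_len, seq_len)      # stand-in for real attention patterns
attn = attn / attn.sum(dim=-1, keepdim=True)      # normalise over key positions

positions = torch.arange(seq_len)
dist = (positions[None, :] - positions[:, None]).abs().float()  # |key - query|

# Attention-weighted mean distance per head; highly localised heads score low.
mean_dist_per_head = (attn * dist[None]).sum(dim=(-1, -2)) / seq_len
print(mean_dist_per_head)
```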
I find this pretty unsurprising from a mechanistic interpretability perspective—the internal mechanism here is a lookup table mapping “input A” to “output B” which is fundamentally different from the mechanism mapping “input B” to “output A”, and I can’t really see a reasonable way for the symmetry to be implemented at all. I made a Twitter thread explaining this in more detail, which people may find interesting.
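The toy analogy I have in mind (my framing, not the paper’s code): storing a forward lookup table gives you nothing in the reverse direction unless you separately build the reverse map.

```python
facts = {"Tom Cruise's mother is": "Mary Lee Pfeiffer"}

print(facts["Tom Cruise's mother is"])   # A -> B: comes for free

# B -> A needs a genuinely different mechanism; nothing about storing the
# forward table gives you the reverse lookup automatically.
reverse = {value: key for key, value in facts.items()}
print(reverse["Mary Lee Pfeiffer"])
```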
Yeah, I agree that doing work that gets other people excited about sparse autoencoders is arguably more impactful than marginal compute savings; I’m just arguing that compute savings do matter.
I’m pretty concerned about the compute scaling of autoencoders to real models. I predict that both the data needed and the number of features scale superlinearly in d_model, which seems to scale badly to a frontier model.
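To gesture at why this worries me, here’s some purely illustrative arithmetic (the 1.5 exponent is a made-up stand-in for “superlinear”, not a measured scaling law):

```python
def sae_cost_per_token(d_model, feature_exponent=1.5):
    n_features = d_model ** feature_exponent
    # Encoder + decoder matmuls each cost ~d_model * n_features per token.
    return 2 * d_model * n_features

small, frontier = 512, 8192
print(sae_cost_per_token(frontier) / sae_cost_per_token(small))
# 16x larger d_model -> 16^2.5 = 1024x the per-token SAE cost, before even
# accounting for (plausibly) needing more training tokens as well.
```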
Cool work! I really like the ACDC on the parenthesis feature part, I’d love to see more work like that, and work digging into exactly how things compose with each other in terms of the weights.
This is a really great post, thanks for writing it! I learned interesting things, and love your writing style. Man, biology is weird
Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees
One year is shockingly short; why such a fast turnaround?
And great post! I’m excited to see responsible scaling policies becoming a thing!
That might work, though you could easily end up with the final model not actually faithfully using its world model to make the correct moves—if there are more efficient/correct heuristics, there’s no guarantee it’ll use the expensive world model, or that it won’t just forget about it.
What’s the rationale behind the “Centre for AI Policy” name? It feels like it claims a bunch of credibility to speak for the field of AI policy that I don’t think you’ve (yet!) earned, and I’m concerned it may make the work of other people in the AI policy space harder.
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy
I’m really excited to see this come out! I initially found the counter-example pretty hard to get my head around, but it now feels like a deep and important insight when thinking about using gradient descent based techniques like DAS safely.
I feel skeptical of this in the context of problems in my life. It feels like there’s a ton of randomness, so I never have a problem ALL of the time given a certain trigger; I just have the problem a bunch of the time.