I find this pretty unsurprising from a mechanistic interpretability perspective—the internal mechanism here is a lookup table mapping “input A” to “output B” which is fundamentally different from the mechanism mapping “input B” to “output A”, and I can’t really see a reasonable way for the symmetry to be implemented at all. I made a Twitter thread explaining this in more detail, which people may find interesting.
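To make the asymmetry concrete, here's a minimal numpy sketch (entirely my own toy illustration, not from the paper) of a linear associative memory storing “A → B” facts, a rough stand-in for the kind of key-value lookup an MLP layer can implement. Querying with an A-direction recovers the corresponding B-direction, but querying the same weights with a B-direction recovers nothing:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_facts = 256, 20

# Random (roughly orthogonal) directions for the "A" entities and the "B" entities.
keys = rng.standard_normal((n_facts, d)) / np.sqrt(d)    # "A is ..." directions
values = rng.standard_normal((n_facts, d)) / np.sqrt(d)  # "... is B" directions

# A linear associative memory storing the facts A_i -> B_i
# (a toy stand-in for an MLP layer acting as a key-value lookup).
W = sum(np.outer(v, k) for k, v in zip(keys, values))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

i = 3
print("forward A->B:", cos(W @ keys[i], values[i]))   # close to 1: fact retrieved
print("reverse B->A:", cos(W @ values[i], keys[i]))   # close to 0: nothing there
```

The weights that implement the forward lookup just contain no machinery for the reverse lookup; you'd need a separately learned set of weights for “B → A”.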
Yeah, I agree that doing work that gets other people excited about sparse autoencoders is arguably more impactful than marginal compute savings, I’m just arguing that compute savings do matter.
I’m pretty concerned about the compute scaling of autoencoders to real models. I predict that both the amount of data needed and the number of features scale superlinearly in d_model, which seems to scale badly to frontier models.
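As an illustrative back-of-envelope (every exponent and constant below is an assumption I'm making up to make the worry concrete, not a measured scaling law):

```python
# Illustrative back-of-envelope only: the exponents and constants are assumed,
# not measured scaling laws.

def sae_training_flops(d_model, feature_exp=1.5, token_exp=1.5,
                       expansion_at_512=8, tokens_at_512=2e9):
    """Very rough FLOPs to train one sparse autoencoder on one layer's residual
    stream, assuming the number of dictionary features and the number of
    training tokens both scale as d_model ** exp (the superlinear case)."""
    n_features = expansion_at_512 * 512 * (d_model / 512) ** feature_exp
    n_tokens = tokens_at_512 * (d_model / 512) ** token_exp
    # Encoder + decoder are each a d_model x n_features matmul;
    # ~6 FLOPs per weight per token covers forward + backward.
    return 6 * 2 * d_model * n_features * n_tokens

for d in [512, 4096, 16384]:
    print(d, f"{sae_training_flops(d):.2e}")
```

Under these assumed exponents the per-layer SAE training cost grows roughly as d_model ** 4, which is the thing that worries me at frontier scale.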
Cool work! I really like the ACDC on the parenthesis feature part, I’d love to see more work like that, and work digging into exactly how things compose with each other in terms of the weights.
This is a really great post, thanks for writing it! I learned interesting things, and love your writing style. Man, biology is weird
Trustees serve one-year terms and future Trustees will be elected by a vote of the Trustees
One year is shockingly short; why such a fast turnaround?
And great post! I’m excited to see responsible scaling policies becoming a thing!
That might work, though you could easily end up with the final model not actually faithfully using its world model to make the correct moves: if there are more efficient heuristics that also get the moves right, there’s no guarantee it’ll use the expensive world model rather than just forgetting about it.
What’s the rationale behind the “Centre for AI Policy” name? It feels like it claims a bunch of credibility to speak for the field of AI policy that I don’t think you’ve (yet!) earned, and I’m concerned it may make the work of other people in the AI policy space harder.
Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy
I’m really excited to see this come out! I initially found the counter-example pretty hard to get my head around, but it now feels like a deep and important insight when thinking about using gradient descent based techniques like DAS safely.
An Interpretability Illusion for Activation Patching of Arbitrary Subspaces
Great post! I think a really cool research direction in mech interp would be looking for alignment relevant circuits in a misaligned model—it seems like the kind of concrete thing we could do mech interp on today (if we had such a model), and like it would teach us a ton about what to look for when eg auditing a potentially misaligned model. I’d love to hear about any progress you make, and possible room for collaboration.
Oh that’s fascinating, thanks for sharing! In the model I was studying I found that intervening on the token direction mattered a lot for ending lines after 80 characters. Maybe there are multiple directions...? Very weird!
Cool work! I’ve been interested in seeing a mech interp project trying to find the circuits behind sycophancy; it seems like a good microcosm for social modelling circuitry, which seems like a first step towards deception circuitry. How good is LLaMA 7B at being sycophantic? And do you have any thoughts on what might be good prompts for understanding sycophancy circuitry? I’m particularly interested in prompts that are modular, with key words that can be varied to change the prompt from one valence to another while keeping the rest of it intact.
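To illustrate the kind of modular prompt I mean (a made-up example, not something I’ve tested), something where flipping a single keyword flips the valence while everything else stays fixed:

```python
# A made-up example of a "modular" sycophancy prompt: only the {valence}
# keyword changes, the rest of the prompt is held constant.
TEMPLATE = (
    "Human: I absolutely {valence} the new proposal. What do you think of it?\n"
    "Assistant: I think the proposal is"
)

for valence in ["love", "hate"]:
    print(TEMPLATE.format(valence=valence))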
Mech Interp Puzzle 2: Word2Vec Style Embeddings
Cool concept! Thanks for making it. And that’s a lovely looking website, especially for just three weeks!
The core problem with this kind of thing is that neurons are often not actually monosemantic: models use significant superposition, so a single neuron can mean many different things. This is a pretty insurmountable problem; I don’t think it sinks the concept of the website, but it seems valuable to eg have a “this seems like a polysemantic mess” button.
Bug report: in OWT, apostrophes and quote marks are often tokenized as two separate tokens, because of a dumb bug in the tokenizer (they’re weird unicode characters that it doesn’t recognise, so each gets tokenized as two separate bytes). This looks confusing, eg here: (the gap between the name and the s is an apostrophe). It’s unclear how best to deal with this; my recommendation is to show an empty string and then an apostrophe/quotation mark, with a footnote on hover explaining it.
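For debugging, a quick sanity check is to compare how a straight apostrophe vs the curly unicode one (U+2019) tokenizes. This assumes a GPT-2-style byte-level BPE tokenizer loaded via Hugging Face transformers; whether the curly one actually splits into multiple byte tokens will depend on the exact tokenizer and vocab you’re using:

```python
# Compare tokenization of a straight vs curly apostrophe; the exact behaviour
# depends on which tokenizer/vocab the site is using.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for s in ["Neel's model", "Neel\u2019s model"]:  # straight vs curly apostrophe
    ids = tok(s)["input_ids"]
    print(repr(s), ids, [tok.decode([i]) for i in ids])
```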
Great answer! I didn’t follow the fine details of your response, but “there is no meaningful zero point” was the key point I wanted to get across. This is the first correct answer I’ve seen!
My guess is that messiness is actually a pretty inherent part of the whole thing? Models have no inherent reason to want to solve the problem with a single clean solution: if they can simultaneously use the features “nth item in the list” and “labelled A” and even “has two incorrect answers before it”, why not?
Interesting work! Cool to see mech interp done in such different modalities.
Did you look at neurons in other layers in the encoder? I’m curious if there are more semantic or meaningful audio features. I don’t know how many layers the encoder in Whisper tiny has.
Re localisation of attention, did you compute per-head statistics of how far away each head attends? That seems like a natural way to get more info on this; I’d predict that most but not all encoder heads are highly localised (just like language models!). The fact that k=75 starts to mess up performance demonstrates that such heads must exist, IMO. And it’d be cool to investigate what kind of attentional features exist: what are the induction heads of audio encoders?!
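Concretely, the statistic I have in mind is the attention-weighted average distance between query and key positions, per head. A minimal PyTorch sketch (the function name and tensor layout are my own assumptions), assuming you can extract the encoder attention pattern as a [head, query, key] tensor:

```python
import torch

def mean_attention_distance(attn: torch.Tensor) -> torch.Tensor:
    """attn: [n_heads, n_queries, n_keys] attention pattern (each row sums to 1).

    Returns, per head, the attention-weighted average |query_pos - key_pos|,
    averaged over query positions: a crude "how local is this head" statistic."""
    n_heads, n_q, n_k = attn.shape
    dist = (torch.arange(n_q)[:, None] - torch.arange(n_k)[None, :]).abs().float()
    return (attn * dist).sum(-1).mean(-1)  # shape [n_heads]
```

Heads with a small value here are the local ones; plotting this per head and layer would make the “most but not all heads are localised” prediction easy to check.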