Neel Nanda comments on Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

Neel Nanda 7 Dec 2024 1:58 UTC
I’m not super sure what I think of this project. I endorse the seed of the idea re “let’s try to properly reverse engineer what representing facts in superposition looks like” and think this was a good idea ex ante. Ex post, I consider our results fairly negative, and have mostly concluded that this kind of thing is cursed and we should pursue alternate approaches to interpretability (eg transcoders). I think this is a fairly useful insight! But it’s also something I inferred from various other bits of data. Overall I think this was a fairly useful conclusion re updating away from ambitious mech interp, and it has had a positive impact on my future research, though it’s harder to say if this impacted others (beyond the general sphere of people I mentor/manage).

I think the circuit analysis here is great: a decent case study of what high quality circuit analysis looks like, and one of the studies of factual recall I trust most (though I’m biased). It also introduced some new tricks that I think are widely useful, like using probes to understand when information is introduced vs signal boosted, and using mechanistic probes to interpret activations without needing training data. However, I largely haven’t seen much work build on this, beyond a few scattered examples, which suggests it hasn’t been too impactful. I also think this project took much longer than it should have, which is a bit sad.

Though, this did get discussed in a 3Blue1Brown video, which is the most important kind of impact!