The set of motivated, intelligent people with the relevant skills to do technical alignment work in general, and mechanistic interpretability in particular, has a lot of overlap with the set of people who can do capabilities work. That includes many academics, as well as students in master's and PhD programs. One way or another they're going to publish; would you rather it be alignment/interpretability work or capabilities work?
It seems to me that speeding up alignment work by several orders of magnitude is unlikely to happen without co-opting a significant number of existing academics, labs and students in related fields (including mathematics and physics in addition to computer science). This is happening already, within ML groups but also physics (Max Tegmark’s students) and mathematics (e.g. some of my students at the University of Melbourne).
I have colleagues in my department publishing stacks of papers at CVPR, NeurIPS, etc., work which this community might call capabilities work. If I succeeded in convincing them to do some alignment or mechanistic interpretability work, they would do it because it was intrinsically interesting or likely to be high status, and they would gravitate towards the kinds of work that are dual-use. Relative to the status quo that seems like progress to me, but I'm genuinely interested in the opinion of people here. Real success in this recruitment would, among other things, dilute the power of LW norms to influence things like publishing.
On balance it seems to me beneficial to aggressively recruit academics and their students into alignment and interpretability.