Should we publish mechanistic interpretability research?
TL;DR: Multiple people have raised concerns about current mechanistic interpretability research having capabilities externalities. We discuss to what extent, and which kinds of, mechanistic interpretability research we should publish. The core question we want to explore with this post is thus to what extent the statement “findings in mechanistic interpretability can increase capabilities faster than alignment” is true and should be a consideration. For example, foundational findings in mechanistic interpretability may lead to a better understanding of NNs, which often straightforwardly generates new hypotheses to advance capabilities.
We argue that there is no general answer and the publishing decision primarily depends on how easily the work advances alignment in relation to how much it can be used to advance capabilities. We recommend a differential publishing process where work with high capabilities potential is initially circulated only among a small number of trusted people and organizations and work with low capabilities potential is published widely.
Related work: A note about differential technological development, Current themes in mechanistic interpretability research, Thoughts on AGI organization and capabilities work, Dan Hendrycks’s take, etc.
We have talked to lots of people about this question and lost track of who to thank individually. In case you talked to either Marius or Lawrence about this, thank you!
The basic case for publishing
Let’s revisit the basic cases for publishing.
Alignment is probably hard. To get even close to a solution, it likely requires many people working together, coordinating their work and patching together different approaches. If the work isn’t public, this becomes much harder and thus a solution to alignment becomes less likely.
More people can engage with the work and build on it. Especially people who want to get into the field or are less connected might not be able to access documents that are only shared in small circles.
It gives more legitimacy to the work of organisations and individuals. For people who are not yet established in the field, publishing their work is the most obvious way to get noticed by an organisation. For academics, publications are the most relevant resource for their careers and organisations can generate more legitimacy by publishing their work (e.g. for grantmakers or other organisations they want to interact with).
It is a form of movement building. If work on mechanistic interpretability is regularly presented at ML conferences and is available on arXiv, it is more likely that people outside of the alignment field notice that the field exists and get interested in it.
Publication leads to accountability and feedback. If you know you will publish something, you put more effort into explaining it well and ensuring that your findings are robust. Furthermore, it provides a possibility for other researchers to engage with your work and give you feedback for improvement or future research directions.
In addition, mechanistic interp seems especially well suited for publication in classic academic venues since it is less speculative than other AI safety work and overlaps with established academic fields.
Thus, publication seems robustly positive as long as it doesn’t advance capabilities more than alignment (which is often hard to predict in advance). The crux of this post, therefore, lies mainly in the possible negative externalities of publications and how they trade off against the alignment benefits.
Capabilities externalities of mechanistic interpretability
The primary reason to think that mechanistic interpretability has large capabilities externalities is that understanding a system better makes improvements easier. Historically, a lot of applications were a downstream effect of previous foundational research. This seems true for a lot of scientific advances in general, e.g. improved understanding of biology led to better medicine, but also for ML applications in particular. We decompose the capabilities externalities into multiple different paths.
Direct insights: Interpretability work might directly produce an insight that increases capabilities, e.g. an attempt to make a network more interpretable might make it more capable. Empirically, none of the attempts to make networks more interpretable that we are aware of has yet made networks more efficient or capable. (That being said, some work like the SoLU paper does show that some gains in interpretability can come at near zero loss in capabilities.)
Motivation for new insight: Interpretability work can generate an insight that is then used as a motivation for capabilities work. Two recent capability papers (see here and here) have cited the induction heads work by Anthropic as a core motivation (confirmed by authors).
Supporting capabilities work: Interpretability tools can often speed up the iterations of capability work. For example, if you were able to reliably detect whether a network has learned a specific concept, or to understand the circuit responsible for that behavior, you might be able to iterate much faster when fine-tuning the model to have that property. In general, better and more detailed evaluations from interpretability shorten iteration time.
It’s worth noting that historically, most capability advances were not the result of a detailed understanding of NNs—rather, they were the result of a mix of high-level insights and trial and error. In particular, it seems that existing work interpreting concrete networks has been counterfactually responsible for very few capability gains (other than the two cases cited above).
Unfortunately, we think it’s likely that the potential capability implications of interpretability are proportional to its usefulness for alignment, i.e. better interpretability tools are both better for safety and also increase capabilities more since they yield better insights. Thus, the historic lack of capabilities advances from mechanistic interpretability potentially just indicates that interpretability is too far behind the state of the art to be useful at the moment. But that could change once it catches up.
Other possible reasons why publishing could be net negative
We think these are much less relevant than the capabilities externalities but still worth mentioning.
Leaking the alignment test set. If we develop powerful tools and benchmarks to detect deception in real-world models, organizations might start to train against these tools and benchmarks. Thus, their models might overfit to the specific subset of ideas we came up with without being robust against the kinds of deception we haven’t thought of. It’s important to keep some amount of safety or alignment evaluations “in reserve”, so as to get a better sense of how the models will perform in deployment.
General hype around AI. Just like GPT-N creates hype around AI, understanding the inner workings of NNs can be fascinating and gets more people excited about the field of AI. However, we think the hype coming from mechanistic interpretability is much smaller than from other work. Furthermore, hype about mechanistic interpretability could even move some researchers from generic capabilities work to interpretability.
Future AIs might read some of our work. Future models will likely be able to browse the internet and understand the content that describes how they work and how we measure their safety. While there are ways to prevent that, e.g. filtering the content that AIs can read, we should expect none of these filters to be entirely bullet-proof.
It’s hard/impossible to reverse a publication decision. Once an insight is published on the internet, it is very hard or impossible to take back. Some people might have downloaded the post, or they can find the article with the help of the Wayback Machine. Even if we were able to successfully remove all of the information from the internet, someone might just remember the method and reimplement it from scratch.
Most people don’t care about alignment. Most people who work on ML do not care about alignment; they either directly care about improving the capabilities of ML systems or are otherwise incentivized to do so (e.g. publications, job opportunities, etc.). Therefore, if you publish to a large audience (e.g. Twitter or an ML conference), most of the people who read your work will care more about advancing capabilities than alignment (the alignment people might read it more carefully though).
Impactful findings are heavy-tailed. Most insights are unlikely to lead to major advances, neither for alignment nor for capabilities. However, some findings have disproportionately large effects. For example, the original transformer paper likely had major consequences for the entire field of AI. The upsides are also heavy-tailed, which makes this argument hard to evaluate in practice. For example, a major breakthrough in understanding superposition could lead to huge leaps in our ability to understand the computations of LLMs, but it could also lead to more efficient architectures that, e.g., reduce the compute cost of transformers by an OOM.
The main takeaway from this section is that a small number of publishing decisions will carry most of the impact. For example, if you publish 5 minor things and it goes well (i.e. doesn’t lead to harm), you should not conclude that a major publication will also go well.
The previous considerations were presented in a vacuum, i.e. we presented effects that are plausibly negative. However, the alternative world could be even worse and we thus look at counterfactual considerations.
Interpretability might be very important to alignment and it’s thus worth taking the trade-off. Even if it is true that mechanistic interpretability can lead to additional capabilities, it might still be better on net to publish the findings. This holds if we believe, for example, that deception is the problem that will ultimately lead to catastrophic failures and that understanding the internal cognition of models is necessary to solve deception. Additionally, interpretability might be especially well-suited for fieldbuilding. Under these assumptions, it might be more important to spread information about interpretability even if it carries some risk, because the counterfactual would be worse.
Capabilities that rely on a better understanding of the model are better for alignment. Let’s say, for example, that mechanistic interpretability leads to a breakthrough in understanding superposition in transformers. Assume that, as a result, transformers would be more compute-efficient but also more interpretable. A world in which we build more efficient and more interpretable models might be better than a world in which we create capabilities through scale and trial and error. However, these more interpretable architectures would likely be quickly combined or modified in trial-and-error ways that reduce their interpretability.
Obviously, publishing can mean very different things. In order of increasing effort and readership size, publishing can mean:
Publishing to the floating Google Doc economy: This means sending a Google Doc or private preprint that describes your findings to a small circle of people who you expect to benefit from seeing the research and trust to use it confidentially. This inner circle usually does not include people who are new to the field.
Posting on LW/AF: While not everyone who reads LW/AF cares about alignment, it is most likely the public venue with the highest ratio of people who care about alignment to people who don’t.
Publishing to arXiv: If you want to be recognized by anyone in academia, your paper has to be at least on arXiv. Blog posts, no matter how good, usually do not count as research in traditional academia.
Submitting to a conference: A publication at a top conference will give you a reputation among academics, and your readership is likely much wider and more heterogeneous than on LW/AF.
A caveat: while the average LW/AF post has relatively low readership and a high alignment/non-alignment ratio, viral LW/AF posts can reach a wide audience. For example, after its tweet thread went viral, Neel Nanda’s grokking work was widely circulated amongst the academic interpretability community. Given the hit-based nature of virality, there isn’t a particular place to put it on the hierarchy, but it’s worth noting that Twitter threads can sometimes greatly increase the publicity of a post.
Be cautious by default; it’s hard to unpublish. Assume you had a very foundational insight about how NNs work. You are excited and publish it somewhere on the internet. Secondhand, you hear that people at BrainMind™ saw your idea and used it to make their current models more capable. There is no way to unpublish; the internet remembers.
What reach do you want to have with that post? Most of the direct alignment benefits for any project come from a relatively small community of alignment-oriented researchers who work on similar topics. All of the other benefits are either more vague, e.g. building a community of people who do interpretability, or more personal, e.g. the need for citations in an academic career.
Assess the trade-off. Different projects imply different policies. Intuitively, more foundational insights have higher potential downsides because they can lead to major changes in architectures or training regimes. Detection-focused techniques such as probing seem to carry less risk because they usually don’t have major implications, and projects that find more evidence for a previous finding (e.g. “we found another circuit”) carry very low risk.
In case you’re unsure, ask someone who is more senior. Assessing a specific project is hard, especially if you’re less senior. Thus, reaching out to someone who has more expertise or reading through the opinions below might give you a sense of the different considerations.
Don’t worry too much about it if it’s your first project and unmentored. Sometimes people who want to get started in the field are paralyzed because they think their project might be infohazardous. We think it is very unlikely that the first project people try (especially without mentorship) has the potential to cause a lot of harm. In case you’re really convinced that there is cause for concern, reach out to someone more senior.
Different people have very different opinions about this question and it seems hard to combine them. Thus, we decided to ask multiple people in the alignment scene about their stance on this question.
It seems really straightforwardly good to me to publish (almost all) mechanistic interpretability work; it’s so far down the list of things we should be worrying about that by default I assume that objections to it are more strongly motivated by deontological rather than impact considerations. I’m generally skeptical of applying new deontological rules to decisions this complex; but even if we’re focusing on deontology, there are much bigger priorities to consider (e.g. EA’s relationship to AI labs).
I enjoyed these thoughts, and there are a few overarching thoughts I have on it.
Is mechanistic interpretability the best category to ask this question about? On one hand, this may be a special case of a broader point involving research in the science of ML. Worries about risks from basic research insights do not seem unique to interpretability. On the other hand, different types of (mechanistic) interpretability research seem likely to have very different implications for safety vs. risky capabilities.
Interpretability tools often trade off with capabilities. In these cases, it might be extra important to publish. For example, disentanglement techniques, adversarial training, modularity techniques, bottlenecking, compression, etc. are all examples of things that tend to harm a network’s performance on the task while making them more interpretable. There are exceptions to this like model editing tools and some architectures. But overall, it seems that in most existing examples, more interpretable models are less capable.
Mechanistic interpretability may not be the elephant in the room. In general, lots of mechanistic interpretability work might not be very relevant for engineers – either for safety or capabilities. There’s a good chance this type of research might not be key either way, especially on short timelines. Meanwhile, RLHF is currently changing the world, and this has now prompted hasty retrospectives about whether it was net good, bad, or ok. At a minimum, mechanistic interpretability is not the only alignment-relevant work that all of these questions should be asked about.
If work is too risky to publish, it may often be good to avoid working on it at all. Pivotal acts could be great. And helping more risk-averse developers of TAI be more competitive seems good. But infohazardous work comes with inherent risks from misuse and copycatting. Much of the time, it may be useful to prioritize work that is more robustly good. And when risky things are worked on, it should be by people who are divested from potential windfalls resulting from it.
I strongly agree that this is a thing we should be thinking about. That mechanistic interpretability has failed to meaningfully enhance capabilities so far is, I think, largely owed to current interpretability being really bad. The field has barely figured out the first thing about NN insides yet. I think the level of understanding needed for proper alignment and deception detection is massively above what we currently have. To give a rough idea, I think you probably need to understand LLMs well enough to be able to code one up by hand in C, without using any numerical optimisation, and have it be roughly as good as GPT-2. I would expect that level of insight to have a high risk of leading to massive capability improvements, since I see little indication that current architectures, which were found mostly through not-very-educated guesswork, are anywhere near the upper limit of what the hardware and data sets allow.
I would go further and suggest that we need to plan for what happens if we have some success and see fundamental insights that seem crucial for alignment, but might also be used to make a superintelligence. How do you keep researching and working with collaborators safely in that information environment? There does not currently exist much of an infrastructure for different orgs and researchers to talk to one another under some level of security and trust. If not proper NDAs and vetting, at least the early establishment of stronger ecosystem norms around secrecy, and the normalization of legally non-binding, honor-based NDAs might be in order. If we don’t do it now, it might become a roadblock that eats up valuable time later, near the end, when time is even more precious.
TLDR: I care most about how we prioritise research and how we shape the culture of the field. I want mech interp to be a large and thriving field, but one that prizes genuine scientific understanding and steerability of systems, and not a field where the goal is to make a number go up, or where capabilities advancements feel intrinsically high status. Thinking through whether to publish something does matter on the margin and should be done, but it’s a lower order bit—most research that could directly cause harm if published is probably just not worth doing! And getting good at interpretability seems really important, in a way that makes me pretty opposed to secrecy or paralysis. I’m concerned about direct effects on capabilities (a la induction heads, or even more directly producing relevant ideas), but think that worrying about indirectly accelerating capabilities via eg field-building or producing fundamental insights about models that someone then builds on, is too hard and paralysing to be worthwhile. People new to the field tend to worry about this way too much, and should chill out.
I think these are important, but hard and thorny questions, and it’s easy to end up paralysed, or to avoid significant positive impact by being too conservative here. But they are also worth trying to carefully think through.
The most important question to me is not publication norms, but what research we do in the first place, and the norms in the field of what good research looks like. To me this is the highest leverage thing, especially as the field is small yet growing fast. My vision for the field of mechanistic interpretability is one that prizes rigorous, scientific understanding of models. And I personally judge by how well it tracked truth and taught me things about models, rather than whether it made models better. I’ll feel pretty sad if we build a field of mech interp where the high-status thing to do is to push hard on making models better.
To me this is a much more important question than publication norms—if you do research that’s net bad to publish, probably it would have been better to do something else with a clearer net win for alignment, all other things being the same. This can be hard to tell ahead of time, so I think this is worth thinking through before publishing, but that’s a lower-order bit.
At a high level, I think that getting good at interpretability seems crucial to alignment going well (whether via mech interp or some other angle of attack), and we aren’t very good at it yet! Further, it’s a very hard problem, yet with lots of surface area to get traction on it, and I would like there to be a large and thriving field of mech interp. This includes significant effort from academia and outside the alignment community, which means having people who are excited about capabilities advancement. This means accepting that “do no harm” is an unrealistically high standard, and I mostly want to go full steam ahead on doing great mech interp work and publishing it and making it easy to build upon. I think that tracking the indirect effects here is hard and likely ineffective and unhelpful. Though I do think that “would this result directly help a capabilities researcher, in a way that does not result in interpretability understanding” is a question worth thinking about.
I mostly think that interpretability so far has had fairly little impact on capabilities or alignment, but mostly because we aren’t very good at it! If the ambitious claims of really understanding a system hold, then I expect this to be valuable for both, (in a way that’s pretty correlated) though it seems far better on net than most capabilities work! We should plan for success—if we remain this bad at interpretability we should just give up and do something else. So to me the interesting question is how much there are research directions that push much harder on capabilities than alignment.
One area that’s particularly interesting to me is using interpretability to make systems more steerable, like interpretability-assisted RLHF. This seems like it easily boosts capabilities and alignment, but IMO is a pretty important thing to practice and test, and see what it takes to get good at this in practice (or if it just breaks the interpretability techniques and makes the failures more subtle!).
Using mech interp to analyse fundamental scientific questions in deep learning like superposition is more confusing to me. I would mostly guess it’s harmless (eg I would be pretty surprised if my grokking work is directly useful for capabilities!). For some specific questions like superposition, I think that better understanding this is one of the biggest open problems in mech interp, and well worth the capabilities externalities!
A final note is that these thoughts are aimed more at established researchers, and how we should think as we grow the field. I often see people new to the field, working independently without a mentor, who are very stressed about this. I think this is highly unproductive—an unmentored first project is probably not going to produce good research, let alone accidentally produce a real capabilities advance, and you should prioritise learning and seeing how much you enjoy the research.
Depending on when this document is circulated, I either have a post in my drafts folder on this topic, or I have recently posted my thoughts on this topic. I agree that the situation is pretty thorny. If the choice were all up to me and I had magic coordination powers, I’d create a large and thriving interpretability community that was committed to research closure relative to the larger world, while sharing research freely within the community, and while committing not to use the fruits of that research for capabilities advancements (until humanity understands intelligence well enough to use that knowledge wisely).
This likely depends a lot on what the “solution to superposition” looks like. A sparse coding scheme is less likely to be capabilities advancing than a fundamental insight into transformers that allows us to decode superposed features everywhere in the network.
Note that this goes both ways: just because mech interp has not been particularly useful for alignment so far does not mean that future work won’t be!