If interpretability research goes well, it may get dangerous
I’ve historically been pretty publicly supportive of interpretability research. I’m still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed.
I acknowledge that spreading research insights less widely comes with real research costs. I’d endorse building a cross-organization network of people who are committed to not using their understanding to push the capabilities frontier, and sharing freely within that.
I acknowledge that public sharing of research insights could, in principle, both shorten timelines and improve our odds of success. I suspect that isn’t the case in real life.
It’s much more important that blatant and direct capabilities research be made private. Anyone fighting for people to keep their AI insights close to the chest should be focusing on the capabilities work that’s happening out in the open long before they focus on interpretability research.
Interpretability research is, I think, some of the best research that can be approached incrementally and by a large number of people, when it comes to improving our odds. (Which is not to say it doesn’t require vision and genius; I expect it requires that too.) I simultaneously think it’s entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don’t have good solutions. Reality doesn’t have to provide you any outs.
There’s a tradeoff here. And it’s not my tradeoff to make; researchers will have to figure out what they think of the costs and benefits. My guess is that the current field is not close to insights that would significantly improve capabilities, and that growing the field is important (and would be hindered by closure), and also that if the field succeeds to the degree required to move the strategic needle then it’s going to start stumbling across serious capabilities improvements before it saves us, and will need to start doing research privately before then.
I reiterate that I’d feel ~pure enthusiasm about a cross-organization network of people trying to understand modern AI systems and committed not to letting their insights push the capabilities frontier.
My goal in writing this post, though, is mostly to keep the Overton window open around the claim that there is in fact a tradeoff here, that there are reasons to close even interpretability research. Maybe those reasons should win out, or maybe they shouldn’t, but don’t let my praise of interpretability research obscure the fact that there are tradeoffs here.
The problem here is that any effective alignment research is very powerful capability research, almost by definition. If one can actually steer or constrain a powerful AI system, this is a very powerful capability by itself and would enable all kinds of capability boosts.
And imagine one wants to study the core problem: how to preserve values and goals through recursive self-improvement and “sharp left turns”. And imagine that one would like to actually study this problem experimentally, and not just theoretically. Well, one can probably create a strictly bounded environment for “mini-foom” experiments (drastic changes in a really small, closed world). But all fruitful techniques for recursive self-improvement learned during such experiments would be immediately applicable for reckless recursive self-improvement in the wild.
How should we start addressing this?
Best I’ve got is to go dark once it feels like you’re really getting somewhere, and only work with people under NDAs (honour based or actually legally binding) from there on out. At least a facsimile of proper security, central white lists of orgs and people considered trustworthy, central standard communication protocols with security levels set up to facilitate communication between alignment researchers. Maybe a forum system that isn’t on the public net. Live with the decrease in research efficiency this brings, and try to make it to the finish line in time anyway.
If some org or people would make it their job to start developing and trial running these measures right now, I think that’d be great. I think even today, some researchers might be enabled to collaborate more by this.
Very open to alternate solutions that don’t cost so much efficiency if anyone can think of any, but I’ve got squat.
I hope people will ponder this.
Ideally, one wants “negative alignment tax”, so that aligned systems progress faster than the unaligned ones.
And if alignment work does lead to a capability boost, one might get exactly that. But then people pursuing such work might suddenly find themselves grappling with all the responsibilities of being a capabilities leader. If they are focused on alignment, this presumably reduces the overall risks, but I don’t think we’ll ever end up in a situation of zero risk.
I think we need to start talking about this, both in terms of policies of sharing/not sharing information and in terms of how we expect an alignment-focused organization to handle the risks, if it finds itself in a position where it might be ready to actually create a truly powerful AI way above state-of-the-art.
One major capabilities hurdle that’s related to interpretability: The difference between manually “opening up” the model to analyze its weights, etc., and being able to literally ask the model questions about why it did certain things.
And it seems like a path to solving that is to have the AI be able to analyze its own workings, which seems like a potential path to recursive self-improvement as well.
One way that people think about the situation, which I think leads them to underestimate the costs of secrecy, is that they think about interpretability as a mostly theoretical research program. If you think of it that way, then I think it disguises the costs of secrecy.
But in addition to being a research program, interpretability is in part about producing useful technical artifacts for steering DL, i.e., standard interpretability tools. And technology becomes good because it is used.
It improves through tinkering, incremental change, and ten thousand slight changes, each of which improves some positive quality by 10%. Look at what the first cars looked like and how many transformations they went through to get to where they are now. Or look at the history of the gun. Or, what is relevant for our causes, look at the continuing evolution of open source DL libraries from TF to PyTorch to PyTorch 2. This software became more powerful and more useful because thousands of people have contributed, complained, changed one line of documentation, added boolean flags, completely refactored, and so on and so forth.
If you think of interpretability being “solved” through the creation of one big insight, I think it becomes more likely that interpretability could be closed without tremendous harm. But if you think of it being “solved” through the existence of an excellent torch-shard-interpret package used by everyone who uses PyTorch, together with corresponding libraries for Jax, then I think the costs of secrecy become much more obvious.
Would this increase capabilities? Probably! But I think a world 5 years hence, where capabilities have been moved up 6 months relative to zero interpretability artifacts, but where everyone can look relatively easily into the guts of their models and in fact does so to improve them, seems preferable to a world 6 months delayed but without these libraries.
I could be wrong about this being the correct framing. And of course, these frames must mix somewhat. But the above article seems to assume the research-insight framing, which I think is not obviously correct.
I was previously pretty dubious about interpretability results leading to capabilities advances. I’ve only really seen two papers which did this for LMs and they came from the same lab in the past few months.
It seemed to me like most of the advances in modern ML (other than scale) came from people tinkering with architectures and seeing which modifications increased performance.[1]
But in a conversation with Oliver Habryka and others, it was brought up that as AI models are getting larger and more expensive, this tinkering will get more difficult and expensive. This might cause researchers to look for additional places for capabilities insights, and one of the obvious places to find such insights might be interpretability research.
[1] I’d be interested to hear if this isn’t actually how modern ML advances work.
It’s not clear what the ratio of capabilities/alignment progress is for interpretability. There is no empirical track record[^1] of interpretability feeding back into improvements of any kind.
A priori it seems like it would be good because understanding how things work is useful to understand their behavior better, and thus be able to tell whether or not a model is aligned or how to make it more so. But understanding how things work is also useful for making them more capable, e.g. if you use interpretability as a model-debugger, it’s basically general purpose for dealing with ML models.
[^1]: known to the author
I don’t know if this is really a practical proposal. I think the benefits from sharing interpretability work would still outstrip the dangers; plenty of people do capability work and share it. And at least advances in capabilities based on interpretability work ought to be interpretable themselves.
Doesn’t this push more of the capabilities work that continues unheeded toward those researchers who do not expect alignment issues to arise, or even those who assume that capabilities is equivalent to alignment? (I.e., among researchers somewhere in between, who might be on the fence about whether to continue or not, the ones who choose to slow down capabilities research are the ones less likely to believe that capabilities = alignment.)
Thus, we more likely enter a world in which the fastest-developing AI systems are made by teams who prefer sharing, tend to believe that capabilities solves alignment automatically, and also (according to my own reasoning thus far) probably believe that the “universe abstracts well”, in the sense that interpretability should be fairly straightforward, and the most capable AI models will also be the most interpretable, and vice-versa.
This last thought might be kind of interesting: it might imply that the latter category of researchers will tend to develop their AI in such a fashion. Therefore, how correct their overall models of AI turn out to be might also be reflected in how successful their capabilities progress actually turns out to be.