This is probably one of the most influential papers that I’ve supervised, and my most cited MATS paper (400+ citations).
For a period, when I asked people what got them into mechanistic interpretability, a common answer was this paper.
I often meet people who incorrectly think that this paper introduced the technique of steering vectors.
This inspired at least some research within all of the frontier labs.
There have been a bunch of follow-on papers; one of my favourites was this Meta paper on guarding against the technique.
The technique has been widely used in open source circles. There are about 5,000 models on Hugging Face with “abliterated” in the name (an adaptation of our technique that really took off), and three have 100K+ downloads.
I find this all very interesting and surprising. This was originally a project on trying to understand the circuit behind refusal, and this was a fallback idea that Andy came up with, using some of his partial results to jailbreak a model. Even at the time, I basically considered it standard wisdom that “X is a single direction” was true for most concepts X. So I didn’t think the paper was that big a deal. I was wrong!
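For readers unfamiliar with the underlying idea, here is a minimal toy sketch of the “concept as a single direction” recipe: take the difference of mean activations between prompts that do and don’t express the concept, then project that direction out of the activations. All names and the synthetic data are illustrative assumptions; in real use the activations would come from a transformer’s residual stream via hooks, not from random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Toy stand-ins for residual-stream activations on two prompt sets.
# We plant a signal along basis vector 0 in the "concept" set so the
# recovered direction has something to find (purely illustrative).
concept_acts = rng.normal(size=(100, d_model)) + 3.0 * np.eye(d_model)[0]
baseline_acts = rng.normal(size=(100, d_model))

# Difference-of-means direction for the concept, normalized to unit length.
direction = concept_acts.mean(axis=0) - baseline_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def ablate(acts: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Remove the component of each activation along unit direction d."""
    return acts - np.outer(acts @ d, d)

ablated = ablate(concept_acts, direction)
# After ablation, every activation has (near-)zero component along the
# direction, so downstream computation can no longer read the concept off it.
print(np.abs(ablated @ direction).max())
```

The same projection applied to weight matrices rather than activations is roughly what the “abliteration” variants do, baking the ablation into the model permanently.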
So, what to make of all this? Part of the lesson is the importance of finding a compelling application. This paper is now one of my favourite short examples of how interpretability is real: it’s an interesting, hard-to-fake thing that is straightforward to achieve with interpretability techniques. My guess is that this played a large part in capturing people’s imaginations. And many people had not come across the idea that concepts are often single directions. This was a very compelling demonstration, and we thus accidentally claimed some credit for that broad finding. I don’t think this was a big influence on my shift to caring about pragmatic interpretability, but it definitely aligns with that.
Was this net good to publish? This was something we discussed somewhat at the time. I basically thought this was clearly fine on the grounds that it was already well known that you could cheaply fine-tune a model to be jailbroken, so anyone who actually cared would just do that. And for open source models, you only need one person who actually cares for people to use it.
I was wrong on that one—there was real demand! My guess is that the key thing is that this is easier and cheaper to do than finetuning, along with other people making good libraries and tutorials for it. Especially in low-resource open source circles, this is very important.
But I do think that jailbroken open models would have been available either way, and this hasn’t really made a big difference on that front. I hoped it would make people more aware of the fragility of open source safeguards—it probably did help, but I don’t know if that led to any change. My guess is that all of these impacts aren’t that significant, and the impact on the research field dominates.