+1, I started reading this because I thought it was about RadVac
Thanks, I really enjoyed this post—this was a novel but persuasive argument for not using binary predictions, and I now feel excited to try it out!
One quibble: when you discuss calculating your calibration, doesn’t this implicitly assume that your mean was accurate? If my mean is very off but my standard deviation is correct, then this method says my standard deviation is way too low. But maybe this is fine: if I have a history of getting the mean wrong, I should have a wider distribution?
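To make the worry concrete, here's a minimal sketch (Python, with made-up numbers — the setup and all parameters are just my illustration, not anything from the post) of what a coverage-style calibration check reports when the mean is biased but the standard deviation is right:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy setup (all numbers made up): for each question I predict N(0, 1).
# My SD of 1 matches the true spread, but my mean is off by `bias`.
n, bias, sd = 100_000, 1.0, 1.0
truth = rng.normal(bias, sd, n)

# Where does each outcome land within my predicted distribution?
pct = stats.norm.cdf(truth, loc=0.0, scale=sd)

# A standard calibration check: how often does the truth fall inside my
# central 50% and 90% intervals?
for level in (0.5, 0.9):
    lo = (1 - level) / 2
    coverage = np.mean((pct > lo) & (pct < 1 - lo))
    print(f"nominal {level:.0%} interval -> actual coverage {coverage:.0%}")

# Prints roughly 33% and 74% coverage: the check reads this as "your SD
# is too low", even though only the mean was biased.
```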
Thanks for the feedback! That makes sense, I’ve updated the intro paragraph to that section to:
There are a range of agendas proposed for how we might build safe AGI, though note that each agenda is far from a complete and concrete plan. I think of them more as a series of confusions to explore and assumptions to test, with the eventual goal of making a concrete plan. I focus on three agendas here; these are just the three I know the most about, have seen the most work on, and, in my subjective judgement, think are most worth newcomers to the field learning about. This is not intended to be comprehensive; see eg Evan Hubinger’s Overview of 11 proposals for building safe advanced AI for more.
Does that seem better?
For what it’s worth, my main bar was a combination of ‘do I understand this agenda well enough to write a summary’ and ‘do I associate at least one researcher and some concrete work with this agenda’. I wouldn’t think of corrigibility as passing the second bar, since I’ve only seen it come up as a term to reason about or aim for, rather than as a fully-fledged plan for how to produce corrigible systems. It’s very possible I’ve missed out on some important work though, and I’d love to hear pushback on this.
Thanks a lot for the feedback, and the Anki cards! Appreciated. I definitely find that level of feedback motivating :)
These categories were formed by a vague combination of “what things do I hear people talking about/researching” and “what do I understand well enough that I can write intelligent summaries of it”—this is heavily constrained by what I have and have not read! (I am nowhere near as good as Rohin Shah at reading everything in Alignment :’( )
Eg, Steve Byrnes does a bunch of research that seems potentially cool, but I haven’t read much of it and don’t have a good sense of what it’s actually about, so I didn’t talk about it. And this is not expressing an opinion that, eg, his research is bad.
I’ve updated towards including a section at the end of each post/section with “stuff that seems maybe relevant that I haven’t read enough to feel comfortable summarising”.
Thanks for the appreciation!
If you’re trying to make it more legible to outsiders, you should consider defining AGI at the top.
Good idea, I just added this note to the top:
Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or AGI even mean. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise alternative.
Thanks! I’m probably not going to have time to write a top-level post myself, but I liked Evan Hubinger’s post about it.
I do wonder if vision problems are unusually tractable here; would it be so easy to visualise what individual neurons mean in a language model?
We actually released our first paper trying to extend Circuits from vision to language models yesterday! You can’t quite interpret individual neurons, but we’ve found some cases where we can interpret what an individual attention head is doing.
I really love the essay Visual Information Theory.
Self review: I’m very flattered by the nomination!
Reflecting back on this post, a few quick thoughts:
I put a lot of effort into getting better at teaching, especially during my undergrad (publishing notes, mentoring, running lectures, etc). In hindsight, this was an amazing use of time, and has been shockingly useful in a range of areas. It makes me much better at field-building, facilitating fellowships, and writing up thoughts. Recently I’ve been reworking the pedagogy for explaining transformer interpretability work at Anthropic, and I’ve been shocked at how relevant all of this is.
A related idea is that of the Pareto Frontier. Most people are bad at teaching; this leads to, eg, Research Debt in academia. I’m a pretty great teacher, but not exactly world-class. But I’m a great mathematician, and trying to become a great AI Safety researcher, and there are very, very few people who are great at both—this gives me a lot of room to explore my comparative advantage by eg writing field-building docs.
I wish I’d better emphasised just how useful a skill this is.
A lot of the post centres on teaching in specific contexts. This is reasonable, since it’s what I know, but I wish I’d better clarified what would and would not generalise—I’m afraid people who see this post will bounce off because it doesn’t seem relevant to them.
I wish I’d given more caveats about teaching gone wrong. My experience teaching younger people who view me as high-status is that it’s very easy to appear over-confident. I try to caveat what I say, but I tend to present as fairly confident, and people often take me way too seriously. While the techniques I present here are very effective at teaching, they have the flipside of better inserting my knowledge into the student’s system 1 and bypassing some of their mental filters, which can be bad and eg lead to groupthink and lowered agency.
Some, such as the Socratic method, are better on this front, by at least giving me chances to notice if what I’m teaching is wrong.
Sometimes it may be good to deliberately be a bad teacher, to teach the students agency and give them room to grow on their own and to form their own ideas. It’s worth checking for this—I just reflexively use good teaching technique nowadays, and it’s hard to suppress.
Some ideas, such as the knowledge graph, are vague intuitions that it would have been good to operationalise more.
With all that said, I’d only been blogging for 3 weeks when I wrote this post, and I wrote it in an afternoon, so I’m really happy with this as an artefact to come out of that! I am so, so happy I decided to do a month of daily blogging
What fraction of these fizzled out because they were displaced by a fitter variant vs just not spreading further? That seems very important for figuring out how much to freak out.
+1, I was pretty surprised and confused by the 37% stat. If basically all of the labour here comes from taxpayer funded science, where on earth is 63% of the revenue going?!
Thanks for the post! I love a lot of these, and hadn’t come across some of them before :)
Google Docs quick create. Shortcut key or single click to automatically create a new Google document or spreadsheet. Saves a ton of time.
The URLs doc.new and sheet.new also do this, and are pretty low friction (though not quite single click!). They work on any computer, though.
Quickcompose. You know how easy it is to get distracted by your inbox when you need to send an email? Quick compose makes it so that you can open up a window that’s just a compose window so you can’t get distracted by new emails.
I really like the extension Inbox When Ready - it hides your inbox by default, unless you click on the ‘show inbox’ button. This is enough to reduce ‘compulsively open email and check things’, as well as giving this functionality.
I feel like I make enough minor edits to my comments (typos etc) that this would be really annoying—I’d feel significantly more constrained in my ability to make edits, because I’d know it would spam people with notifications. Maybe having a “send notifications?” toggle would help.
As a counter-point, my day was made significantly better by the front page being nuked in 2020 - it was exciting, novel, hilarious (by my lights—clearly not to some people), made some excellent points about phishing and security, and gave me opportunities to dissect why people oriented to this event differently from me. I expect my experience would have been less good last year had the phishing attempt not happened and we had all simply coordinated. More generally, when a website does something unusual and novel like this, I feel like the value of novelty and interestingness can outweigh the costs of a single day of disrupted use?
I’d further argue that the people highly invested in this seem much more invested in the abstract ideas of trust, community, shared ritual and cohesion than in the object level of the frontpage being down (besides, people can always use greaterwrong.com).
If it helps, here’s a comment I wrote last year trying to narrate my internal experience of reading the email (I then read the 2019 threads and eventually twigged how seriously people took it, but that was strongly not my prior—it wouldn’t even have occurred to me to ask the question ‘do people take this more seriously than a game?’).
I was one of the 270 last year and am one of the 100 this year, and I did not understand the context last year. Empirically, neither did Chris. Multiple people on the EA Forum have commented about not understanding the context.