I think I would particularly critique DeepMind and OpenAI’s interpretability works, as I don’t see how this reduces risks more than other works that they could be doing, and I’d appreciate a written plan of what they expect to achieve.
I can’t speak on behalf of Google DeepMind or even just the interpretability team (individual researchers have pretty different views), but I personally think of our interpretability work as primarily a bet on creating new affordances upon which new alignment techniques can be built, or existing alignment techniques can be enhanced. For example:
It is possible to automatically make and verify claims about what topics a model is internally “thinking about” when answering a question. This is integrated into debate, and allows debaters to critique each other’s internal reasoning, not just the arguments they externally make. (A toy probing sketch after this list of examples illustrates one way such a claim could be backed.)
(It’s unclear how much this buys you on top of cross-examination.)
It is possible to automatically identify “cruxes” for the model’s outputs, making it easier for adversaries to design situations that flip the crux without flipping the overall correct decision.
Redwood’s adversarial training project is roughly in this category, where the interpretability technique is saliency, specifically the magnitude of the gradient of the classifier output w.r.t. the token embedding. (A minimal sketch of this computation also appears after this list of examples.)
(Yes, typical mech interp directions are far more detailed than saliency. The hope is that they would produce affordances significantly more helpful and robust than saliency.)
A different theory of change for the same affordance is to use it to analyze warning shots, to understand the underlying cause of the warning shot (was it deceptive alignment? specification gaming? mistake from not knowing a relevant fact? etc).
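To make the first affordance above more concrete, here is a deliberately toy sketch of one standard way a claim like “the model is internally representing topic X” could be backed: train a linear probe on hidden activations. The model, layer, and tiny dataset below are placeholder assumptions for illustration only, and the integration with debate is out of scope here.

```python
# Toy linear-probe sketch (model, layer, and data are illustrative assumptions).
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder model, not the one under discussion
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def mean_hidden_state(text: str, layer: int = 6) -> torch.Tensor:
    """Mean-pool one layer's hidden states over the token sequence."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).hidden_states[layer]  # shape (1, seq_len, d_model)
    return hidden.mean(dim=1).squeeze(0)

# Tiny toy dataset: is the text "about" finance (1) or not (0)?
labelled_texts = [
    ("The central bank raised interest rates again.", 1),
    ("Quarterly earnings beat analyst expectations.", 1),
    ("The cat chased a butterfly across the garden.", 0),
    ("She practiced the violin every evening.", 0),
]
X = torch.stack([mean_hidden_state(t) for t, _ in labelled_texts]).numpy()
y = [label for _, label in labelled_texts]

# The fitted probe encodes the "claim"; held-out accuracy would be the check.
probe = LogisticRegression(max_iter=1000).fit(X, y)

test = "Inflation data pushed bond yields higher."
print(probe.predict_proba(mean_hidden_state(test).numpy().reshape(1, -1)))
```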
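And here is a minimal sketch of the saliency technique named in the Redwood example: take the gradient of the classifier output with respect to the input token embeddings and use its per-token magnitude as a saliency score. The specific classifier, class index, and input text are placeholder assumptions, not the actual project’s setup.

```python
# Minimal gradient-saliency sketch (placeholder classifier and inputs).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # stand-in classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "The protagonist was badly hurt in the accident."
inputs = tokenizer(text, return_tensors="pt")

# Embed the tokens ourselves so we can differentiate w.r.t. the embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

logits = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"]).logits
score = logits[0, 1]  # logit of the class of interest (assumed index)

# Gradient of the classifier output w.r.t. every token embedding ...
(grad,) = torch.autograd.grad(score, embeddings)
# ... and its per-token magnitude is the saliency score.
saliency = grad.norm(dim=-1).squeeze(0)

for token, s in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), saliency.tolist()):
    print(f"{token:>12s}  {s:.4f}")
```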
I don’t usually try to backchain too hard from these theories of change to work done today; I think it’s going to be very difficult to predict in advance what kind of affordances we might build in the future with years’ more work (similarly to Richard’s comment, though I’m focused more on affordances than principled understanding of deep learning; I like principled understanding of deep learning but wouldn’t be doing basic research on interpretability if that was my goal).
My attitude is much more that we should be pushing on the boundaries of what interp can do, and as we do so we can keep looking out for new affordances that we can build. As an example of how I reason about what projects to do, I’m now somewhat less excited about projects that do manual circuit analysis of an algorithmic task. They do still teach us new stylized facts about LLMs like “there are often multiple algorithms at different ‘strengths’ spread across the model” that can help with future mech interp, but overall it feels like these projects aren’t pushing the boundaries as much as seems possible, because we’re using the same, relatively-well-vetted techniques for all of these projects.
I’m also more keen on applying interpretability to downstream tasks (e.g. fixing issues in a model, generating adversarial examples), but not necessarily because I think it will be better than alternative methods today, but rather because I think the downstream task keeps you honest (if you don’t actually understand what’s going on, you’ll fail at the task) and because I think practice with downstream tasks will help us notice which problems are important to solve vs. which can be set aside. This is an area where other people disagree with me (and I’m somewhat sympathetic to their views, e.g. that the work that best targets a downstream task won’t tackle fundamental interp challenges like superposition as well as work that is directly trying to tackle those fundamental challenges).
(EDIT: I mostly agree with Ryan’s comment, and I’ll note that I am considering a much wider category of work than he is, which is part of why I usually say “interpretability” rather than “mechanistic interpretability”.)
Separately, you say:
I don’t see how this reduces risks more than other works that they could be doing
I’m not actually sure why you believe this. On the views you’ve expressed in this post (which, to be clear, I often disagree with), it seems to me that you should think most of our work is just as bad as interpretability.
In particular we’re typically in the business of building aligned models. As far as I can tell, you think that interpretability can’t be used for this because (1) it is dual use, and (2) if you optimize against it, you are in part optimizing for the AI system to trick your interpretability tools. But these two points seem to apply to any alignment technique that is aiming to build aligned models. So I’m not sure what other work (within the “build aligned models” category) you think we could be doing that is better than interpretability.
(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of “indefinite, or at least very long, pause on AI progress”. If that’s your position I wish you would have instead written a post that was instead titled “against almost every theory of impact of alignment” or something like that.)
To give props to your last paragraphs: you are right about my underlying concern, namely that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment while AI governance is comparatively neglected, and I’m not sure that’s the best allocation of resources. I decided to write this post specifically on interpretability as a comparatively narrow target on which to practice my writing.
I hope to work on a more constructive post, detailing strategic considerations and suggesting areas of work and theories of impact that I think are most productive for reducing X-risks. I hope that such a post would be the ideal place for more constructive conversations, although I doubt that I am the best-suited person to write it.
High-level strategy: “primarily a bet on creating new affordances upon which new alignment techniques can be built”.
Makes sense, but I think this is not the optimal resource allocation. I explain why below:
(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of “indefinite, or at least very long, pause on AI progress”. If that’s your position I wish you would have instead written a post that was instead titled “against almost every theory of impact of alignment” or something like that.)
Yes, the pause is my secondary goal (edit: conditional on no significant alignment progress; otherwise, smart scaling and regulations are my priorities). My primary goal remains coordination and safety culture: I believe that one of the main pivotal processes goes through governance and coordination. A quote that explains my reasoning well is the following:
“That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die—there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post).”
That’s why I really appreciate Dan Hendrycks’ work on coordination. And I think DeepMind and OpenAI could make a huge contribution by doing technical work that is useful for governance. We talked a bit at EAG, and I understood that there is something like a numerus clausus in DeepMind’s safety team. In that case, since interpretability doesn’t require much compute or prestige, while DeepMind has a very high level of prestige, you should use that prestige to write papers that help with coordination; interpretability could be done outside the labs.
For example, some of your work, like Model evaluation for extreme risks or Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals, is great for that purpose!
My attitude is much more that we should be pushing on the boundaries of what interp can do, and as we do so we can keep looking out for new affordances that we can build.
I would agree with this perspective if we could afford the time to perform interpretability work on every model setup, but our headcount is too low for that. Given the urgency of addressing the alignment challenge quickly, it’s better to encourage (or even prioritize) conceptually sound interpretability work rather than speculative approaches.