A Longlist of Theories of Impact for Interpretability

I hear a lot of different arguments floating around for exactly how mechanistic interpretability research will reduce x-risk. As an interpretability researcher, forming clearer thoughts on this is pretty important to me! As a preliminary step, I’ve compiled a longlist of 20 different arguments I’ve heard for why interpretability matters. These are pretty scattered and early-stage thoughts (and emphatically my personal opinion rather than the official opinion of Anthropic!), but I’m sharing them in the hopes that they’re interesting to people.

(Note: I have not thought hard about this categorisation! Some of these overlap substantially, but feel subtly different in my head. I was not optimising for concision or for having few categories, and expect I could cut this list down substantially with effort.)

Credit to Evan Hubinger for writing the excellent Chris Olah’s Views on AGI Safety, which was the source of several of these arguments!

  1. Force-multiplier on alignment research: We can analyse a model to see why it gives misaligned answers, and what’s going wrong. This gets us much richer data from empirical alignment work, and lets it progress faster

  2. Better prediction of future systems: Interpretability may enable a better mechanistic understanding of the principles of how ML systems work, and how they change with scale, analogous to scientific laws. This allows us to better extrapolate from current systems to future systems, in a similar sense to scaling laws.

    1. Eg, observing phase changes a la induction heads shows us that models may rapidly gain capabilities during training

  3. Auditing: We get a mulligan. After training a system, we can check for misalignment, and only deploy it if we’re confident it’s safe

  4. Auditing for deception: Similar to auditing, we may be able to detect deception in a model

    1. This is a much lower bar than fully auditing a model, and is plausibly something we could do with just the ability to look at random bits of the model and identify circuits/​features—I see this more as a theory of change for ‘worlds where interpretability is harder than I hope’

  5. Enabling coordination/​cooperation: If different actors can interpret each other’s systems, it’s much easier to trust other actors to behave sensibly and coordinate better

  6. Empirical evidence for/​against threat models: We can look for empirical examples of theorised future threat models, eg inner misalignment

    1. Coordinating work on threat models: If we can find empirical examples of eg inner misalignment, it seems much easier to convince skeptics this is an issue, and maybe get more people to work on it.

    2. Coordinating a slowdown: If alignment is really hard, it seems much easier to coordinate caution/​a slowdown of the field with eg empirical examples of models that seem aligned but are actually deceptive

  7. Improving human feedback: Rather than training models to just do the right things, we can train them to do the right things for the right reasons

  8. Informed oversight: We can improve recursive alignment schemes like IDA by having each step include checking the system is actually aligned

    1. Note: This overlaps a lot with 7. To me, the distinction is that 7 can also be applied to systems trained non-recursively, eg today’s systems trained with Reinforcement Learning from Human Feedback

  9. Interpretability tools in the loss function: We can directly put an interpretability tool into the training loop to ensure the system is doing things in an aligned way

    1. Ambitious version—the tool is so good that it can’t be Goodharted

    2. Less ambitious version—the tool could be Goodharted, but doing so is expensive, and this shifts the inductive biases to favour aligned cognition

  10. Norm setting: If interpretability is easier, there may be expectations that, before a company deploys a system, part of doing due diligence is interpreting the system and checking it does what you want

  11. Enabling regulation: Regulators and policy-makers can create more effective regulations around how aligned AI systems must be if they/​the companies can use tools to audit them

  12. Cultural shift 1: If the field of ML shifts towards having a better understanding of models, this may lead to a better understanding of failure cases and how to avoid them

  13. Cultural shift 2: If the field expects better understanding of how models work, it’ll become more glaringly obvious how little we understand right now

    1. Quote: Chris provides the following analogy to illustrate this: if the only way you’ve seen a bridge be built before is through unprincipled piling of wood, you might not realize what there is to worry about in building bigger bridges. On the other hand, once you’ve seen an example of carefully analyzing the structural properties of bridges, the absence of such an analysis would stand out.

  14. Epistemic learned helplessness: Idk man, do we even need a theory of impact? In what world is ‘actually understanding how our black box systems work’ not helpful?

  15. Microscope AI: Maybe we can avoid deploying agents at all, by training systems to do complex tasks, then interpreting how they do it and doing it ourselves

  16. Training AIs to interpret other AIs: Even if interpretability is really hard/labour intensive on advanced systems, if we can create aligned AIs near human level, we can give them interpretability tools and use them to interpret more powerful systems

  17. Forecasting discontinuities: By understanding what’s going on, we can predict how likely we are to see discontinuities in alignment/​capabilities, and potentially detect a discontinuity while training/​before deploying a system

  18. Intervening on training: By interpreting a system during training, we can notice misalignment early on, potentially before it’s good enough for strategies to avoid our notice such as deceptive alignment, gradient hacking, obfuscating its thoughts, etc.

  19. Auditing a training run: By checking for misalignment early in training, we can stop training systems that seem misaligned. This gives us many more shots to make an aligned system without spending large amounts of capital, and eg allows us to try multiple different schemes, initialisations, etc. This essentially shifts the distribution of systems towards alignment.

  20. Eliciting Latent Knowledge: Use the length of the shortest interpretability explanation of the model’s behaviour as a training loss for ELK—the idea is that models with shorter explanations are less likely to include human simulations, or you can tell if they do (credit to Tao Lin for this one). A rough sketch of how such a loss term might slot into training is below.
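
To make ideas 9 and 20 a bit more concrete, here’s a minimal sketch (in Python/PyTorch) of what “an interpretability-derived term in the loss” could look like. The `explanation_length` function and the `lambda_interp` weight are hypothetical placeholders of my own for illustration; the entire open problem is building an interpretability tool that could actually fill that role and survive the optimisation pressure discussed below.

```python
# Minimal sketch of ideas 9/20: add an interpretability-derived penalty to the
# ordinary task loss. `explanation_length` is a hypothetical stand-in (here just
# an L1 penalty on the weights, a crude proxy for "how short is the shortest
# explanation of this model"); a real version would need an actual
# interpretability tool that produces a trainable signal.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
task_loss_fn = nn.CrossEntropyLoss()

def explanation_length(model: nn.Module) -> torch.Tensor:
    # Hypothetical placeholder for an interpretability tool's output.
    return sum(p.abs().sum() for p in model.parameters())

lambda_interp = 1e-4  # how strongly to weight the interpretability penalty

for _ in range(100):
    x = torch.randn(64, 16)          # toy inputs
    y = torch.randint(0, 2, (64,))   # toy labels
    loss = task_loss_fn(model(x), y) + lambda_interp * explanation_length(model)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In these terms, the ambitious version of 9 corresponds to a penalty that can’t be Goodharted, while the less ambitious version just hopes that Goodharting it is expensive enough to shift the inductive biases towards aligned cognition.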

Exercise: I think a really useful quick exercise for eliciting inside views is to go through this list and quickly rate each theory of impact out of 10. I ran a small discussion group on these where we did this; you can see the results here—note that these are very off-the-cuff hot takes, and not something you should take seriously or expect participants to stand by later on! The rated order was 1, 4, 3, 6, 8, 18, 2, 7, 16, 19, 17, 10, 9, 12, 14, 13, 15, 11, 5 (ie 1=‘a force-multiplier on alignment research’ is the best). Interestingly, 3 (auditing a fully-trained model) was the most controversial highly rated one.

I’ve made a publicly editable spreadsheet for you to add in your responses!


The most interesting substantive disagreement I found in the discussion was that I was comparatively much more excited about using interpretability to audit a trained model, and skeptical of interpretability tools being something that could be directly used in a training process without the resulting optimisation pressure breaking the tool, while other people had the reverse view. My intuition stemmed from the fact that a model doesn’t get any feedback on how its thoughts are internally represented, and so even an incredibly sophisticated deceptive model which is impossible to detect via its outputs may be easy to detect via interpretability tools (analogy: if I knew that sophisticated aliens were reading my mind, I would have no clue how to think deceptive thoughts in a way that evades their tools!).

The competing intuition was that maybe it isn’t that hard to think non-deceptive thoughts if you realise that someone will be scanning your mind, eg by developing the instinct to follow cached thoughts without thinking about them, and in occasional moments of deception generating a lot of cached thoughts that will lead to misaligned actions, but with no deceptive cognition in the moment. And that, instead, putting these tools in the training process could steer gradient descent away from the regions of model space that contain deceptive models at all: even if a sophisticated deceiver could reward hack the tool, an unsophisticated one couldn’t, and we only get a sophisticated deceiver by going via an unsophisticated one.

I’d love to hear other people’s ratings and why! And any arguments that you think I’ve missed.