Impact stories for model internals: an exercise for interpretability researchers

Inspired by Neel’s longlist; thanks to @Nicholas Goldowsky-Dill and @Sam Marks for feedback and discussion, and thanks to AWAIR attendees for participating in the associated activity.

As part of the Alignment Workshop for AI Researchers in July/​August ’23, I ran a session on theories of impact for model internals. Many of the attendees were excited about this area of work, and we wanted an exercise to help them think through what exactly they were aiming for and why. This write-up came out of planning for the session, though I didn’t use all this content verbatim. My main goal was to find concrete starting points for discussion, which

  1. have the right shape to be a theory of impact

  2. are divided up in a way that feels natural

  3. cover the diverse reasons why people may be excited about model internals work

(according to me[1]).

This isn’t an endorsement of any of these, or of model internals research in general. The ideas on this list are due to many people, and I cite things sporadically when I think it adds useful context: feel free to suggest additional citations if you think it would help clarify what I’m referring to.

Summary of the activity

During the session, participants identified which impact stories seemed most exciting to them. For a couple of those items, we discussed why people felt excited, what success might look like concretely, how the idea might fail, what other ideas are related, and so on. I think categorizing existing work based on its theory of impact could also be a good exercise in the future.

I personally found the discussion useful for helping me understand what motivated some of the researchers I talked to. I was surprised by the diversity.

Key stats of an impact story

Applications of model internals vary a lot along multiple axes:

  • Level of human understanding needed for the application[2]

  • Level of rigor or completeness (in terms of % model explained) needed for the application

    • If a high level of rigor or completeness is needed, does that update you on the difficulty of executing in this direction? What does the path to high rigor/​completeness look like? Can you think of modifications to the impact story that might make partial progress be more useful?

    • For example, we get value out of finding adversarial examples or dangerous capabilities, even if the way we find them is somewhat hacky. Meanwhile, if we don’t find them, we’d need to be extremely thorough to be sure they don’t exist, or sufficiently rigorous to get a useful bound on how likely the model is to be dangerous.

  • Is using model internals essential for the application,[3] or are there many possible approaches, only some of which make use of model internals?

    • Steering model behaviors can be done via model editing, or by prompting or finetuning; but there are reasons (mentioned below) why editing could be a better approach.

Many impact stories (at least as I’ve categorized them) have variants that live at multiple points on these spectra. When thinking about one, consider where it lands on these axes, and what variants you can come up with that might be, e.g., easier but still useful.

The list

Some are more fleshed out than others; some of the rest could be fleshed out with a bit more effort, while others are more speculative.

1. Scalable oversight

Looking into the model’s internals could help avoid reward misspecification, if evaluating the internals is easier than evaluating the final output. This is a form of process-based feedback, where the process is the model’s internalized cognition.

The inspection and feedback could come from a human or from another model. The feedback could enter directly into the loss, or just be available to inspect when computing the reward. It could be based on how transparent the internals are, how good the transparent parts look, or both.
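
To make the “directly in the loss” variant a bit more concrete, here is a toy sketch (mine, not taken from any particular paper): a frozen probe scores how concerning the model’s hidden activations look, and that score is added to the task loss as a penalty. The model, the probe, and the weighting are all hypothetical placeholders.

```python
import torch
import torch.nn as nn

# Toy stand-ins: a small model and a fixed probe that scores how concerning
# its hidden activations look. Both are hypothetical placeholders.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
probe = nn.Linear(32, 1)  # pretend the overseer trained this separately
for p in probe.parameters():
    p.requires_grad_(False)

def forward_with_hidden(x):
    """Run the model while keeping the intermediate activation."""
    hidden = torch.relu(model[0](x))
    return model[2](hidden), hidden

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))

out, hidden = forward_with_hidden(x)
task_loss = nn.functional.cross_entropy(out, y)
# Internals-based feedback: penalize activations the frozen probe flags.
oversight_penalty = torch.sigmoid(probe(hidden)).mean()
loss = task_loss + 0.1 * oversight_penalty  # 0.1 is an arbitrary weight

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a more realistic version, the signal would come from a human or model overseer inspecting the internals (or from how legible they are), rather than from a fixed linear probe.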

2. Eliciting latent knowledge

The model may have learned useful knowledge that we would like to make use of, but which we can’t train it to report (e.g. because it is in a superhuman regime).

How is this different from the previous point? ELK could be used for oversight. However, ELK more specifically targets: 1) revealing something that was previously inaccessible to the overseer (whereas, e.g., just linearly boosting overseer efficiency could also improve oversight); 2) details of what the model is doing that the model itself knows about (whereas any details of what the model is doing could improve oversight in general); and 3) (often) the worst case, rather than merely improving on baselines. We could also make use of the elicited knowledge in ways other than computing the reward for the model.

It’s possible that we would be able to have a great deal of understanding and just “read off” the model’s knowledge. Alternatively, we may need to use the model’s structure more abstractly: e.g. if we find two separate representations of something that seems like it could be our latent of interest, we might make use of the fact that one is more upstream in the model’s computation.[5]
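
As a cartoon of the “more upstream” idea: one thing we could do is probe for the candidate latent at several depths and see where it first becomes decodable. The sketch below uses random arrays in place of real activations (and sklearn probes as a stand-in for whatever method we’d actually use), so it only shows the shape of the check, not a real result.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: activations from several layers on inputs where we know
# the value of the candidate latent. Random arrays stand in for real data.
rng = np.random.default_rng(0)
n, d, n_layers = 1000, 128, 4
layer_acts = [rng.normal(size=(n, d)) for _ in range(n_layers)]
latent_labels = rng.integers(0, 2, size=n)

# Train a probe per layer; the depth at which the latent first becomes linearly
# decodable is a (very rough) proxy for how "upstream" its representation is.
for layer, acts in enumerate(layer_acts):
    probe = LogisticRegression(max_iter=1000).fit(acts[:800], latent_labels[:800])
    acc = probe.score(acts[800:], latent_labels[800:])
    print(f"layer {layer}: probe accuracy {acc:.2f}")  # ~0.5 on random data
```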

3. Auditing

The goal here is to catch whether the model learned the right thing, even if it does a great job on all training examples. Examples of things we might want to verify are absent include social reasoning capability, knowledge of bioweapons, or different behavior in training and deployment. If we find the model has one of these, we might throw out the whole model, just cut out a concerning piece, or do something else. We could also check for things we do want to see.

Model internals may be extra-helpful here for two reasons. First, we want to notice properties of the model as early as possible, perhaps before they are meaningfully reflected in I/O behavior. Second, we want to avoid relying on merely observing I/O behavior in cases where the model may be trying to hide things from us: it’s harder for something bad to look good the more bits you get to observe.[4]
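
As one illustration of what an internals-based check could look like here, a minimal probing sketch: collect activations on prompts that do and don’t exercise the capability we’re worried about, and see whether a simple probe can recover the distinction. The data below is random, and the concept, dataset, and decision threshold are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical setup: residual-stream activations from the target model on
# prompts that do / do not exercise the audited capability (e.g. social
# reasoning), with labels marking which are which. Random stand-in data here.
rng = np.random.default_rng(0)
acts = rng.normal(size=(2000, 512))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.25, random_state=0
)

# If a simple linear probe reliably recovers the concept from internal
# activations, that's (weak) evidence the model represents it, even if the
# capability never showed up in I/O behavior on the training distribution.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
print(f"probe accuracy: {accuracy:.2f}")  # ~0.5 on random data
flagged = accuracy > 0.9  # arbitrary threshold for "the model has this"
```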

However, other techniques are possible. For example, if we are worried about manipulation capabilities, we can try capability elicitation: finetune the model to be manipulative, and measure whether it is able to do a good job. (This would address both concerns above about relying on I/O behavior.)

Using model internals for this directly probably requires some understanding (though not necessarily of the target model: e.g. we could maybe understand what social reasoning tends to look like in some models we constructed, and then look for the same pattern in the target model). There are also indirect applications of model internals for auditing, such as...

4. Finding adversarial examples

This is an indirect use of model internals, whose impact is via accelerating other safety work (adversarial training). We can also use the existence of adversarial examples as a form of auditing.

Today there are a lot of “white-box” adversarial attacks that blindly backprop through the network, which I wouldn’t count as really using model internals. Occasionally, people construct adversarial examples to validate interpretability findings,[6] but as far as I know this approach has not actually been used for measuring or improving model robustness.
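
For reference, here is roughly what one of those “backprop through the network” attacks looks like, boiled down to a single FGSM-style gradient step on an untrained toy classifier. An attack that genuinely used model internals would instead let interpretability findings (a circuit, a feature direction) guide the search; the sketch below does not.

```python
import torch
import torch.nn as nn

# Untrained toy classifier, purely for illustration.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

x = torch.randn(1, 32, requires_grad=True)
label = torch.tensor([3])  # label we treat as correct; the attack pushes away from it

# "Blindly backprop through the network": increase the loss w.r.t. the input.
loss = nn.functional.cross_entropy(model(x), label)
loss.backward()

epsilon = 0.1  # perturbation budget, arbitrary here
x_adv = x + epsilon * x.grad.sign()

print("original pred:   ", model(x).argmax(dim=-1).item())
print("adversarial pred:", model(x_adv).argmax(dim=-1).item())
```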

Note that finding advexes does not necessarily require understanding: for example, we could train an adversary model as in the Go paper, or we could notice some low-level property that leads to finding an advex without actually understanding what is going on.

5. Relaxed adversarial training

It may be intractable to construct examples where the model displays some bad behavior. In that case, perhaps we can relax the adversary’s job to describing a situation in which the model would plausibly behave badly. As in the previous case, we can use this as a form of auditing in addition to training.

As with regular adversarial training above, model internals could be useful here for constructing the adversarial “inputs” (potentially more so, as the space of possible “inputs” is much larger). We also now need a way to measure, in a way we can back-propagate through, whether the model would plausibly behave badly, since we cannot actually run it on the adversarial “input”. We may need model internals for that step.
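
Here is a cartoon of what that last step could look like: the adversary optimizes a latent activation (standing in for a “described situation”) to maximize a differentiable, internals-based proxy for “the model plausibly behaves badly”. Everything here, including the model tail, the badness detector, and the lack of any plausibility constraint on the latent, is a toy placeholder.

```python
import torch
import torch.nn as nn

# The "second half" of a model (from some intermediate layer to the output)
# and a frozen, differentiable proxy for "the model behaves badly here".
model_tail = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
bad_behavior_score = nn.Linear(10, 1)  # hypothetical frozen detector
for p in list(model_tail.parameters()) + list(bad_behavior_score.parameters()):
    p.requires_grad_(False)

# Instead of searching over real inputs, the adversary searches over latent
# activations at the intermediate layer: a crude "relaxed" adversarial input.
latent = torch.zeros(1, 64, requires_grad=True)
opt = torch.optim.Adam([latent], lr=0.05)

for step in range(200):
    score = bad_behavior_score(model_tail(latent))  # higher = more plausibly bad
    loss = -score.mean()                            # adversary maximizes the score
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final badness score:", bad_behavior_score(model_tail(latent)).item())
```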

6. Mechanistic anomaly detection

That is: was the model’s output on this input produced “for the normal reason”? This means looking for anomalies in the mechanism the model uses, rather than in its I/O (as OOD detection might do). As a simplified example, it may be that the model usually produces answers because it’s trying to answer questions honestly, but this time it selected the answer it thought we were most likely to like. We might not know that that’s the underlying explanation (unlike in the ambitious form of auditing), but we could detect that the model is doing something different and flag it for further inspection before acting on its answer.

Why should we care about mechanistic anomalies? Perhaps the distribution shift isn’t detectable in the I/​O at all. Or perhaps it is, but we need to allow some distribution shifts while detecting “unsafe” ones, and leveraging the model internals could distinguish these.
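
One very simple baseline in this spirit: model the distribution of a layer’s activations on a trusted reference set, and flag inputs whose activations are outliers under that model (here via Mahalanobis distance). This only captures a weak, statistical notion of “same mechanism”, and the arrays below are random stand-ins for real activations.

```python
import numpy as np

# Hypothetical: activations from one layer on a trusted reference distribution,
# plus the activation on a new query. Random stand-in data.
rng = np.random.default_rng(0)
reference_acts = rng.normal(size=(5000, 256))
query_act = rng.normal(size=256) + 0.5  # pretend this input is "off"

# Fit a simple Gaussian model of "normal" internal behavior...
mean = reference_acts.mean(axis=0)
cov = np.cov(reference_acts, rowvar=False) + 1e-3 * np.eye(256)
cov_inv = np.linalg.inv(cov)

# ...and flag queries whose activations are anomalous under it.
diff = query_act - mean
query_dist = float(np.sqrt(diff @ cov_inv @ diff))

centered = reference_acts - mean
reference_dists = np.sqrt(np.einsum("ij,jk,ik->i", centered, cov_inv, centered))
threshold = np.quantile(reference_dists, 0.999)
print("flag for inspection:", query_dist > threshold)
```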

We could also imagine doing this between models or generations (though the techniques may look quite different, and this isn’t referred to as MAD). That is, we could measure whether one model implements a behavior similarly to another, even if we don’t understand either of them.
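
For the cross-model version, one family of existing tools is representational-similarity measures such as linear CKA, which compare two models’ activations on the same inputs without requiring us to understand either model (a much weaker notion than “implements the behavior the same way”, to be clear). A minimal sketch, with random arrays as stand-ins:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two activation matrices
    (n_examples x n_features); the feature dimensions may differ."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

# Hypothetical: activations from two models on the same inputs (random here).
rng = np.random.default_rng(0)
acts_model_a = rng.normal(size=(500, 256))
acts_model_b = rng.normal(size=(500, 128))
print(f"CKA similarity: {linear_cka(acts_model_a, acts_model_b):.3f}")
```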

7. Model editing

We could modify the model’s internals to modify its behavior (steering, unlearning, maybe also capability elicitation, etc.).

Today we do this with finetuning or prompting, but it’s possible that modifying internals could work better: for example, because of data unavailability, difficulty of constructing a suitable prompt (similarly to relaxed adversarial training), or simply because editing happens to work better in practice. For a recent example, see the Inference-Time Intervention paper.
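
A minimal sketch of the activation-addition flavor of editing (in the spirit of, though much simpler than, Inference-Time Intervention): pick a direction, and add it to one layer’s output at inference time via a forward hook. The model, the direction, and the strength below are arbitrary placeholders; in practice the direction might come from, say, the difference in mean activations between two contrasting sets of prompts.

```python
import torch
import torch.nn as nn

# Toy model; `block` stands in for the layer we want to intervene on.
block = nn.Linear(64, 64)
model = nn.Sequential(nn.Linear(64, 64), block, nn.Linear(64, 10))

steering_vector = torch.randn(64)  # hypothetical direction (random here)
alpha = 2.0                        # intervention strength, arbitrary

def add_steering(module, inputs, output):
    # Shift this layer's output along the chosen direction at inference time.
    return output + alpha * steering_vector

handle = block.register_forward_hook(add_steering)
x = torch.randn(1, 64)
steered_logits = model(x)
handle.remove()  # removing the hook restores the unedited model
baseline_logits = model(x)
print("max logit change:", (steered_logits - baseline_logits).abs().max().item())
```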

8. Scientific understanding

Better understanding could be helpful for prioritization and coordination.

A relatively concrete example would be seeing whether the model’s cognition tends to involve planning/​optimization, or whether it is more like a pile of heuristics. This would provide empirical evidence for/​against threat models, and it may be difficult to get comparable evidence from I/​O alone.

In general, we may be able to somehow make predictions about what capabilities might emerge in models or how different interventions would affect a model.

One reason that using model internals for understanding might be good: facts about internals may turn out to be easier for us to reason about than facts about behavioral evaluations.

9. Communicating AI risk

Lack of interpretability could be a particularly salient way to communicate how little we understand the AI systems we are building and the associated risks. Working on interpretability could be helpful for this if we can show a contrast between “good interpretability we did on a model from 3 years ago” and “model that is being considered for deployment today.”

The opposite might also be true: progress in interpretability could create a false sense of security and reduce appreciation of AI risk.

10. Microscope AI

“Maybe we can avoid deploying agents at all, by training systems to do complex tasks, then interpreting how they do it and doing it ourselves” (from Neel’s list). We can expand “doing it ourselves” to “writing software that we fully understand.”

11. Better inductive bias

We can train models to be more transparent/interpretable, or to have/avoid certain thought patterns. This may only get us apparent transparency, but that could still give a better inductive bias for the behavior the model learns. Note that, without specific reasons to think this would happen, this is an example of safety impact via a byproduct of work on model internals, rather than a direct impact story.

  1. ^

    When considering existing lists, sometimes I thought a headline point was important, but for different reasons than those given in the summary; or I thought something made sense as a theory of impact for model internals but wouldn’t require any interpretability; or I thought a single item mixed multiple distinct ideas.

  2. ^

    This is why the title here is “model internals” rather than “interpretability”: the latter sounds to me like it involves human understanding (someone interprets what the internals mean) which we don’t want to assume.

  3. ^

    Here are some reasons internals might be especially suited to a given application (though I don’t claim that they necessarily apply in practice for the applications I know of):

    1. Model internals can give us more bits of information on what the model is doing, making it substantially harder for the model to hide facts from the overseer.

    2. Model internals may be somehow crisper than e.g. the model’s outputs, such that it’s easier to isolate and understand factors we care about.

    3. Model internals may allow finding unknown unknowns that we wouldn’t have thought to look for; though this maybe requires high levels of detail and understanding.

    4. Model internals approaches are just generally different from other approaches, so might fail in uncorrelated ways.

  4. ^

    Powerful enough models could potentially make their internals look rosy as well, but this requires much more advanced capabilities.

  5. ^

    This sentence (and entire section) may be confusing if you are not familiar with ELK. I’m not aiming to fully explain it here. It’s also not the best example of using model internals for ELK, just one that’s relatively easy to say quickly.

  6. ^

    e.g. these papers from Jacob Andreas’ group