Turning off lights with model editing

This post advertises an illustrative example of model editing that David Bau is fond of, and which I think should be better known. As a reminder, David Bau is a professor at Northeastern; the vibe of his work is “interpretability + interventions on model internals”; examples include the well-known ROME, MEMIT, and Othello papers.

Consider the problem of getting a generative image model to produce an image of a bedroom containing unlit lamps (i.e. lamps which are turned off). Doesn’t sound particularly interesting. Let’s try the obvious prompts on DALL-E-2.

Prompt: A bedroom with lamps. The lamps are turned off.

Prompt: A dark bedroom. There are lamps, but the lamps are unlit.

Doesn’t work great: my first three attempts only got one borderline hit, at the expense of turning the entire bedroom dark (which isn’t really what I had in mind).

As David Bau tells things, even after putting further effort into engineering an appropriate prompt, they weren’t able to get what they wanted: a normal picture of a bedroom with lamps which are not turned on. (Apparently the captions for images containing unlit lamps don’t often mention the lamps.)

Despite not being able to get what they wanted with prompt engineering, they were able to get what they wanted via interpretability + model editing. Namely, they took a GAN which was trained to produce images of bedrooms, and then:

* identified a neuron which seemed to modulate the brightness of the lamps in the generated image
* intervened on the network’s activations, setting the activation of this neuron so as to produce the desired level of brightness.
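
To make those two steps concrete, here is a minimal sketch in NumPy. It is not the researchers’ code or their GAN: the tiny two-layer “generator”, the planted lamp unit, the pixel indices, and the helper names `generate` and `lamp_brightness` are all assumptions made up for illustration. It finds the hidden unit whose activation correlates best with lamp brightness, then clamps that unit to a chosen value.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LATENT, N_HIDDEN, N_PIXELS = 8, 16, 64
LAMP_PIXELS = np.arange(10, 14)  # hypothetical pixels covering the lamp
TRUE_LAMP_UNIT = 3               # planted ground truth, used only to build the toy

# Toy "generator" weights (an assumption, not a trained GAN).
W1 = rng.normal(size=(N_HIDDEN, N_LATENT))
W2 = rng.normal(size=(N_PIXELS, N_HIDDEN))
# Rig the toy: only the lamp unit writes to the lamp pixels, and nowhere else.
W2[LAMP_PIXELS, :] = 0.0
W2[:, TRUE_LAMP_UNIT] = 0.0
W2[LAMP_PIXELS, TRUE_LAMP_UNIT] = 1.0


def generate(z, clamp=None):
    """Map latents (batch, N_LATENT) to images (batch, N_PIXELS).

    clamp: optional (unit_index, value) pair that overwrites one hidden
    activation before the output layer. This is the "model edit".
    """
    h = np.maximum(z @ W1.T, 0.0)  # hidden activations (ReLU)
    if clamp is not None:
        unit, value = clamp
        h = h.copy()
        h[:, unit] = value
    return h @ W2.T, h


def lamp_brightness(images):
    """Mean intensity over the lamp pixels, one value per image."""
    return images[:, LAMP_PIXELS].mean(axis=1)


# Step 1: identify the unit that best tracks lamp brightness across samples.
z = rng.normal(size=(256, N_LATENT))
images, h = generate(z)
brightness = lamp_brightness(images)
corr = np.array(
    [np.corrcoef(h[:, j], brightness)[0, 1] for j in range(N_HIDDEN)]
)
lamp_unit = int(np.argmax(np.abs(corr)))

# Step 2: intervene on that unit's activation to set the brightness.
lights_off, _ = generate(z, clamp=(lamp_unit, 0.0))
lights_on, _ = generate(z, clamp=(lamp_unit, 5.0))
```

In this toy, clamping the unit to 0 turns the lamp pixels off and leaves every other pixel alone. That is cleaner than the real case, because the weights were rigged so that exactly one unit writes to the lamp.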

The result is shown below.

I like this example because it provides a template in a very toy setting for how model editing could be useful for alignment. The basic structure of this situation is:

* We have a model which is not behaving as desired, despite being able to do so in principle^[1] (our model doesn’t output images of bedrooms with unlit lamps, despite being capable of doing so).
* Basic attempts at steering the model’s behavior fail (prompt engineering isn’t sufficient).
* But we are able to get the behavior that we want by performing a targeted model edit.

This example also showcases some key weaknesses of this technique, which would need to be addressed for model editing to become a viable alignment strategy:

* Alignment tax. Looking closely at the image above, you’ll notice that even though the direct light from the lamp can be modified at will, certain second-order effects aren’t affected (e.g. the light which is reflected off the wall). As David tells things, they also identified a whole suite of ~20 neurons in their GAN which modulated more subtle lighting effects.
* Not all behaviors can necessarily be targeted. The images on the lower row above contain two lamps, and these two lamps change their brightness together. The researchers were not able to find a neuron which would allow them to change the brightness of only one lamp in images that contained multiple lamps.
* No clear advantages over finetuning. The more obvious thing to do would be to finetune the model to output unlit lamps. As far as I know, no one tried to do that in this case, but I imagine it would work. I’ll leave my most optimistic speculation about why model editing could have advantages over finetuning in certain situations in this footnote^[2], but I don’t currently find this speculation especially compelling.

Overall, I’m not super bullish on the usefulness of model editing for alignment. But I do think it’s intriguing, and it seems to have been useful in at least one case (though not necessarily one which is very analogous to x-risky cases of misalignment). I think that work like this is a risky bet, with the advantage that some of its failure modes might differ from the failure modes of other alignment techniques.

[Thanks to Xander Davies and Oam Patel for discussion.]

^[1] That is, I find it very likely that DALL-E-2 “can” produce images of bedrooms with unlit lamps, for whatever reasonable interpretation you’d like of what that “can” means.
^[2] The general vibe here is “the effectiveness of model editing scales with our ability to generate accurate interpretability hypotheses, whereas the effectiveness of finetuning scales with something different.” In more detail:
* Finetuning relies on being able to effectively evaluate outputs, whereas model editing might not. For example, suppose that we would like to get a language model to write good economics papers. Finetuning for human evaluations of economics papers could jointly optimize for papers which are more true as well as more persuasive. On the other hand, if our interpretability were good enough to separate out the mechanisms for writing truthfully vs. writing persuasively, then we could hope to make a model edit which results in papers that are more true but not more persuasive.
* Finetuning might require large datasets, whereas model editing might not. Again in the economics paper writing example, finetuning a model to write good economics papers might require a large number of evaluations. On the other hand, we might be able to generate accurate interpretability hypotheses using only a small number of dataset examples (and possibly no dataset examples at all).
* Model editing might transfer better out-of-distribution. If the interpretability hypothesis on which we base a model edit is good, then the model edit ought to have the predicted effect on a broad distribution of inputs. On the other hand, finetuning only guarantees that we get the desired behavior on the finetuning distribution.