Why you should try degrading NN behavior in experiments.
I got some feedback on yesterday's post that seems right: the post is trying to do too many things, without properly explaining what it is doing, why it is reasonable, or how the different parts are related.
I want to try to fix this, since I think the main piece of advice in the post is important but gets lost in the mess.
The main point is:
experimentalists should, in many cases, run an experiment on multiple versions of a neural net produced by a variable "complexity dial" that allows some "natural" degradations of the NN's performance; certain dials are better than others depending on context.
I am eventually planning to split the post into a few parts, one of which explains this more carefully. When I do, I will replace the current version of the post with just a discussion of the "koan" itself: i.e., nitpicks about work that isn't careful about thinking about the scale at which it is performing interpretability.
For now I want to give a quick, reductive take on what I hope is the main takeaway of this discussion: namely, why I think "interpretability on degraded networks" is important for better interpretability.
Basically: when ML experiments modify a neural net to identify or induce a particular behavior, this always degrades performance. Now there are two hypotheses for what is going on:
You are messily pulling your NN in the direction of a particular behavior, and confusing this spurious messy phenomenon with finding a “genuine” phenomenon from the program’s point of view.
You are messily pulling your NN in the direction of a particular behavior, but also singling out a few “real” internal circuits of the NN that are carrying out this behavior.
Because of how many parameters you have to play with, and the polysemanticity of everything in a NN, it's genuinely hard to tell these two hypotheses apart. You might find something that "looks" like a core circuit but is actually just bits of other circuits combined together, which your circuit-fitting experiment makes look like a coherent behavior; any nice properties of the result that make it seem like an "authentic" circuit may just be artefacts of the way you set up the experiment.
Now the idea behind running this experiment at "natural" degradations of network performance is to try to separate these two possibilities more cleanly. Namely, an ideal outcome is that, in running your experiment on some class of natural degradations of your neural net, you find a regime in which (see the sketch after this list):
the intervention you are running no longer significantly affects the (naturally degraded) performance
the observed effect still takes place.
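To make this concrete, here is a minimal sketch (in Python) of what scanning a degradation dial for such a regime might look like. Everything here is hypothetical scaffolding: `degrade`, `run_intervention`, `performance`, and `effect_size` stand in for whatever model-editing and evaluation machinery a particular experiment actually uses, and the thresholds are arbitrary. The only point is the shape of the loop and the acceptance criterion.

```python
from typing import Any, Callable, Sequence

def scan_degradation_levels(
    degrade: Callable[[float], Any],         # hypothetical: returns a degraded copy of the net at a dial setting
    run_intervention: Callable[[Any], Any],  # hypothetical: returns an edited copy with the circuit edit applied
    performance: Callable[[Any], float],     # overall performance metric, e.g. validation loss
    effect_size: Callable[[Any], float],     # strength of the specific behavior under study
    levels: Sequence[float],
    perf_tolerance: float = 0.01,            # arbitrary threshold for "no longer significantly affects performance"
    effect_threshold: float = 0.5,           # arbitrary threshold for "the observed effect still takes place"
) -> list[dict]:
    """Scan a complexity dial and flag levels where the intervention barely
    moves overall performance but the targeted effect is still present."""
    results = []
    for level in levels:
        base = degrade(level)
        edited = run_intervention(base)  # assumed not to mutate `base`
        perf_drop = performance(edited) - performance(base)
        effect = effect_size(edited)
        results.append({
            "level": level,
            "perf_drop": perf_drop,
            "effect": effect,
            "clean_regime": abs(perf_drop) < perf_tolerance and effect > effect_threshold,
        })
    return results
```

If some band of dial settings comes back with `clean_regime` set, that is the regime the two bullet points above describe.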
Then what you’ve done is effectively “cleaned up” your experiment such that you are still probably finding interpretable behaviors in the original neural net (since a good degradation is likely to contain a subset of circuits/behaviors of your original net and not many “new behaviors), in a way that sufficiently reduced the complexity that the behavior you’re seeking is no longer “entangled” with a bunch of other behaviors; this should significantly update you that the behavior is indeed “natural” and not spurious.
This is of course a very small, idealized sketch. But the basic idea behind looking at neural nets with degraded performance is to “squeeze” the complexity in a controlled way to suitably match the complexity of the circuit (and how it’s embedded in the rest of the network/how it interacts with other circuits). If you then have a circuit of “the correct complexity” that explains a behavior, there is in some sense no “complexity room” for other sneaky phenomena to confound it.
In the post, the natural degradation I suggested is the physics-inspired "SGLD sampling" process (stochastic gradient Langevin dynamics), which in some sense tries to add a maximal amount of noise to your NN while having only a limited impact on performance (measured by loss); this is biased toward keeping "generally useful" circuits and interactions while noising the more inessential/memorize-y circuits. Other interventions with different properties are "just adding random noise" (either to weights or activations) to suitably reduce performance, or looking at earlier training checkpoints. I suspect that different degradations (or combinations thereof) are appropriate for isolating the relevant complexity of different experiments.
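For concreteness, here is a generic SGLD-style weight-sampling loop in PyTorch. This is just a textbook stochastic gradient Langevin dynamics update (a small gradient step that keeps the loss low, plus Gaussian noise whose scale is set by a temperature), not the precise procedure from the original post; the step size, temperature, and number of steps are assumptions that would need tuning so the loss only rises by a controlled amount.

```python
import torch

def sgld_degrade(model, data_loader, loss_fn, steps=1000,
                 step_size=1e-5, temperature=1.0, device="cpu"):
    """Degrade a trained model by sampling nearby weights with SGLD.

    Each update takes a small gradient step on the loss and adds Gaussian
    noise of scale sqrt(2 * step_size * temperature); higher temperature
    (or more steps) gives a more aggressive "natural" degradation.
    """
    model = model.to(device).train()
    params = [p for p in model.parameters() if p.requires_grad]
    data_iter = iter(data_loader)

    for _ in range(steps):
        try:
            x, y = next(data_iter)
        except StopIteration:          # restart the loader when exhausted
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        x, y = x.to(device), y.to(device)

        model.zero_grad()
        loss_fn(model(x), y).backward()

        with torch.no_grad():
            noise_scale = (2.0 * step_size * temperature) ** 0.5
            for p in params:
                if p.grad is None:
                    continue
                p.add_(-step_size * p.grad)                 # gradient step: stay in the low-loss basin
                p.add_(noise_scale * torch.randn_like(p))   # Langevin noise: the actual degradation

    return model
```

"Just adding random noise" to the weights would be roughly the same loop with the gradient step deleted, which is one way to see how these two degradation dials differ.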