It is extremely reassuring that this is the sort of project OpenAI chooses to engage in. I don't particularly expect much of this approach at the object level (though it's obviously the sort of thing that's worth trying), but it's a nontrivial large-scale undertaking aimed at alignment, and the fact that OpenAI has bothered with it lends a lot of credence to their claims that they care about alignment at all.
An obvious way to extrapolate this is to:
1. Pick some way of factorizing the problem of interpreting a neural network. E. g., go neuron-by-neuron as was done here, or use the path expansion trick.
2. Analyse a few factors (neurons/paths/etc.) by hand. Meticulously record what you're doing.
3. Give GPT-4 a bunch of data-analysis plug-ins, an actual interface to GPT-2's weights, and a description of the procedure you used in (2).
4. Use GPT-4 to analyse every factor the same way you analysed the handful of factors in (2).
    - This may involve empirical testing of the interpretations; see, e. g., causal scrubbing, ARC's work on explanations, or the simulated-vs-actual activations comparison OpenAI did here.
5. Pore over the resulting data.
6. If there's too much, figure out a way to factorize the process of interpreting it, then GOTO 2.
If everything worked as intended, you should have on your hands an interpreted version of the computational graph the network is implementing.
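To make the shape of that loop concrete, here's a minimal Python sketch. Everything in it is a hypothetical stand-in: `query_llm`, `enumerate_factors`, and `get_activations` are placeholders for the plug-ins and weights-interface from step 3, and the scoring is just an explanation-simulates-activations correlation in the spirit of OpenAI's simulated-vs-actual comparison, not their exact metric.

```python
"""A minimal sketch of the loop above, under heavy assumptions. Every name
here (query_llm, enumerate_factors, get_activations, PROCEDURE) is a
hypothetical stand-in, not OpenAI's actual tooling or API."""
import numpy as np

# --- Stubs for the pieces you'd actually wire up in step 3 ---

def query_llm(prompt: str) -> str:
    """Stand-in for a GPT-4 call with data-analysis plug-ins attached."""
    raise NotImplementedError("hook up your LLM API here")

def enumerate_factors():
    """Step 1: yield factor ids under your chosen factorization
    (individual neurons, paths, etc.)."""
    raise NotImplementedError

def get_activations(factor_id, texts):
    """Interface to the subject model's weights: return the factor's
    actual activations on the given texts."""
    raise NotImplementedError

# The meticulously recorded hand procedure from step 2:
PROCEDURE = "...your step-by-step analysis instructions..."

def score_explanation(explanation, factor_id, texts):
    """Empirical check in the spirit of the simulated-vs-actual comparison:
    ask the LLM to simulate activations from the explanation alone, then
    correlate those guesses with the real activations."""
    simulated = np.array([
        float(query_llm(f"Explanation of a network component: {explanation}\n"
                        f"Predict its activation (0-10) on: {text}"))
        for text in texts
    ])
    actual = np.array(get_activations(factor_id, texts), dtype=float)
    return float(np.corrcoef(simulated, actual)[0, 1])

def interpret_everything(test_texts):
    results = {}
    for fid in enumerate_factors():  # step 4: same procedure, every factor
        explanation = query_llm(
            f"{PROCEDURE}\n\nApply this procedure to factor {fid}.")
        results[fid] = (explanation,
                        score_explanation(explanation, fid, test_texts))
    # Step 5: a human pores over `results`. Step 6: if it's too big,
    # factorize the problem of interpreting `results` itself and recurse.
    return results
```

All the actual difficulty hides in `enumerate_factors` and `PROCEDURE`; the loop itself is trivially parallelizable, which is the whole appeal.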
The core idea is that this lets us dramatically scale up interpretability efforts. I'm pretty sure you'd still need a smart human in the loop to direct the whole process, if it's to result in anything useful, but they wouldn't need to waste their time on low-level analysis (and the time of top-tier researchers seems like the main bottleneck on a lot of these projects).
You could in theory achieve the same result by hiring a horde of undergrads or something (pretty sure I'd suggested that before, even), but GPT-4 is plausibly better at this sort of instruction-following, and also faster.
I'm still skeptical of this whole approach. It may be, e. g., that it fails at step 2, because...

- … it turns out that analysing any individual factor is unlike analysing any other factor, such that you can't actually define a general procedure for analysis and hand it off to GPT-4;
- … it turns out that once you've figured out how to analyse any factor (and there are, e. g., 1-10 analysis procedures that suffice for the entire NN), you gain no additional useful information from proceeding to analyse every factor;
- … or something like that.
In those cases, GPT-4-conducted analysis would just yield a cute mountain of data of no particular value.
Nevertheless, it’s IMO definitely something that’s worth trying, just in case it does turn out to be that easy.