Interpretability isn’t Free

Epistemic status: This post uses a simple and intuitive model[1], which seems obvious in retrospect, but it hasn’t been formalized and I’m not confident in its conclusions.

Current consensus is that large language models will be a core component of transformative AI systems. If that’s the case, interpretability for language models is probably an important piece of solving alignment. Anthropic’s recent Transformer Circuits Thread is what I consider the state of the art in the field. They’ve done well to make quick progress, but I think their work also hints at an insoluble core to the problem.

In the most recent article, Softmax Linear Units, the authors replace the GeLU activation function in their transformer model with a new Softmax Linear Unit (SoLU). This is designed to discourage polysemanticity / superposition, in which a single neuron takes on multiple unrelated meanings. However, the authors caution:

SoLU is a double-edged sword for interpretability. On the one hand, it makes it much easier to study a subset of MLP layer features which end up nicely aligned with neurons. On the other hand, we suspect that there are many other non-neuron-aligned features which are essential to the loss and arguably harder to study than in a regular model. Perhaps more concerningly, if one only looked at the SoLU activation, it would be easy for these features to be invisible and create a false sense that one understands all the features.

Gwern responded on LessWrong:

Extremely concerning for safety. The only thing more dangerous than an uninterpretable model is an ‘interpretable’ model. Is there an interpretability tax such that all interpretability methods wind up incentivizing covert algorithms, similar to how CycleGAN is incentivized to learn steganography, and interpretability methods risk simply creating mesa-optimizers which optimize for a superficially-simple seeming ‘surface’ network while it gets the real work done elsewhere out of sight?
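
For concreteness, here is a minimal sketch of the activation change under discussion, written in NumPy. The shapes and the tanh approximation of GeLU are my own choices, not details from the paper; as I understand it, the paper's full architecture also applies an extra LayerNorm after the SoLU activation, which I omit here.

```python
import numpy as np

def gelu(x):
    # GeLU baseline (tanh approximation), applied elementwise.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def solu(x):
    # Softmax Linear Unit: x * softmax(x), with the softmax taken across the
    # hidden dimension. The softmax pushes each position toward a few large
    # activations, which is what encourages features to align with neurons.
    # (The paper's full block also applies a LayerNorm after this; omitted here.)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return x * (e / e.sum(axis=-1, keepdims=True))

# Toy usage on a single MLP hidden vector (d_mlp = 8 is an assumed size).
h = np.random.default_rng(0).normal(size=(1, 8))
print(gelu(h))
print(solu(h))
```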

The theory is that because polysemanticity hinders interpretability, it makes sense to attempt to prevent it. However, requiring the same performance (loss) from the modified architecture suggests that we’re expecting to get interpretability for free (without losing any capability). Is this a reasonable expectation? Possibly not, since the network has been trained under the optimization pressure of an entire TPU pod to maximize use of every neuron.

In other words, if we accept the superposition hypothesis, the network has demonstrated that representing more concepts than it has neurons is useful for the loss, so it can't possibly assign each neuron a single meaning. If interpretability depends heavily on concepts mapping cleanly onto neurons, then perfect interpretability is too much to ask of this architecture.
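
As a toy illustration of what superposition forces on the model (the sizes and setup here are mine, not from the post or Anthropic's work): when a layer tries to represent many more feature directions than it has neurons, those directions cannot all be orthogonal, so they necessarily overlap and individual neurons pick up several meanings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 512, 64  # assumed toy sizes: many more concepts than neurons

# Give each feature a random unit-norm direction in neuron space. With
# n_features >> n_neurons these directions can't all be orthogonal, so
# features necessarily share neurons (superposition / polysemanticity).
W = rng.normal(size=(n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

overlap = W @ W.T               # cosine similarity between feature directions
np.fill_diagonal(overlap, 0.0)  # ignore each feature's overlap with itself
print(f"mean |overlap| between distinct features: {np.abs(overlap).mean():.3f}")
print(f"max  |overlap| between distinct features: {np.abs(overlap).max():.3f}")
```

On the superposition view, this residual overlap is interference the trained network tolerates because representing the extra concepts is worth the cost.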

Interpretability isn’t free[2]. Holding model size and architecture[3] fixed, you ought to expect interpretability to cost performance. Alternatively, it might be possible to maintain performance, but then you’d need a method for untangling the concepts that are in superposition, and that would increase model size[4]. Worse, I don’t know of any such method.
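
To put a rough number on footnote 4's worry, here is a back-of-the-envelope sketch, treating activations as simply on or off (a big simplification; real activations are continuous): the number of distinct activation patterns, and hence the number of concepts a layer could in principle encode combinatorially, grows exponentially with the number of neurons.

```python
# A layer with d neurons has 2**d distinct on/off activation patterns, each of
# which could in principle stand for a different concept.
d = 64
print(2 ** d)  # 18446744073709551616 patterns from only 64 neurons

# Giving every such concept its own dedicated, monosemantic neuron would
# therefore require a layer exponentially wider than the original one.
```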

As an aside, there’s no natural limit to the number of useful concepts: the training process is incentivized to slice concepts ever finer, and while each refinement saves only a tiny amount of loss, across thousands or millions of concepts this adds up. So there’s no way to prevent polysemanticity by simply adding more neurons.

This highlights the need for interpretability tools that focus on “unhappy” cases (e.g. adversarial examples and polysemanticity) rather than happy cases, since with anything like current architectures these unhappy cases are unavoidable[5]. There is some interpretability work on adversarial examples for image models, but I believe this is an underexplored and important space for language models.

  1. ^

    Though I haven’t seen it mentioned anywhere.

  2. ^

    Or more precisely, interpretability probably isn’t free for the current language model paradigm.

  3. ^

    There’s a bit of subtlety here: I don’t have a formal definition of what exactly counts as the same architecture. For example, I’m counting a model with each GeLU replaced by a SoLU as the same architecture.

  4. ^

    Potentially exponentially, since the model could encode a different concept with each unique combination of neuron activations in a layer.

  5. ^

    Even better, though far more ambitious, would be better architectures that avoid these unhappy cases in the first place.