I’ve only skimmed the report so far, but it seems very interesting. Most interpretability work assumes an externally trained model not explicitly made to be interpretable.
Are you familiar with interpretability work such as “Knowledge Neurons in Pretrained Transformers” (GitHub) or “Transformer Feed-Forward Layers Are Key-Value Memories” (GitHub)? They’re a bit different because they:
Focus on “background” knowledge such as “Paris is the capital of France”, rather than knowledge about the current context such as “the camera has been hacked”.
Only investigate externally trained models. I.e., no explicit training to make latent knowledge more accessible.
Knowledge Neurons in Pretrained Transformers identifies particular neurons whose activations correspond to human-interpretable pieces of knowledge such as “Paris is the capital of France”. The authors can partially erase or enhance the influence those pieces of knowledge have on the model’s output by changing the activations of the corresponding neurons.
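To make the intervention concrete, here is a rough sketch of that kind of neuron edit (my own simplification, not the paper’s code): it scales one neuron of a GPT-2 feed-forward layer via a forward hook and checks how the next-token prediction changes. The layer and neuron indices are made-up placeholders, and I hook the pre-activation of the FFN’s inner layer rather than the post-activation neurons the paper works with.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")

# Hypothetical layer/neuron indices; SCALE = 0.0 "erases" the neuron, > 1.0 amplifies it.
LAYER, NEURON, SCALE = 9, 1234, 0.0

def scale_neuron(module, inputs, output):
    # output: pre-activations of the FFN inner layer, shape (batch, seq, 3072) for GPT-2 small
    output[..., NEURON] = output[..., NEURON] * SCALE
    return output

hook = model.transformer.h[LAYER].mlp.c_fc.register_forward_hook(scale_neuron)
ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    probs = model(ids).logits[0, -1].softmax(-1)
hook.remove()
print(tok.decode([int(probs.argmax())]))  # is " Paris" still the top prediction?
```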
Transformer Feed-Forward Layers Are Key-Value Memories is somewhat like “circuits for transformers”. It shows how the first layer of each feed-forward block acts as a set of “keys” that detect syntactic or semantic patterns in the input, while the second layer’s “values” are triggered by particular keys and concentrate probability mass on tokens that tend to appear after the patterns in question. The paper also explores how the different layers interact with each other and with the residual stream to generate the final token distribution.
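Here’s a toy illustration of that key-value reading of an FFN (random weights; the shapes and names are mine, purely to show the computation, not the paper’s code):

```python
import torch

d_model, d_ff, vocab = 768, 3072, 50257
W_in  = torch.randn(d_ff, d_model) / d_model ** 0.5  # "keys": one row per memory cell
W_out = torch.randn(d_ff, d_model) / d_ff ** 0.5     # "values": one row per memory cell
W_emb = torch.randn(vocab, d_model)                  # (tied) output embedding, for reading values

h = torch.randn(d_model)        # residual-stream input to the FFN at one position
m = torch.relu(h @ W_in.T)      # memory coefficients: how strongly each key matches the input
ffn_out = m @ W_out             # FFN output = activation-weighted sum of value vectors

# Which memories fired hardest, and which tokens does each of their values promote?
for i in torch.topk(m, k=3).indices.tolist():
    promoted = torch.topk(W_out[i] @ W_emb.T, k=5).indices.tolist()
    print(f"memory {i}: top promoted token ids {promoted}")  # meaningless here (random weights)
```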
One question I’m interested in is whether it’s possible to train models to make these sorts of interpretability techniques easier to apply. E.g., I strongly suspect that dropout and L2 regularization make current state-of-the-art models much less interpretable than they otherwise would be, because these regularizers prompt the model to distribute its concept representations across multiple neurons.
I’m very interested in interpretability (and have read those papers in particular). We discuss the connection between ELK and interpretability in this appendix. Our main question is how complex the “interpretation” of a neural network must be in order to extract what the model knows. If interpretations must become quite complex, then it becomes hard to judge whether a given interpretation is correct (and hence reveals real structure inside the model) or is simply reading in the structure and relationships the researchers were looking for with their tools. If the interpretations are simple, then we hope that the kinds of regularization described in this document would have an easy time picking out the direct translator.
We are open to changing the training strategy for the underlying predictor in order to make it more interpretable, but we’re very scared about approaches like changing regularization. The basic issue is that in the worst case those changes can greatly impact the predictor’s performance. So within our research framework, if we change the loss function for the underlying predictor then we need to be able to argue that it won’t impact the predictor’s performance.
And that problem is quite fundamental in this case, since e.g. highly polysemantic neurons may simply be more performant. That means in the worst case you just need to be able to handle them.
(Outside of our research methodology, I’m also personally much more interested in techniques that can disentangle polysemantic neurons rather than trying to discourage them.)
Ensuring interpretable models remain competitive is important. I’ve looked into the issue for dropout specifically. This paper disentangles the different regularization benefits dropout provides and shows we can recover dropout’s contributions by adding a regularization term to the loss and noise to the gradient updates (the paper derives expressions for both interventions).
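As a schematic of what that looks like in a training step (placeholder penalty and noise scale, not the expressions the paper actually derives):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
penalty_weight, noise_std = 1e-4, 1e-3  # hypothetical stand-ins for the paper's derived quantities

def train_step(x, y):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    # explicit regularization term standing in for dropout's average-case effect
    penalty = sum(p.pow(2).sum() for p in model.parameters())
    (loss + penalty_weight * penalty).backward()
    # gradient noise standing in for dropout's update-to-update variance
    for p in model.parameters():
        p.grad.add_(noise_std * torch.randn_like(p.grad))
    opt.step()
    return loss.item()

x, y = torch.randn(16, 32), torch.randn(16, 1)
print(train_step(x, y))
```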
I think there’s a lot of room for high-performance, relatively interpretable deep models. E.g., the human brain is high performance and seems much more interpretable than you’d expect from deep learning interpretability research. Given our limitations in accessing and manipulating the brain’s internal state, something like brain stimulation reward would be basically impossible if the brain were as uninterpretable as current deep nets.