I’m very interested in interpretability (and have read those papers in particular). We discuss the connection between ELK and interpretability in this appendix. Our main question is how complex the “interpretation” of neural networks must be in order to extract what the models know. If those interpretations become quite complex, then it starts to become hard to judge whether a given interpretation is correct (and hence revealing structure inside the model) or simply making up the structure and relationships that the researchers were looking for with their tools. If the interpretations are simple, then we hope that the kinds of regularization described in this document would have an easy time picking out the direct translator.
One question I’m interested in is whether it’s possible to train models to make these sorts of interpretability techniques easier to use. E.g., I strongly suspect that dropout and L2 regularization make current state-of-the-art models much less interpretable than they otherwise would be, because these regularizers prompt the model to distribute its concept representations across multiple neurons.
We are open to changing the training strategy for the underlying predictor in order to make it more interpretable, but we’re very scared about approaches like changing regularization. The basic issue is that in the worst case those changes can greatly impact the predictor’s performance. So within our research framework, if we change the loss function for the underlying predictor then we need to be able to argue that it won’t impact the predictor’s performance.
And that problem is quite fundamental in this case, since e.g. highly polysemantic neurons may simply be more performant. That means in the worst case you just need to be able to handle them.
(Outside of our research methodology, I’m also personally much more interested in techniques that can disentangle polysemantic neurons rather than trying to discourage them.)
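To make “disentangling polysemantic neurons” concrete, here is a minimal sketch of one family of techniques people apply to this problem: sparse coding over activations. The setup is entirely hypothetical (a random feature dictionary, a synthetic activation vector, and an ISTA solver I wrote for illustration); it is not any specific published method, and in practice the feature dictionary would itself have to be learned rather than known in advance.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical superposition setup: an activation vector over 20 neurons is a
# dense mixture of 3 out of 50 underlying sparse features, so individual
# neurons look polysemantic even though the underlying features are clean.
d_neurons, n_features = 20, 50
D = rng.normal(size=(d_neurons, n_features))
D /= np.linalg.norm(D, axis=0)            # unit-norm feature directions

true_codes = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
true_codes[active] = rng.uniform(1.0, 2.0, size=3)
activation = D @ true_codes               # the polysemantic-looking vector

# ISTA (iterative shrinkage-thresholding) for the lasso problem
#   min_c 0.5 * ||activation - D c||^2 + lam * ||c||_1
lam = 0.1
L = np.linalg.norm(D, 2) ** 2             # Lipschitz constant of the gradient
c = np.zeros(n_features)
for _ in range(500):
    grad = D.T @ (D @ c - activation)
    z = c - grad / L
    c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)

# The sparse code c should concentrate on a few features rather than
# spreading mass across all 50, "un-mixing" the dense activation.
recovered = np.flatnonzero(np.abs(c) > 0.1)
print(sorted(recovered), sorted(active))
```

The point of the sketch is just the shape of the approach: rather than forcing the network to avoid superposition during training (which may cost performance), you post-hoc decompose dense activations into a sparse combination of directions.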
Ensuring interpretable models remain competitive is important. I’ve looked into the issue for dropout specifically. This paper disentangles the different regularization benefits dropout provides and shows we can recover dropout’s contributions by adding a regularization term to the loss and noise to the gradient updates (the paper derives expressions for both interventions).
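As a toy illustration of the idea of replacing dropout with explicit interventions, here is a sketch in a linear-regression setting. All specifics are my own assumptions for illustration, not the cited paper’s derivation: the penalty scale `p / (1 - p)` weighted by per-feature second moments follows the well-known dropout-as-adaptive-L2 correspondence for linear models, and the Gaussian gradient noise is a crude stand-in for dropout’s stochasticity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear regression y = X @ w_true + noise.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

p = 0.2             # dropout rate being emulated
lam = p / (1 - p)   # scale of the dropout-equivalent L2-style penalty (assumed form)
sigma = 0.01        # std of the explicit gradient noise (assumed form)
lr = 0.01

w = np.zeros(d)
col_sq = (X ** 2).mean(axis=0)   # per-feature second moments weight the penalty

for step in range(2000):
    resid = X @ w - y
    grad_loss = X.T @ resid / n
    # Explicit regularization term standing in for dropout's averaging effect.
    grad_reg = lam * col_sq * w
    # Explicit noise standing in for dropout's stochasticity.
    noise = sigma * rng.normal(size=d)
    w -= lr * (grad_loss + grad_reg + noise)

print(np.round(w, 2))
```

The appeal for interpretability is that both interventions act on the loss and the optimizer rather than on the forward pass, so no mechanism pushes the trained weights to smear a concept across many neurons the way multiplicative unit-dropping can.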
I think there’s a lot of room for high-performance, relatively interpretable deep models. E.g., the human brain is high performance and seems much more interpretable than you’d expect from deep learning interpretability research. Given our limited ability to access and manipulate the brain’s internal state, something like brain stimulation reward should be basically impossible if the brain were as uninterpretable as current deep nets.