This is very very cool! At the end you say that looking at the gradient of the layer outputs is a change of basis. I don’t understand why this is the case. Each layer, taking (in a fully connected network) has on the order of n2 parameters if it outputs a vector of length n, so the gradient of that vector with respect to the parameters will have n2 entries. What did you have in mind when you said “change of basis”?
I think the key point here is that we’re applying a linear transformation to move from neuron space into feature space. Sometimes neurons and features do coincide and you can actually attribute particular concepts to neurons, but unless the neurons are a privileged basis there’s no reason to expect this in general. We’re taking the definition of feature here as a linear combination of neurons which represents some particular important and meaningful (and hopefully human-comprehensible) concept.
You take the gradient with respect to any preactivation of the next layer. Shouldn’t matter which one. That gets you a length n vector. Since the weights are linear, and we treat biases as an extra node of constant activation, the vector does not depend on which preactivation you chose.
The idea is to move to a basis in which there is no redundancy or cancellation between nodes, in a sense. Every node encodes one unique feature that means one unique thing.
This is very very cool! At the end you say that looking at the gradient of the layer outputs is a change of basis. I don’t understand why this is the case. Each layer, taking (in a fully connected network) has on the order of n2 parameters if it outputs a vector of length n, so the gradient of that vector with respect to the parameters will have n2 entries. What did you have in mind when you said “change of basis”?
I think the key point here is that we’re applying a linear transformation to move from neuron space into feature space. Sometimes neurons and features do coincide and you can actually attribute particular concepts to neurons, but unless the neurons are a privileged basis there’s no reason to expect this in general. We’re taking the definition of feature here as a linear combination of neurons which represents some particular important and meaningful (and hopefully human-comprehensible) concept.
You take the gradient with respect to any preactivation of the next layer. Shouldn’t matter which one. That gets you a length n vector. Since the weights are linear, and we treat biases as an extra node of constant activation, the vector does not depend on which preactivation you chose.
The idea is to move to a basis in which there is no redundancy or cancellation between nodes, in a sense. Every node encodes one unique feature that means one unique thing.