Garrett Baker comments on Basin broadness depends on the size and number of orthogonal features

Garrett Baker 28 Aug 2022 21:57 UTC
1 point
0
This is very very cool! At the end you say that looking at the gradient of the layer outputs is a change of basis. I don’t understand why this is the case. Each layer (in a fully connected network) has on the order of $n^{2}$ parameters if it outputs a vector of length $n$, so the gradient of that vector with respect to the parameters will have on the order of $n^{2}$ entries. What did you have in mind when you said “change of basis”?
- CallumMcDougall 29 Aug 2022 4:21 UTC
  2 points
  0
  I think the key point here is that we’re applying a linear transformation to move from neuron space into feature space. Sometimes neurons and features do coincide and you can actually attribute particular concepts to neurons, but unless the neurons are a privileged basis there’s no reason to expect this in general. We’re taking the definition of feature here as a linear combination of neurons which represents some particular important and meaningful (and hopefully human-comprehensible) concept.
- Lucius Bushnaq 29 Aug 2022 6:07 UTC
  1 point
  0
  You take the gradient of any one preactivation of the next layer with respect to its incoming weights. It shouldn’t matter which one. That gets you a length $n$ vector. Since the preactivations are linear in the weights, and we treat biases as an extra node of constant activation, the vector does not depend on which preactivation you chose.
  The idea is to move to a basis in which there is no redundancy or cancellation between nodes, in a sense. Every node encodes one unique feature that means one unique thing.
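
A minimal numerical sketch of the two claims in this thread, under one reading of them: that "the gradient" means the gradient of a single next-layer preactivation with respect to its own incoming weights, and that the "change of basis" is an invertible linear map applied to the layer's activations. The toy width `n`, the random weights `W`, and the orthogonal matrix `Q` are all hypothetical choices made for illustration, not anything taken from the post. An orthogonal `Q` is used only to keep the numerics well-conditioned; any invertible matrix would do.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # hypothetical toy layer width

# Activations of the current layer, with a constant 1 appended so that the
# bias behaves like one more incoming weight from a node of constant activation.
a = np.append(rng.normal(size=n), 1.0)   # shape (n + 1,)
W = rng.normal(size=(n, n + 1))          # next layer's weights, bias in last column


def preactivations(weights, acts):
    """Next-layer preactivations: linear in both the weights and the activations."""
    return weights @ acts


def grad_wrt_incoming_weights(i, eps=1e-6):
    """Central finite-difference gradient of preactivation i w.r.t. row i of W."""
    grad = np.zeros(n + 1)
    for j in range(n + 1):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        diff = preactivations(W_plus, a)[i] - preactivations(W_minus, a)[i]
        grad[j] = diff / (2 * eps)
    return grad


# Claim 1: every preactivation gives the same gradient vector, namely the
# activation vector itself (n entries plus the constant bias node).
grads = np.stack([grad_wrt_incoming_weights(i) for i in range(n)])
same_for_every_preactivation = np.allclose(grads, a, atol=1e-6)

# Claim 2: re-expressing the activations in another basis (features = Q @ a)
# and compensating in the weights leaves the next layer unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(n + 1, n + 1)))  # orthogonal, so inv(Q) == Q.T
features = Q @ a
W_features = W @ Q.T
same_preactivations = np.allclose(W_features @ features, W @ a)

print(same_for_every_preactivation, same_preactivations)
```

The per-preactivation gradient has only as many entries as the layer has nodes (plus the bias node), rather than on the order of $n^{2}$, because each preactivation depends only on its own row of weights. Which particular `Q` removes redundancy between nodes is the subject of the post itself and is not computed here.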