You take the gradient of any preactivation of the next layer with respect to its incoming weights. Shouldn't matter which one. That gets you a length-n vector. Since each preactivation is linear in its weights, that gradient is just the current layer's activation vector (with the bias treated as an extra node of constant activation), so the vector does not depend on which preactivation you chose.
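A minimal sketch of that check, assuming an ordinary linear layer in PyTorch (the sizes, the seed, and the names `a`, `W`, `z` are made up for illustration): the gradient of each preactivation `z[j]` with respect to its own weight row `W[j, :]` comes out as the same length-n activation vector, whichever `j` you pick.

```python
import torch

torch.manual_seed(0)

n, m = 5, 3                      # current-layer width (incl. bias node), next-layer width
a = torch.randn(n)               # current-layer activations (hypothetical values)
a[-1] = 1.0                      # bias treated as an extra node of constant activation 1
W = torch.randn(m, n, requires_grad=True)  # weights into the next layer

z = W @ a                        # next-layer preactivations

grads = []
for j in range(m):
    # Gradient of the scalar z[j] with respect to all of W; only row j is nonzero.
    (g,) = torch.autograd.grad(z[j], W, retain_graph=True)
    grads.append(g[j])           # row j holds d z[j] / d W[j, :]

# Every gradient equals the activation vector a, regardless of which j we picked.
for g in grads:
    assert torch.allclose(g, a)
print(grads[0])                  # the same length-n vector for every preactivation
```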
The idea is to move to a basis in which there is no redundancy or cancellation between nodes: every node encodes one unique feature that means one unique thing.
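The text doesn't say how such a basis would be found. As one hypothetical way to make "cancellation between nodes" concrete, here is a toy numpy example where two nodes' outgoing weights nearly negate each other, and a rotation into the right singular basis of the weight matrix (my assumption, not the author's method) exposes the redundant direction.

```python
import numpy as np

# Toy illustration of cancellation: columns are the outgoing weights of three
# current-layer nodes; node 1 nearly negates node 0, so the pair jointly
# encodes (almost) a single feature.
W = np.array([[ 1.0, -1.0, 0.5],
              [ 2.0, -2.1, 0.0],
              [-1.0,  1.0, 3.0]])

# One possible basis change (assumed here): rotate activation space into the
# right singular basis of W, where directions are orthogonal and ordered by
# how strongly the next layer actually reads them.
U, S, Vt = np.linalg.svd(W)
print(np.round(S, 3))            # the near-zero singular value marks the
                                 # direction along which the two nodes cancel

a = np.array([0.5, 0.5, 1.0])    # an activation pattern firing both nodes
a_rot = Vt @ a                   # the same activations in the rotated basis
print(np.round(a_rot, 3))
```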