I haven’t read that deeply into this yet, but my first reaction is that I don’t see what this gains you compared to a perspective in which the functions mapping the inputs of the network to the activations of the layers are regarded as the network’s elementary units.
Unless I’m misunderstanding something, when you look at the entire network f(x), where x is the input, each polytope of f(x), with its affine transformation, corresponds to one of the linear segments of f(x). The same goes for, say, the polytopes mapping layer 0 to layer L: you can just look at f_L(x_0), where f_L are the activations in layer L, and each linear segment of that should correspond to a polytope.
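To make that correspondence concrete, here’s a toy sketch (a hypothetical two-layer ReLU net in NumPy, not anything from the post): the binary pattern of which ReLUs fire indexes the polytope containing x, and gating the weights by that pattern recovers the affine map of that linear segment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer ReLU net: f(x) = W2 @ relu(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)
W2, b2 = rng.normal(size=(1, 8)), rng.normal(size=1)

def f(x):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

def pattern(x):
    # Which ReLUs fire: this binary code indexes the polytope containing x.
    return W1 @ x + b1 > 0

def affine_piece(x):
    # Inside one polytope, f is exactly affine: f(y) = A @ y + c.
    d = pattern(x).astype(float)          # zero out the inactive units
    A = W2 @ (d[:, None] * W1)
    c = W2 @ (d * b1) + b2
    return A, c

x = rng.normal(size=2)
A, c = affine_piece(x)
assert np.allclose(f(x), A @ x + c)       # the affine piece reproduces f at x
```

So the polytope’s (A, c) and the “linear segment of f(x)” are literally the same object, read off from the activation pattern.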
However, I don’t really see how you’d easily extend the polytope formulation to activation functions that aren’t piecewise linear, like tanh or the logistic sigmoid, while the functional analysis perspective can handle that pretty easily: your functions just become smoother.
In the functional analysis view, a “feature” is a description of a set of inputs that makes a particular element in a given layer’s function space take activation values close to their maximum value. E.g., some linear combination of neurons in a layer is most activated by pictures of dog heads. But there’s a lot more to know about a function f than what max({f(x)|x∈X}) is.
When you scale up a particular feature in a layer past its activation range in the training dataset, you are effectively querying the functions in subsequent layers outside the domains they’ve been trained to fit well. Instead of checking how many polytope boundaries you crossed, you can just check how much f′(x) varied between your start and end points.
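As a toy illustration of the boundary-crossing count (a single hypothetical ReLU layer, not the post’s setup): walking a straight path from an activation to a scaled-up copy of it and recording the activation pattern at each step counts exactly the polytope boundaries crossed.

```python
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)

def pattern(x):
    # Binary code of which ReLUs fire at x.
    return W1 @ x + b1 > 0

x = rng.normal(size=4)
start, end = x, 10.0 * x          # scale the input well past its usual range

# Walk the straight path from start to end, recording the code at each step.
ts = np.linspace(0.0, 1.0, 2001)
codes = np.array([pattern((1 - t) * start + t * end) for t in ts])

# Each boundary crossing flips at least one unit's state.
flips = int((codes[1:] != codes[:-1]).sum())

# Along a straight line each unit's preactivation is affine in t, so it can
# change sign at most once: total flips equals the number of units whose
# state differs between the two endpoints.
assert flips == int((pattern(start) != pattern(end)).sum())
print(flips)
```

Each flip is one hyperplane crossed, i.e. one place where the local linear map (the f′(x) above) changes.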
Scaling up some of the activations in a layer by a constant factor means you’re increasing the norm of the corresponding functions, changing the principal component basis of the layer’s function space. So it shouldn’t be surprising if subsequent layers get messed up by that.
Thanks for your comment!
However, I don’t really see how you’d easily extend the polytope formulation to activation functions that aren’t piecewise linear, like tanh or the logistic sigmoid, while the functional analysis perspective can handle that pretty easily: your functions just become smoother.
Extending the polytope lens to activation functions such as sigmoids, softmax, or GELU is the subject of a paper by Balestriero & Baraniuk (2018): https://arxiv.org/abs/1810.09274
In the case of GELU and some similar activation functions, you’d need to replace the binary spline-code vectors with vectors whose elements take values in (0, 1).
There’s some further explanation in Appendix C!
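Here’s one rough way to read that soft-code idea (my own toy illustration, not code from that paper): exact GELU is x·Φ(x), so the Gaussian CDF gate Φ(x) plays the role that the binary 0/1 gate plays for ReLU, only with values in (0, 1).

```python
import numpy as np
from math import erf, sqrt

def gate(z):
    # Standard normal CDF Phi, elementwise; exact GELU(x) = x * Phi(x).
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])

pre = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])     # example preactivations

hard_code = (pre > 0).astype(float)   # ReLU's binary code: entries in {0, 1}
soft_code = gate(pre)                 # GELU's gate: entries in (0, 1)

# Both activations are "code * preactivation"; only the code's range differs.
relu_out = hard_code * pre            # == np.maximum(pre, 0)
gelu_out = soft_code * pre            # == GELU(pre)

assert np.allclose(relu_out, np.maximum(pre, 0.0))
# Sanity check against GELU's well-known tanh approximation:
approx = 0.5 * pre * (1.0 + np.tanh(sqrt(2.0 / np.pi) * (pre + 0.044715 * pre**3)))
assert np.allclose(gelu_out, approx, atol=1e-2)
# Far from zero, the soft code saturates to the hard one:
assert np.allclose(gate(np.array([8.0, -8.0])), [1.0, 0.0], atol=1e-9)
```

In the saturating regime the soft code recovers ReLU’s hard polytope code, which is why the extension is natural.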
In the functional analysis view, a “feature” is a description of a set of inputs that makes a particular element in a given layer’s function space take activation values close to their maximum value. E.g., some linear combination of neurons in a layer is most activated by pictures of dog heads.
This, indeed, is the assumption we wish to relax.
But there’s a lot more to know about a function f than what max({f(x) | x ∈ X}) is.
Agreed!
Scaling up some of the activations in a layer by a constant factor means you’re increasing the norm of the corresponding functions, changing the principal component basis of the layer’s function space. So it shouldn’t be surprising if subsequent layers get messed up by that.
There are many lenses that let us see how unsurprising this experiment was, and this is another one! We only use this experiment to show that it’s surprising if you view features as directions without qualifying that view with a distribution of activation magnitudes over which the semantics remain valid (called a ‘distribution of validity’ in this post).
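A crude sketch of how one might estimate such a distribution of validity empirically (hypothetical data and thresholds of my own choosing, not the post’s method): record the training-set range of activations along a feature direction and flag anything scaled beyond it.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical: activations of one layer over a training set, one row per input.
train_acts = rng.normal(size=(10_000, 32))

direction = rng.normal(size=32)
direction /= np.linalg.norm(direction)        # a candidate feature direction

# Empirical range of projections onto the feature direction.
proj = train_acts @ direction
lo, hi = np.quantile(proj, [0.001, 0.999])

def in_distribution_of_validity(act):
    # Is this activation's component along `direction` inside the range
    # seen in training? (a crude stand-in for the post's notion)
    return lo <= act @ direction <= hi

print(in_distribution_of_validity(0.5 * (lo + hi) * direction))   # True
print(in_distribution_of_validity((hi + 1.0) * direction))        # False
```

Outside that range, nothing guarantees the subsequent layers’ semantics still hold, which is the point of the experiment.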