Illustration of what non-robust features might look like
This looks… an awful lot like what one would expect to see out of a convolutional network. Small-scale features and textures end up as the main discriminators because they require fewer layers and hence optimize faster, outcompeting larger-scale classifiers. (To the extent that you can think of training as competition between subnetworks.)
(Sanity check: we haven’t solved the problem of deeper networks taking longer to train, right? I know ReLU helps with vanishing gradients.)
It’s too bad fully-connected networks don’t scale. I’d be interested to see what maximum-activation examples looked like for a fully-connected network.
(Fair warning: I’m definitely in the “amateur” category here. Usual caveats apply—using incorrect terminology, etc, etc. Feel free to correct me.)
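For what it's worth, the maximum-activation idea can be sketched without a full vision stack. A minimal, hypothetical toy (not any particular paper's method): gradient ascent on the input of a single fully-connected ReLU unit, where the optimum is just the unit's weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one fully-connected ReLU layer; we find a
# maximum-activation input for unit 0 by gradient ascent on the input.
W = rng.normal(size=(4, 16))           # 4 units, 16-dim "image"
x = rng.normal(size=16) * 0.01
if W[0] @ x < 0:                       # make sure the ReLU starts "on"
    x = -x

for _ in range(200):
    grad = W[0] * float(W[0] @ x > 0)  # d relu(W[0]·x) / dx
    x += 0.1 * grad                    # ascent step
    x /= np.linalg.norm(x)             # constrain input to the unit sphere

# For a single linear+ReLU unit, the optimal input aligns with W[0]:
cos = (x @ W[0]) / (np.linalg.norm(x) * np.linalg.norm(W[0]))
print(cos > 0.99)
```

The interesting case is of course doing this through many layers of a trained network, where the resulting inputs are the texture-heavy images alluded to above.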
> we haven’t solved the problem of deeper networks taking longer to train, right
My understanding is that the vanishing gradient problem has been largely mitigated by introducing skip connections (first with ResNet, now standard in CNN architectures), allowing for networks with hundreds of layers.
> It’s too bad fully-connected networks don’t scale.
I’ve heard people say vision transformers are sort of like going back to MLPs for vision. The disadvantage of going away from the CNN architecture (in particular weight sharing across receptive fields) is that you end up with more parameters and thus require a lot more data to train.
I just did a search and came across “MLP-Mixer: An all-MLP Architecture for Vision”. Together with “Patches Are All You Need?”, the basic theme I’m seeing is that putting in the prior of focusing on small patches is really powerful. In fact, it may be that the vision transformer can do better than CNNs (with enough data) because this prior is built in, not because of the attention layers. Which is just another example of the importance of doing rigorous comparisons and ablation studies before jumping to conclusions about what makes architecture X better than architecture Y.
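The "patch prior" these papers share is tiny to write down. A minimal sketch (sizes made up): split an image into non-overlapping patches and flatten each into a token vector, which is roughly what the patch-embedding step of ViT/MLP-Mixer does before the mixing layers, minus the learned linear projection.

```python
import numpy as np

H = W = 32; P = 8; C = 3  # image size, patch size, channels (arbitrary)
img = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)

# Split into (H/P) x (W/P) non-overlapping P x P patches,
# then flatten each patch into one token vector.
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)      # (rows, cols, P, P, C)
              .reshape(-1, P * P * C))       # one flat vector per patch

print(patches.shape)  # (16, 192): 16 tokens of dimension 8*8*3
```

Everything downstream then only ever sees these per-patch vectors, which is the locality prior in its purest form.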
> My understanding is that the vanishing gradient problem has been largely mitigated by introducing skip connections (first with ResNet, now standard in CNN architectures), allowing for networks with hundreds of layers.
Does this actually solve the problem, or just mask it? To a first approximation, a network with skip connections is a bunch of shallow networks in parallel with deeper ones. If the shallow portions end up training faster and out-competing the deeper portions...
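That intuition can be made exact with linear residual branches. A toy sketch of the "unraveled view" of residual networks: two stacked residual blocks compute precisely a sum over four paths of depth 0, 1, 1, and 2, i.e. an implicit ensemble dominated by shallow paths.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=d)

# Two hypothetical residual blocks with linear residual branches F1, F2.
F1 = rng.normal(size=(d, d)) * 0.1
F2 = rng.normal(size=(d, d)) * 0.1

# Standard forward pass through the two blocks:
h = x + F1 @ x
y = h + F2 @ h

# "Unraveled" view: the same output, written as a sum over four paths
# of depth 0, 1, 1, and 2 through the network.
paths = x + F1 @ x + F2 @ x + F2 @ (F1 @ x)

print(np.allclose(y, paths))
```

With n blocks you get 2^n such paths, and the short ones carry most of the gradient signal, so the question of shallow paths out-competing deep ones is a real one.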
> It’s too bad fully-connected networks don’t scale. I’d be interested to see what maximum-activation examples looked like for a fully-connected network.
They scale these days. See https://www.gwern.net/notes/FC
Oh interesting!
I’d be interested in seeing what maximum-activation examples look like for fully-connected networks.
Nice! wow.