Illustration of what non-robust features might look like
This looks… an awful lot like what one would expect to see out of a convolutional network. Small-scale features and textures end up as the main discriminators because they require fewer layers and hence optimize faster, outcompeting larger-scale classifiers. (To the extent that you can think of training as competition between subnetworks.)
(Sanity check: we haven’t solved the problem of deeper networks taking longer to train, right? I know ReLU helps with vanishing gradients.)
It’s too bad fully-connected networks don’t scale. I’d be interested to see what maximum-activation examples looked like for a fully-connected network.
(Fair warning: I’m definitely in the “amateur” category here. Usual caveats apply—using incorrect terminology, etc, etc. Feel free to correct me.)
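For what it's worth, the maximum-activation idea can be sketched without a full vision stack. A minimal, hypothetical toy (not any particular paper's method): gradient ascent on the input of a single fully-connected ReLU unit, where the optimum is just the unit's weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one fully-connected ReLU layer; we find a
# maximum-activation input for unit 0 by gradient ascent on the input.
W = rng.normal(size=(4, 16))           # 4 units, 16-dim "image"
x = rng.normal(size=16) * 0.01
if W[0] @ x < 0:                       # make sure the ReLU starts "on"
    x = -x

for _ in range(200):
    grad = W[0] * float(W[0] @ x > 0)  # d relu(W[0]·x) / dx
    x += 0.1 * grad                    # ascent step
    x /= np.linalg.norm(x)             # constrain input to the unit sphere

# For a single linear+ReLU unit, the optimal input aligns with W[0]:
cos = (x @ W[0]) / (np.linalg.norm(x) * np.linalg.norm(W[0]))
print(cos > 0.99)
```

The interesting case is of course doing this through many layers of a trained network, where the resulting inputs are the texture-heavy images alluded to above.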
> we haven’t solved the problem of deeper networks taking longer to train, right
My understanding is that the vanishing gradient problem has been largely mitigated by introducing skip connections (first with ResNet, now standard in CNN architectures), allowing for networks with hundreds of layers.
> It’s too bad fully-connected networks don’t scale.
I’ve heard people say vision transformers are sort of like going back to MLPs for vision. The disadvantage of going away from the CNN architecture (in particular weight sharing across receptive fields) is that you end up with more parameters and thus require a lot more data to train.
I just did a search and came across “MLP-Mixer: An all-MLP Architecture for Vision”. Together with “Patches Are All You Need?”, the basic theme I’m seeing is that putting in the prior of focusing on small patches is really powerful. In fact, it may be that the vision transformer can do better than CNNs (with enough data) because this prior is built in, not because of the attention layers. Which is just another example of the importance of doing rigorous comparisons and ablation studies before jumping to conclusions about what makes architecture X better than architecture Y.
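The "patch prior" these papers share is tiny to write down. A minimal sketch (sizes made up): split an image into non-overlapping patches and flatten each into a token vector, which is roughly what the patch-embedding step of ViT/MLP-Mixer does before the mixing layers, minus the learned linear projection.

```python
import numpy as np

H = W = 32; P = 8; C = 3  # image size, patch size, channels (arbitrary)
img = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)

# Split into (H/P) x (W/P) non-overlapping P x P patches,
# then flatten each patch into one token vector.
patches = (img.reshape(H // P, P, W // P, P, C)
              .transpose(0, 2, 1, 3, 4)      # (rows, cols, P, P, C)
              .reshape(-1, P * P * C))       # one flat vector per patch

print(patches.shape)  # (16, 192): 16 tokens of dimension 8*8*3
```

Everything downstream then only ever sees these per-patch vectors, which is the locality prior in its purest form.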
> My understanding is that the vanishing gradient problem has been largely mitigated by introducing skip connections (first with ResNet, now standard in CNN architectures), allowing for networks with hundreds of layers.
Does this actually solve the problem, or just mask it? To a first approximation, a network with skip connections is a bunch of shallow networks in parallel with deeper ones. If the shallow portions end up training faster and out-competing the deeper portions...
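That intuition can be made exact with linear residual branches. A toy sketch of the "unraveled view" of residual networks: two stacked residual blocks compute precisely a sum over four paths of depth 0, 1, 1, and 2, i.e. an implicit ensemble dominated by shallow paths.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=d)

# Two hypothetical residual blocks with linear residual branches F1, F2.
F1 = rng.normal(size=(d, d)) * 0.1
F2 = rng.normal(size=(d, d)) * 0.1

# Standard forward pass through the two blocks:
h = x + F1 @ x
y = h + F2 @ h

# "Unraveled" view: the same output, written as a sum over four paths
# of depth 0, 1, 1, and 2 through the network.
paths = x + F1 @ x + F2 @ x + F2 @ (F1 @ x)

print(np.allclose(y, paths))
```

With n blocks you get 2^n such paths, and the short ones carry most of the gradient signal, so the question of shallow paths out-competing deep ones is a real one.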
> It’s too bad fully-connected networks don’t scale. I’d be interested to see what maximum-activation examples looked like for a fully-connected network.
They scale these days. See https://www.gwern.net/notes/FC
Oh interesting!
I’d be interested in seeing what maximum-activation examples look like for fully-connected networks.
Nice! wow.