gwern comments on How I’m thinking about GPT-N

gwern 17 Jan 2022 22:04 UTC
8 points

To my knowledge the most used regularization method in deep learning, dropout, doesn’t make models simpler in the sense of being more compressible.

Yes, it does (as should make sense, because if you can drop out a parameter entirely, you don’t need it, and if it succeeds in fostering modularity or generalization, that should make it much easier to prune), and this was one of the justifications for dropout, and that has nice Bayesian interpretations too. (I have a few relevant cites in my sparsity tag.)
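To make the two operations in this argument concrete, here is a minimal sketch in plain Python. It is illustrative only: the function names (`inverted_dropout`, `magnitude_prune`, `sparsity`) are hypothetical and not taken from any of the cited papers, and it shows only the mechanics of dropping units and pruning small weights. It does not demonstrate that dropout-trained networks prune better; that would take an actual training run.

```python
import random

def inverted_dropout(activations, p_drop, rng):
    """Zero each unit with probability p_drop; rescale survivors by 1/(1 - p_drop)
    so the expected activation is unchanged (the usual 'inverted' formulation)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def magnitude_prune(weights, fraction):
    """Return a copy with the smallest-magnitude `fraction` of weights set to zero."""
    n_prune = int(len(weights) * fraction)
    # Indices ordered from smallest to largest absolute weight.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)

# Toy weights (made up for illustration, not from a trained model).
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
pruned = magnitude_prune(weights, 0.5)
```

Under this framing, the claim is that a network trained with `inverted_dropout` applied to its hidden units should lose less accuracy after `magnitude_prune` than one trained without it, at the same `sparsity`.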
- delton137 17 Jan 2022 23:19 UTC
  2 points
  Parent
  The idea that using dropout makes models simpler is not intuitive to me because, according to Hinton, dropout essentially does the same thing as ensembling. If what you end up with is something equivalent to an ensemble of smaller networks, then it’s not clear to me that it would be easier to prune.
  
  One of the papers you linked to appears to study dropout in the context of Bayesian modeling, and the authors argue that it encourages sparsity. I’m willing to buy that it does in fact reduce complexity and improve compressibility, but I’m also not sure any of this is 100% clear-cut.
  - jacob_cannell 18 Jan 2022 1:01 UTC
    3 points
    Parent
    It’s not that dropout provides some ensembling secret sauce; instead neural nets are inherently ensembles proportional to their level of overcompleteness. Dropout (like other regularizers) helps ensure they are ensembles of low complexity sub-models, rather than ensembles of over-fit higher complexity sub-models (see also: lottery tickets, pruning, grokking, double descent).
  - delton137 17 Jan 2022 23:27 UTC
    1 point
    Parent
    By the way, if you look at Filan et al.’s paper “Clusterability in Neural Networks”, there is a lot of variance in their results, but generally speaking they find that L1 regularization leads to slightly more clusterability than L2 or dropout.
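
On the L1 versus L2 comparison in the last comment: one well-known difference between the two penalties, separate from anything Filan et al. measure, is that an L1 proximal step can set a weight to exactly zero, while an L2 gradient step only shrinks it by a constant factor. The sketch below isolates the penalty term and ignores the data gradient entirely. The function names and the values of `lr` and `lam` are assumptions chosen for illustration, and whether this sparsity effect explains the clusterability result is not something the code shows.

```python
def l1_prox_step(w, lr, lam):
    """Soft-thresholding: the proximal step for the penalty lam * |w|.
    Weights with |w| <= lr * lam are set exactly to zero."""
    t = lr * lam
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l2_decay_step(w, lr, lam):
    """Gradient step on the penalty (lam / 2) * w**2: multiplicative shrinkage."""
    return w * (1.0 - lr * lam)

def apply_repeatedly(step, w, n_steps, lr=0.1, lam=1.0):
    """Apply one penalty-only update n_steps times (no data gradient)."""
    for _ in range(n_steps):
        w = step(w, lr, lam)
    return w

w_l1 = apply_repeatedly(l1_prox_step, 0.05, 50)   # reaches exactly 0.0
w_l2 = apply_repeatedly(l2_decay_step, 0.05, 50)  # small but never exactly zero
```

A weight that is exactly zero can be removed from the graph outright, which is one reason L1-style penalties are often associated with sparser connectivity.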