LawrenceC comments on Gradient surfing: the hidden role of regularization

LawrenceC 6 Feb 2023 9:56 UTC
LW: 13 AF: 9
> In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?

Worth noting that in some cases you can get grokking without explicit regularization using full-batch gradient descent, provided you use an adaptive optimizer, due to the slingshot mechanism: https://arxiv.org/abs/2206.04817
Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get them with transformers of 2+ layers, but not reliably with 1-layer transformers (and not at all with 1-layer MLPs).
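
For concreteness, here is a minimal sketch of the ingredients mentioned above: a modular-addition dataset split, and one step of an adaptive optimizer with no weight-decay term. It assumes Adam as the adaptive optimizer, and the function names (`modular_addition_split`, `init_adam_state`, `adam_step`) and default hyperparameters are hypothetical choices for illustration, not the setup from the linked paper or from my own runs. "Full batch" here just means the gradients passed to `adam_step` are computed over the entire training split at every step.

```python
import itertools
import math
import random


def modular_addition_split(p=7, train_frac=0.5, seed=0):
    """Return (train, test) lists of (a, b, (a + b) % p) triples."""
    triples = [(a, b, (a + b) % p)
               for a, b in itertools.product(range(p), repeat=2)]
    random.Random(seed).shuffle(triples)
    cut = int(train_frac * len(triples))
    return triples[:cut], triples[cut:]


def init_adam_state(n_params):
    """Fresh Adam state: step counter plus first/second moment estimates."""
    return {"t": 0, "m": [0.0] * n_params, "v": [0.0] * n_params}


def adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. Deliberately has no weight-decay term,
    i.e. no explicit regularization. Mutates `state`, returns new params."""
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m"][i] / (1 - beta1 ** t)  # bias-corrected mean
        v_hat = state["v"][i] / (1 - beta2 ** t)  # bias-corrected variance
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params
```

The per-parameter rescaling by `sqrt(v_hat)` is what makes the optimizer "adaptive"; plain gradient descent would use `p - lr * g` instead.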

(As an aside, I also think grokking is not very interesting to study. If you want a generalization phenomenon to study, I’d just study a task without grokking, where you can get immediate generalization or memorization depending on hyperparameters.)
- LawrenceC 6 Feb 2023 10:02 UTC
  LW: 4 AF: 3
  As for other forms of noise inducing grokking: we do see grokking with dropout! So there’s some reason to think noise → grokking.
  (Source: Figure 28 from https://arxiv.org/abs/2301.05217)
  Also worth noting that grokking is pretty hyperparameter-sensitive; it’s possible you just haven’t found the right size/form of noise yet!
  - Jesse Hoogland 6 Feb 2023 16:53 UTC
    2 points
    Thanks Lawrence! I had missed the slingshot mechanism paper, so this is great!
    
    > (As an aside, I also think grokking is not very interesting to study. If you want a generalization phenomenon to study, I’d just study a task without grokking, where you can get immediate generalization or memorization depending on hyperparameters.)
    
    I totally agree that there are much more interesting tasks than grokking with modular arithmetic, but it seemed like an easy way to test the premise.
    
    > Also worth noting that grokking is pretty hyperparameter-sensitive; it’s possible you just haven’t found the right size/form of noise yet!
    
    I will continue the exploration!
- gwern 5 Jun 2024 0:11 UTC
  LW: 3 AF: 2
  
  > Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get them with transformers of 2+ layers, but not reliably with 1-layer transformers (and not at all with 1-layer MLPs).
  
  Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead ‘skinny’ NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
  - LawrenceC 7 Jun 2024 3:14 UTC
    LW: 2 AF: 2
    > Have you tried instead ‘skinny’ NNs with a bias towards depth,

    I haven’t. The problem with skinny NNs is that stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot → grokking were done in the hope of interpreting the model before and after the slingshots.
    That being said, you’re probably right that having more layers is related to slingshots.

    > (Particularly for MLPs, which are notorious for overfitting due to their power.)

    What do you mean by power here?
    - gwern 7 Jun 2024 19:54 UTC
      LW: 4 AF: 4
      
      > What do you mean by power here?
      
      Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of “can be real smart and learn complicated stuff” is.