In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?
Worth noting that you can get grokking in some cases without explicit regularization with full batch gradient descent, if you use an adaptive optimizer, due to the slingshot mechanism: https://arxiv.org/abs/2206.04817
Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get them to happen with 2+ layer transformers, but not reliably with 1-layer transformers (and not at all with 1-layer MLPs).
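For readers who want to see what "full batch gradient descent with an adaptive optimizer and no explicit regularization" means mechanically, here is a minimal sketch in plain Python. It is a hand-rolled Adam step with no weight-decay term, applied to a toy linear regression. The names `adam_step`, `full_batch_grad`, and `train`, the toy loss, and all hyperparameter values are illustrative assumptions and are not taken from the slingshot paper; this toy problem is not expected to show slingshots or grokking.

```python
import math

def adam_step(params, grads, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with no weight-decay term (no explicit regularization).

    params, grads, m, v are equal-length lists of floats; t is the 1-based step.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g       # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g   # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)            # bias correction
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

def full_batch_grad(params, data):
    """Gradient of mean squared error for y ~ w*x + b over the ENTIRE dataset."""
    w, b = params
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return [grad_w, grad_b]

def train(data, steps=2000, lr=0.05):
    """Full-batch training loop: every step uses all of the data, no minibatch noise."""
    params, m, v = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        grads = full_batch_grad(params, data)
        params, m, v = adam_step(params, grads, m, v, t, lr)
    return params
```

The point of the sketch is only what is absent: no minibatch sampling and no penalty on the weights, so any regularizing effect would have to come from the optimizer dynamics themselves.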
(As an aside, I also think grokking is not very interesting to study: if you want a generalization phenomenon to study, I’d just study a task without grokking, where you can get immediate generalization or memorization depending on hyperparameters.)
As for other forms of noise inducing grokking: we do see grokking with dropout! So there’s some reason to think noise → grokking.
(Source: Figure 28 from https://arxiv.org/abs/2301.05217)
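To make the "dropout is a form of noise" reading concrete, here is a minimal sketch of standard inverted dropout in plain Python. Each activation is multiplied by an independent random mask, so it is multiplicative noise whose expectation leaves the activation unchanged. The function name `dropout` and its signature are assumptions for illustration, not the code used in the cited paper.

```python
import random

def dropout(activations, p, rng):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p).

    The rescaling keeps the expected value of each activation unchanged,
    so the layer output is the clean activation times mean-one random noise.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Usage would be something like `dropout(hidden, p=0.1, rng=random.Random(0))` during training only; at evaluation time the activations pass through unchanged.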
Also worth noting that grokking is pretty hyperparameter sensitive—it’s possible you just haven’t found the right size/form of noise yet!
Thanks Lawrence! I had missed the slingshot mechanism paper, so this is great!
I totally agree on there being much more interesting tasks than grokking with modulo arithmetic, but it seemed like an easy way to test the premise.
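For anyone wanting to try the same kind of test, a modular arithmetic task is cheap to set up. The sketch below builds every pair (a, b) with label (a + b) mod p and holds out a random subset for testing. The function name, the default seed, and the particular values of `p` and `train_frac` in the usage note are illustrative assumptions, not the settings used in this thread.

```python
import random

def modular_addition_dataset(p, train_frac, seed=0):
    """Return (train, test) lists of ((a, b), (a + b) % p) examples.

    All p*p input pairs are enumerated, shuffled with a fixed seed,
    and split so that train and test inputs are disjoint.
    """
    pairs = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:]
```

For example, `modular_addition_dataset(p=113, train_frac=0.3)` gives a small training set and a large held-out set, which is the regime where a gap between memorization and generalization is easy to see.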
I will continue the exploration!
Shallow/wide NNs seem to be bad in a lot of ways. Have you tried instead ‘skinny’ NNs with a bias towards depth, which ought to have inductive biases towards more algorithmic, less memorization-heavy solutions? (Particularly for MLPs, which are notorious for overfitting due to their power.)
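One way to make the wide-versus-skinny comparison fair is to hold the parameter budget roughly fixed and trade width for depth. The helper below is a rough sketch that counts weights and biases in a fully connected MLP; the name `mlp_param_count` and the layer sizes in the usage note are hypothetical and chosen only for illustration.

```python
def mlp_param_count(d_in, hidden_widths, d_out):
    """Number of weights plus biases in a fully connected MLP.

    hidden_widths lists the width of each hidden layer, in order.
    """
    sizes = [d_in, *hidden_widths, d_out]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# Hypothetical sizes: two one-hot inputs mod 113 (226 dims), 113 output classes.
wide_shallow = mlp_param_count(226, [512], 113)       # one wide hidden layer
skinny_deep = mlp_param_count(226, [128] * 9, 113)    # nine narrow hidden layers
```

With these assumed sizes the two networks come out within about one percent of each other in parameter count, so any difference in behavior would come from depth rather than raw size.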
I haven’t—the problem with skinny NNs is stacking MLP layers quickly makes things uninterpretable, and my attempts to reproduce slingshot → grokking were done with the hope of interpreting the model before/after the slingshots.
That being said, you’re probably correct that having more layers does seem related to slingshots.
What do you mean by power here?
Just a handwavy term for VC dimension, expressivity, number of unique models, or whatever your favorite technical reification of “can be real smart and learn complicated stuff” is.