In particular, can we use noise to make a model grok even in the absence of regularization (which is currently a requirement to make models grok with SGD)?
Worth noting that you can get grokking in some cases without explicit regularization under full-batch gradient descent if you use an adaptive optimizer, due to the slingshot mechanism: https://arxiv.org/abs/2206.04817
Unfortunately, reproducing slingshots reliably was pretty challenging for me; I could consistently get them to happen on transformers with 2+ layers, but not reliably on 1-layer transformers (and not at all on 1-layer MLPs).
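For anyone who wants to poke at this themselves, here is a minimal sketch of the regime the slingshot paper describes (this is my own toy reconstruction, not the paper's code): modular addition trained full-batch with Adam and no explicit regularization. The modulus, widths, learning rate, and step count are all illustrative guesses, and the model is a small embedding + MLP stand-in rather than the transformers discussed above.

```python
# Sketch only: full-batch training with an adaptive optimizer (Adam) and no
# explicit regularization, the setup where slingshot-induced grokking has
# been reported. All hyperparameters here are guesses, not the paper's.
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 113                                                  # modulus for a + b (mod p)
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
train = torch.randperm(len(pairs))[: len(pairs) // 2]    # random 50% train split
x, y = pairs[train], labels[train]

class ToyNet(nn.Module):
    """Embedding + 2-layer MLP; a stand-in for the transformers in the thread."""
    def __init__(self, p, d=128, h=512):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(nn.Linear(d, h), nn.ReLU(), nn.Linear(h, p))

    def forward(self, x):
        return self.mlp(self.emb(x).sum(dim=1))          # sum the two token embeddings

model = ToyNet(p)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.0)  # no regularization
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)     # full batch: every training example each step
    loss.backward()
    opt.step()
    if step % 1_000 == 0:
        print(step, loss.item())    # slingshots show up as sudden training-loss spikes
```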
(As an aside, I also think grokking is not very interesting to study; if you want a generalization phenomenon to study, I'd just pick a task without grokking, where you can get immediate generalization or memorization depending on hyperparameters.)
As for other forms of noise inducing grokking: we do see grokking with dropout! So there’s some reason to think noise → grokking.
(Source: Figure 28 from https://arxiv.org/abs/2301.05217)
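If it helps, here is one hedged way to add that dropout noise, reusing the toy model sketched above; the dropout rate is an illustrative guess, not the setting behind Figure 28.

```python
# Sketch: inject dropout noise into the hidden layer of the toy model above.
# p_drop=0.1 is a placeholder to sweep, not the value from the cited paper.
import torch.nn as nn

class ToyNetDropout(nn.Module):
    def __init__(self, p, d=128, h=512, p_drop=0.1):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        self.mlp = nn.Sequential(
            nn.Linear(d, h), nn.ReLU(),
            nn.Dropout(p_drop),      # the noise source; active only in .train() mode
            nn.Linear(h, p),
        )

    def forward(self, x):
        return self.mlp(self.emb(x).sum(dim=1))
```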
Also worth noting that grokking is pretty hyperparameter-sensitive; it's possible you just haven't found the right size or form of noise yet!
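One alternative noise form you could sweep alongside dropout is per-step Gaussian gradient noise; below is a minimal sketch (my own suggestion, not something from the papers above), where sigma is a free hyperparameter to scan rather than a recommended value.

```python
# Sketch: a training step that perturbs gradients with Gaussian noise before
# the optimizer update. sigma is a free hyperparameter to sweep, not advice.
import torch

def noisy_step(model, opt, loss, sigma=1e-3):
    opt.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            if param.grad is not None:
                param.grad += sigma * torch.randn_like(param.grad)
    opt.step()
```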
Thanks Lawrence! I had missed the slingshot mechanism paper, so this is great!
I totally agree that there are much more interesting tasks than grokking on modular arithmetic, but it seemed like an easy way to test the premise.
I will continue the exploration!