Level of surprisal:

Low that any machine learning algorithm can do this.
High that a network that small can do this with regularization and/or weight decay.
Extremely high that, as they imply in the paper, this works on a network trained without weight decay or regularization.
As I describe above, I think grokking requires some mechanism to disrupt the shallow patterns so general patterns can take their place. Regularization and weight decay both do this, but so does the stochasticity of SGD steps.
Grokking still happens with no regularization and full-batch updates, but to a greatly reduced degree. In that case, I suspect the non-optimal step sizes for each parameter act as a very weak form of stochasticity. Possibly, perfectly tuning the learning rate for each parameter on each step would prevent grokking.
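To make the "disruption" mechanism concrete, here is a minimal sketch (my own illustration, not from the paper) of why weight decay erodes a memorized solution: at a minimum of the data loss the gradient is near zero, so the decay term is the only force acting, and it steadily shrinks the parameters until the shallow solution can no longer hold.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)  # stand-in for parameters of a memorized solution

def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    # Weight decay adds weight_decay * w to the gradient, so parameters
    # that the data loss does not actively defend are pulled toward zero.
    return w - lr * (grad + weight_decay * w)

# At a flat memorizing minimum the data gradient is ~0, so decay alone
# multiplies the weights by (1 - lr * weight_decay) every step.
w_t = w.copy()
for _ in range(100):
    w_t = sgd_step(w_t, grad=np.zeros_like(w_t))

# The memorized solution has been measurably eroded.
assert np.linalg.norm(w_t) < np.linalg.norm(w)
```

Stochastic gradient noise plays an analogous role: it perturbs the weights away from the memorizing minimum on every step, just less systematically than an explicit decay term.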