These randomly trained models, are they uncertain or confidently wrong on the test data?
My model of what is going on here is that stochastic gradient descent is acting roughly like an MCMC sampling method: it's producing a random sample from the space of low-loss parameters, and the simpler hypotheses correspond to larger parameter-space volumes.
When the network needs to memorize, it needs to use nearly all of its parameters, meaning a small parameter-space volume. When the network is learning a pattern, it's only using a small fraction of its parameters on the pattern, and the rest of the parameters can be almost anything, so long as they don't get in the way. This means simple hypotheses have a huge volume in parameter space. (This is basically the lottery ticket hypothesis, and it explains why network distillation is so effective.)
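As a toy illustration of the volume point (my own sketch, with made-up numbers rather than anything measured on an actual network): if each "pinned" parameter has to land in a narrow band and the rest are free, the chance of a random parameter vector landing in a solution region shrinks exponentially in the number of pinned parameters.

```python
import numpy as np

# Toy sketch of the volume argument (my own illustration). A "solution"
# is a region of parameter space where some subset of the parameters must
# each lie in a narrow band of width eps, and the remaining parameters
# are free to take any value in [-1, 1].
rng = np.random.default_rng(0)

d = 20        # total number of parameters
eps = 0.1     # width of the band each constrained parameter must land in

def hit_probability(num_constrained, trials=200_000):
    """Fraction of uniform random parameter draws that land in a region
    constraining only `num_constrained` of the d parameters."""
    # The unconstrained parameters always satisfy the region, so we only
    # need to sample the constrained ones.
    samples = rng.uniform(-1.0, 1.0, size=(trials, num_constrained))
    hits = np.all(np.abs(samples) < eps / 2, axis=1)
    return hits.mean()

# "Pattern" solution: only 3 of the 20 parameters are pinned down.
print(hit_probability(3))    # ~ (eps/2)**3  = 1.25e-4
# "Memorization" solution: all 20 parameters are pinned down.
print(hit_probability(20))   # ~ (eps/2)**20 ~ 1e-26: never hit in these draws
```

The absolute numbers are meaningless; the point is the exponential gap in volume between the region that pins a few parameters and the region that pins all of them.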
MCMC means sampling from a distribution proportional to $e^{-\beta L(\theta)}$, so larger parameter-space volumes will be more likely to be sampled.
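Spelling that step out (my own rough formalisation, with $\beta$ standing in for an effective inverse temperature set by the learning rate and gradient noise):

```latex
% Probability mass of a region \Theta_H of parameter space under a
% Boltzmann-like stationary distribution over parameters:
P(\Theta_H) \;\propto\; \int_{\Theta_H} e^{-\beta L(\theta)}\, d\theta
            \;\approx\; \mathrm{Vol}(\Theta_H)\, e^{-\beta L_{\min}}
% so among regions achieving (near-)minimal loss, the one with the
% larger volume gets almost all of the probability mass.
```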
So training will tend to choose the simplest hypothesis available.
Grokking makes sense if the simpler hypotheses are sometimes harder for local greedy search methods to find.