The negative result suggests that the strong form of the claim “regularization = navigability” is probably wrong. A smaller weight norm actually is good for generalization (just as the learning theorists would have you believe): you’ll have better luck moving along the set of minimum-loss weights in the direction that shrinks the norm than in any other direction.
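To make that picture concrete, here is a toy sketch (my construction, not from the discussion above) in underdetermined linear regression, where the set of zero-loss weights is an affine subspace. "Moving along it in the norm-minimizing way" is then just a gradient step on the norm projected onto the null space of the data matrix, and it converges to the minimum-norm interpolator:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                        # underdetermined: many zero-loss solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_min = np.linalg.pinv(X) @ y       # the minimum-norm interpolator

# Start at a different interpolator: min-norm solution plus a null-space part.
_, _, Vt = np.linalg.svd(X)
null_basis = Vt[n:]                 # rows spanning null(X)
w = w_min + null_basis.T @ rng.normal(size=d - n)

P_null = null_basis.T @ null_basis  # projector onto null(X)
for _ in range(500):
    # Shrink ||w|| while staying on the zero-training-loss set:
    # step along -w, projected so that X @ w never changes.
    w = w - 0.1 * (P_null @ w)

assert np.allclose(X @ w, y)        # still exactly fits the training data
assert np.allclose(w, w_min, atol=1e-6)  # ended at the min-norm solution
```

The point of the sketch: every iterate has identical training loss, yet the trajectory ends at the smallest-norm point of the zero-loss set, which is the kind of motion weight decay induces.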
Have you seen the Omnigrok work? It argues that weight norm is directly related to grokking:
That being said, it’s possible that both group composition tasks (like the mod add stuff) and MNIST are pretty special datasets, in that generalizing solutions have small weight norm and memorizing solutions have large weight norm. It might be worth constructing tasks where generalizing solutions have large weight norm, and seeing what happens.
I think Omnigrok looked at enough tasks (MNIST, group composition, IMDb reviews, molecule polarizability) to suggest that the weight norm is an important ingredient and not just a special case / cherry-picking.
That said, I still think there’s a good chance it isn’t the whole story. I’d love to explore a task that generalizes at large weight norms, but it isn’t obvious to me that you can straightforwardly construct such a task.
Figure 7 from https://arxiv.org/abs/2301.05217 makes a similar point, though less strongly: