I think Omnigrok looked at enough tasks (MNIST, group composition, IMDb reviews, molecule polarizability) to suggest that the weight norm is an important ingredient and not just a special case / cherry-picking.
That said, I still think there’s a good chance it isn’t the whole story. I’d love to explore a task that generalizes at large weight norms, but it isn’t obvious to me that you can straightforwardly construct such a task.
I think Omnigrok looked at enough tasks (MNIST, group composition, IMDb reviews, molecule polarizability) to suggest that the weight norm is an important ingredient and not just a special case / cherry-picking.
That said, I still think there’s a good chance it isn’t the whole story. I’d love to explore a task that generalizes at large weight norms, but it isn’t obvious to me that you can straightforwardly construct such a task.