We then record the training loss, weight norm, and estimated LLC of all models at the end of training.
For what it’s worth, Edmund Lau has some experiments where he’s able to get models to grok and find solutions with very large weight norm (but small LLC). I am not tracking the grokking literature closely enough to know how surprising/interesting this is.
We haven’t seen that empirically with the usual regularization methods, so I assume there must be something special going on with the training setup.
I wonder if this phenomenon is partially explained by scaling up the embedding and scaling down the unembedding by a factor (or vice versa). That should leave the LLC constant, but will change the L2 norm.
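To make the rescaling point concrete, here is a minimal sketch (my own toy example, not from Lau's setup): a model whose output depends only on the product of an embedding matrix `W_E` and an unembedding matrix `W_U`. Scaling `W_E` by a factor `c` and `W_U` by `1/c` leaves the outputs, and hence the loss landscape around the solution (and so the LLC), unchanged, while the L2 norm of the weights can be made arbitrarily large. A real transformer has nonlinearities in between, so this exact symmetry need not hold there; the toy just illustrates the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 16, 4

# Hypothetical toy "network": embed, then unembed (purely linear).
W_E = rng.normal(size=(d_vocab, d_model))
W_U = rng.normal(size=(d_model, d_vocab))

def logits(W_E, W_U, x):
    return x @ W_E @ W_U

def l2_norm(*weights):
    return np.sqrt(sum((W ** 2).sum() for W in weights))

x = rng.normal(size=(8, d_vocab))

for c in [1.0, 10.0, 100.0]:
    # Rescale embedding up by c, unembedding down by c: the product W_E @ W_U
    # (and therefore the function and its loss) is unchanged, but the L2 norm grows.
    W_E_c, W_U_c = c * W_E, W_U / c
    same_fn = np.allclose(logits(W_E, W_U, x), logits(W_E_c, W_U_c, x))
    print(f"c={c:6.1f}  identical outputs: {same_fn}  L2 norm: {l2_norm(W_E_c, W_U_c):.2f}")
```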