We then record the training loss, weight norm, and estimated LLC of all models at the end of training.
For what it’s worth, Edmund Lau has some experiments where he’s able to get models to grok and find solutions with very large weight norm (but small LLC). I am not tracking the grokking literature closely enough to know how surprising/interesting this is.
We haven’t seen that empirically with the usual regularization methods, so I assume there must be something special going on with the training setup.
I wonder if this phenomenon is partially explained by scaling up the embedding and scaling down the unembedding by a factor (or vice versa). That should leave the LLC constant, but will change the L2 norm.
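To make the rescaling point concrete, here is a minimal sketch (my own toy example, not from Lau's setup): a model whose output depends only on the product of an embedding matrix `W_E` and an unembedding matrix `W_U`. Scaling `W_E` by a factor `c` and `W_U` by `1/c` leaves the outputs, and hence the loss landscape around the solution (and so the LLC), unchanged, while the L2 norm of the weights can be made arbitrarily large. A real transformer has nonlinearities in between, so this exact symmetry need not hold there; the toy just illustrates the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 16, 4

# Hypothetical toy "network": embed, then unembed (purely linear).
W_E = rng.normal(size=(d_vocab, d_model))
W_U = rng.normal(size=(d_model, d_vocab))

def logits(W_E, W_U, x):
    return x @ W_E @ W_U

def l2_norm(*weights):
    return np.sqrt(sum((W ** 2).sum() for W in weights))

x = rng.normal(size=(8, d_vocab))

for c in [1.0, 10.0, 100.0]:
    # Rescale embedding up by c, unembedding down by c: the product W_E @ W_U
    # (and therefore the function and its loss) is unchanged, but the L2 norm grows.
    W_E_c, W_U_c = c * W_E, W_U / c
    same_fn = np.allclose(logits(W_E, W_U, x), logits(W_E_c, W_U_c, x))
    print(f"c={c:6.1f}  identical outputs: {same_fn}  L2 norm: {l2_norm(W_E_c, W_U_c):.2f}")
```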