We haven’t seen that empirically with the usual regularization methods, so I assume there must be something special going on with the training setup.
I wonder if this phenomenon is partially explained by scaling up the embedding and scaling down the unembedding by a constant factor (or vice versa). That should leave the LLC constant, but it will change the L2 norm.
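A minimal sketch of what I mean, assuming a toy two-matrix model (the names `W_E`, `W_U`, and the factor `alpha` are just illustrative, and the rescaling symmetry is only exact here because there is no nonlinearity between them): the computed function, and hence the LLC, is unchanged, but the L2 norm moves.

```python
import numpy as np

rng = np.random.default_rng(0)
d_vocab, d_model = 50, 16

# Toy "model": logits = x @ W_E @ W_U. With no nonlinearity in between,
# scaling W_E up and W_U down by the same factor is an exact symmetry.
W_E = rng.normal(size=(d_vocab, d_model))
W_U = rng.normal(size=(d_model, d_vocab))
x = rng.normal(size=(8, d_vocab))

alpha = 3.0
W_E_scaled = alpha * W_E   # scale the embedding up ...
W_U_scaled = W_U / alpha   # ... and the unembedding down

logits = x @ W_E @ W_U
logits_scaled = x @ W_E_scaled @ W_U_scaled
print(np.allclose(logits, logits_scaled))  # True: same function

l2 = np.sum(W_E**2) + np.sum(W_U**2)
l2_scaled = np.sum(W_E_scaled**2) + np.sum(W_U_scaled**2)
print(l2, l2_scaled)  # different L2 norms
```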