Embedding norm is a proxy that conflates many factors; you'd want to run ablations rather than treat it as conclusive.
Also, the unused tokens → weight decay argument assumes the embeddings had decoupled weight decay applied and weren't tied to the LM head (i.e. no input-output tying). Does the model card specify any of this? If not, we can't assume it.
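To make the assumption concrete, here's a rough sketch of what the unused tokens → weight decay story predicts. It assumes decoupled (AdamW-style) decay, a constant learning rate, and an untied embedding row that never receives gradient. The function name and all numbers are made up for illustration, not taken from any model card.

```python
import math


def decayed_norm(initial_norm, lr, weight_decay, steps):
    """Norm of an embedding row that only ever sees decoupled weight decay.

    With zero gradient, each step multiplies the row by (1 - lr * weight_decay),
    so the norm shrinks geometrically. Assumes a constant learning rate.
    """
    return initial_norm * (1.0 - lr * weight_decay) ** steps


# Hypothetical hyperparameters, purely illustrative.
norm = decayed_norm(initial_norm=1.0, lr=1e-3, weight_decay=0.1, steps=100_000)

# Geometric decay is close to exp(-lr * weight_decay * steps) for small lr * weight_decay.
approx = math.exp(-1e-3 * 0.1 * 100_000)
print(norm, approx)
```

If the embeddings are tied to the LM head, that row also gets gradient through the output softmax at every position, so the zero-gradient premise breaks and this formula no longer applies. Same if the embedding matrix was excluded from weight decay altogether.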