1stuserhere comments on What GPT-oss Leaks About OpenAI’s Training Data

1stuserhere 26 Sep 2025 9:59 UTC
1 point
Embedding norm is a proxy that conflates many factors; you’d want to run ablations rather than treat it as conclusive.

Also, the unused tokens → weight decay inference assumes that the embeddings had decoupled weight decay applied to them and weren’t tied to the LM head (i.e. no input-output tying). Does the model card specify either of these? Otherwise we can’t assume so.
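
To make the dependence on those assumptions concrete, here is a rough sketch of what the weight-decay story predicts. It assumes an untied input embedding row for a token that never appears in training (so its gradient is exactly zero), AdamW-style decoupled decay, and a constant learning rate. The function names and the hyperparameter values are hypothetical, picked only for illustration, and are not taken from the model card.

```python
import math


def norm_after_decoupled_decay(init_norm, lr, weight_decay, steps):
    """Norm of an embedding row that gets zero gradient for `steps` updates.

    With decoupled decay each step multiplies the row by (1 - lr * weight_decay),
    so the norm shrinks geometrically. Assumes a constant learning rate.
    """
    return init_norm * (1.0 - lr * weight_decay) ** steps


def steps_to_shrink(ratio, lr, weight_decay):
    """Number of zero-gradient steps needed to scale the norm by `ratio` (0 < ratio < 1)."""
    return math.log(ratio) / math.log(1.0 - lr * weight_decay)


if __name__ == "__main__":
    # Hypothetical hyperparameters, for illustration only.
    lr, wd, steps = 3e-4, 0.1, 100_000
    print(norm_after_decoupled_decay(1.0, lr, wd, steps))
    print(steps_to_shrink(0.5, lr, wd))
```

If the embedding is tied to the LM head, the row still receives gradient through the softmax even when the token never shows up as input, so the zero-gradient premise fails. If decay is excluded for embeddings, the multiplicative factor is 1 and nothing shrinks. Either way the clean geometric shrinkage above no longer applies.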