Hi, I’m Kunvar. I enjoy science and engineering. I’m currently studying the structures learned by trained neural networks and trying to understand what algorithms are implemented by these networks for various tasks.
You can find my various social profiles here, my personal website here, and reach out to me at kunvar@mechinterp.com.
Embedding norm is a proxy that conflates many factors, so you’d want to run ablations rather than treat it as conclusive.
Also, the unused tokens → weight decay argument assumes the embeddings had decoupled decay applied and weren’t tied to the LM head (no input-output tying). Does the model card specify details on this? Otherwise we can’t assume so.
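To make the assumption concrete, here is a rough toy sketch in plain Python, not anything taken from the model in question. It assumes an SGD-style update with decoupled weight decay, and the names `decayed_norm`, `simulate_row`, and `toy_tied_grad` are made up for illustration. With zero gradient (untied, never-seen token) the row norm shrinks geometrically. With tying, the same row still gets gradient through the output side, so that simple picture no longer holds. The constant gradient used for the tied case is purely a stand-in.

```python
import math


def decayed_norm(initial_norm, lr, weight_decay, steps):
    """Closed-form norm of a row that only ever sees decoupled weight decay.

    Each step multiplies the row by (1 - lr * weight_decay).
    """
    return initial_norm * (1.0 - lr * weight_decay) ** steps


def simulate_row(row, lr, weight_decay, steps, grad_fn=None):
    """SGD-style update with decoupled weight decay on one embedding row.

    grad_fn is a hypothetical callable returning the gradient for the row;
    None means the row never receives gradient (untied, unused token).
    """
    row = list(row)
    for _ in range(steps):
        grad = grad_fn(row) if grad_fn is not None else [0.0] * len(row)
        # Gradient step and decay step are applied separately (decoupled).
        row = [w - lr * g - lr * weight_decay * w for w, g in zip(row, grad)]
    return row


def l2_norm(row):
    return math.sqrt(sum(w * w for w in row))


def toy_tied_grad(row):
    """Stand-in for the gradient a tied row would get via the LM head.

    A constant vector is an assumption made only to show a nonzero gradient.
    """
    return [0.5, -0.5]


start = [3.0, 4.0]
untied = simulate_row(start, lr=0.1, weight_decay=0.1, steps=100)
tied = simulate_row(start, lr=0.1, weight_decay=0.1, steps=100,
                    grad_fn=toy_tied_grad)

print("untied norm:", l2_norm(untied))
print("closed form:", decayed_norm(l2_norm(start), 0.1, 0.1, 100))
print("tied norm:  ", l2_norm(tied))
```

The untied row matches the closed form, while the tied row ends up somewhere else entirely, which is why the training details matter before reading anything into small norms.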