Hdot comments on Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream

Hdot 10 Sep 2024 9:35 UTC
1 point
0
Interesting find! Would this be resolved by just using layer normalisation to normalise the activations along the channel dimension? That way we could keep our adaptive learning rates but smooth out the distribution of activations and weights.
- zeni 20 May 2026 0:02 UTC
  1 point
  0
  You can prevent the absolute scale from blowing up, but I don’t think layer norm fixes the non-normal distributional statistics within channels.
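
A rough way to check the point above on toy data, not on the model from the post: the sketch below builds synthetic "residual stream" activations where one channel is heavy-tailed (Laplace) and the rest are Gaussian, scales everything up by 100, then applies a parameter-free layer norm across channels for each token. Everything here is an assumption made for illustration: the names `layer_norm` and `excess_kurtosis`, the sizes, the scale factor, and the choice of a Laplace channel as a stand-in for a privileged basis direction.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token (row) across its channels; no learned gain or bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def excess_kurtosis(x):
    # Per-channel (column) excess kurtosis; about 0 for a Gaussian channel.
    centered = x - x.mean(axis=0, keepdims=True)
    m2 = (centered ** 2).mean(axis=0)
    m4 = (centered ** 4).mean(axis=0)
    return m4 / m2 ** 2 - 3.0

rng = np.random.default_rng(0)
n_tokens, d_model = 50_000, 64  # hypothetical sizes

# Toy activations: Gaussian channels, plus one heavy-tailed channel (index 0).
acts = rng.standard_normal((n_tokens, d_model))
acts[:, 0] = rng.laplace(loc=0.0, scale=1.0, size=n_tokens)
acts *= 100.0  # mimic a large absolute scale

normed = layer_norm(acts)
kurt_before = excess_kurtosis(acts)
kurt_after = excess_kurtosis(normed)

print("max |activation| before:", np.abs(acts).max())
print("max |activation| after: ", np.abs(normed).max())
print("channel 0 excess kurtosis before/after:", kurt_before[0], kurt_after[0])
print("median excess kurtosis of other channels after:", np.median(kurt_after[1:]))
```

In this toy setup the per-token normalisation bounds the magnitudes (no entry can exceed sqrt(d_model - 1) after normalising), while the heavy-tailed channel stays clearly non-Gaussian. That is only a sanity check of the argument on synthetic data; whether the same holds for the Adam-trained transformer in the post would need to be measured on its actual activations.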