current understanding of optimization
high curvature directions (hessian eigenvectors with high eigenvalue) want small lrs. low curvature directions want big lrs
if the lr in a direction is too small, it takes forever to converge. if the lr is too big, it diverges by oscillating with increasing amplitude
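this is easy to see on a toy 1-d quadratic (made-up numbers, just a sketch): each GD step multiplies x by (1 - lr * lam), so it converges iff lr < 2 / lam, crawls when lr is tiny, and blows up by sign-flipping when lr is past the threshold:

```python
# toy quadratic f(x) = 0.5 * lam * x**2: each GD step multiplies x by
# (1 - lr * lam), so it converges iff |1 - lr * lam| < 1, i.e. lr < 2 / lam
def gd(lam, lr, steps=50, x0=1.0):
    x = x0
    for _ in range(steps):
        x -= lr * lam * x  # gradient of f is lam * x
    return x

lam = 10.0
print(abs(gd(lam, lr=0.001)))  # way below 2/lam: barely moves after 50 steps
print(abs(gd(lam, lr=0.19)))   # just under 2/lam = 0.2: converges fast
print(abs(gd(lam, lr=0.21)))   # just over 2/lam: oscillates with growing amplitude
```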
momentum helps in both regimes: if your lr is too small, accumulating successive gradients moves you up to 1/(1-beta) times faster; if your lr is too big, the successive oscillating updates partially cancel each other out. this makes high curvature directions more ok with larger lrs and low curvature directions more ok with smaller lrs, improving conditioning
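a minimal heavy-ball sketch of the second effect (toy quadratic, made-up numbers): at an lr where plain GD diverges, the momentum buffer averages the sign-flipping gradients so the updates cancel and training survives:

```python
# heavy-ball momentum on f(x) = 0.5 * lam * x**2; beta=0 reduces to plain GD.
# plain GD needs lr < 2/lam, but heavy-ball is stable up to roughly
# 2*(1 + beta)/lam, because the buffer averages the oscillating gradients.
# (in the low-curvature regime the buffer instead builds up toward
# g/(1 - beta), i.e. up to 10x faster motion with beta = 0.9.)
def gd_momentum(lam, lr, beta, steps=200, x0=1.0):
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + lam * x  # accumulate the gradient lam * x
        x -= lr * v
    return x

lam = 10.0
print(abs(gd_momentum(lam, lr=0.3, beta=0.0)))  # lr > 2/lam: plain GD diverges
print(abs(gd_momentum(lam, lr=0.3, beta=0.9)))  # same lr with momentum: converges
```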
high curvature directions also have bigger gradients. this is the opposite of what we want: in a perfect world higher curvature directions would get smaller steps (natural gradient does this but it’s usually too expensive). the adam second moment / rmsprop helps because it rescales each gradient to roughly unit size, so gradients no longer grow with curvature, which is sorta halfway right
(applied per param rather than per eigenvector, though, so it only helps insofar as the hessian eigenbasis roughly lines up with the parameter axes)
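a small rmsprop-style sketch of this (toy 2-d quadratic, made-up numbers): dividing each coordinate's gradient by the running RMS of its own past gradients makes the steep and shallow directions take similarly sized steps despite a huge condition number:

```python
import numpy as np

# rmsprop-style preconditioning on f(x) = 0.5 * sum(lams * x**2):
# each coordinate's gradient is divided by the RMS of its own history,
# so both directions make progress with one shared lr.
def rmsprop(lams, lr=0.01, beta2=0.99, steps=2000, eps=1e-8):
    x = np.ones_like(lams)
    v = np.zeros_like(lams)
    for _ in range(steps):
        g = lams * x                       # per-coordinate gradient
        v = beta2 * v + (1 - beta2) * g**2
        x -= lr * g / (np.sqrt(v) + eps)   # per param, not per eigenvector
    return x

lams = np.array([100.0, 0.01])  # condition number 10^4
# plain GD with one lr small enough for the steep direction (lr < 0.02)
# would move the shallow one by ~1e-4 per step; rmsprop moves both.
print(np.abs(rmsprop(lams)))
```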
in real NNs, edge of stability means a too-high lr is even more fine: the max curvature increases throughout training until it reaches the critical value 2/lr where GD would diverge, but instead of diverging all the way, the oscillations along the top eigenvector somehow push the model into a slightly lower curvature region, so it stabilizes right at the edge of stability.
for Adam, these oscillations also pump up the second moment, which lowers the preconditioned max curvature without affecting the raw curvature. so the raw max curvature can just keep increasing for Adam, whereas for SGD it plateaus at the edge (though apparently there’s also a regime where Adam jumps into a region with low raw max curvature too)
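the Adam point can be sketched with hypothetical numbers: the optimizer effectively sees curvature lam / sqrt(v), so a growing second moment v pulls the preconditioned curvature back under the GD stability threshold (lr * curvature < 2) even though the raw lam never changed:

```python
import math

# hypothetical numbers: raw max curvature lam and lr are fixed, but as
# oscillations inflate the second-moment estimate v, the curvature the
# optimizer actually "sees" (lam / sqrt(v)) drops back into the stable
# regime — the raw lam is free to keep growing.
lam, lr = 10.0, 0.3
for v in [1.0, 4.0, 25.0]:
    precond = lam / math.sqrt(v)
    print(f"v={v}: preconditioned curvature {precond:.1f}, stable={lr * precond < 2}")
```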
papers
https://distill.pub/2017/momentum/ really cool momentum explainer
https://arxiv.org/abs/2103.00065 - edge of stability
https://arxiv.org/abs/2207.14484 - edge of stability for adam
What does “the lr” mean in this context?
learning rate