Second-order and related optimizers. Sophia reports halving the number of steps and total FLOPs on GPT-style pre-training, while Lion reports up to 5× savings on JFT; I count this conservatively as a 1.5–2× saving.
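For context on what these claims are about, here's a rough NumPy sketch of the two update rules as I understand the papers' pseudocode: Lion takes the sign of an interpolation between the gradient and momentum, while Sophia preconditions momentum by a diagonal Hessian estimate (refreshed only every few steps, which is omitted here) with per-coordinate clipping. The hyperparameter defaults are placeholders, not the papers' tuned values.

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.0):
    """One Lion update: step in the direction of the sign of an
    interpolated momentum/gradient, plus decoupled weight decay."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + wd * theta)
    m = beta2 * m + (1 - beta2) * grad   # momentum updated after the step
    return theta, m

def sophia_step(theta, m, h, grad, lr=1e-4, beta1=0.96, rho=0.04, eps=1e-12):
    """One Sophia update: momentum divided by a diagonal Hessian
    estimate h (assumed refreshed elsewhere), clipped per coordinate."""
    m = beta1 * m + (1 - beta1) * grad
    precond = m / np.maximum(rho * h, eps)
    theta = theta - lr * np.clip(precond, -1.0, 1.0)
    return theta, m
```

The point of the sketch is just that neither rule is exotic to implement; the contested part is whether the reported step/FLOP savings hold up at large scale.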
I think it’s kinda commonly accepted wisdom that the right heuristic for optimizers claiming savings like this is “they’re probably bullshit,” at least until they get used in big training runs.
Like I don’t have a specific source for this, but a lot of optimizers claiming big savings are out there and few get adopted.