Aren’t we leaving performance on the table? Yes! We are. But I think that’s fine! There’s always a tradeoff here. E.g. quantization: it’s strictly worse to use lower precision! But we do it to optimize the TCO of the system.
But we can use $INSERT_TECHNIQUE to make models cheaper! Yes, but the gains should stack across all of these (distillation, quantization, etc.). So we should be using all of those techniques to make our models easier to serve, and also training them longer.
If you’re training an LLM with the goal of deploying it to users, you should prefer training a smaller model well into the diminishing-returns part of the loss curve.
To reiterate my points from Twitter: Timbers is answering a question that is irrelevant and not the one he claims to be answering (and I am annoyed that some throwaway comments at the end are still all the acknowledgement he gives to the fact that the whole post is irrelevant). No one who cares about inference cares about the size of the raw trained model; they care about the size of the best model they can obtain which fits in their inference envelope, and that may be obtainable from a much larger raw model, since those models will reach much lower loss (by definition) and thus can be a big win even after some modest damage from the $INSERT_TECHNIQUE. (If you can train the compute-optimal model to 10% lower loss than the overtrained small model, and then lose 1-2% to quantization, well, that’s still a net win of ~8% for free by ignoring his advice.)
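To make that back-of-the-envelope concrete, here is a minimal sketch of the comparison using the Chinchilla parametric loss fit L(N, D) = E + A/N^α + B/D^β with the (approximate) constants from Hoffmann et al. 2022. The training budget, the “4x fewer parameters on 4x more tokens” overtraining recipe, and the 1.5% relative quantization penalty are all hypothetical illustration values, not measurements; the point is the shape of the calculation, not the exact numbers.

```python
# Sketch only: Chinchilla parametric fit L(N, D) = E + A/N^alpha + B/D^beta,
# with approximate constants reported by Hoffmann et al. 2022.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal(flops: float) -> tuple[float, float]:
    """Loss-minimizing (params, tokens) split of a budget, using C ~ 6*N*D."""
    k = flops / 6.0  # token-parameter product allowed by the budget
    n = ((ALPHA * A) / (BETA * B)) ** (1 / (ALPHA + BETA)) * k ** (BETA / (ALPHA + BETA))
    return n, k / n

budget = 1e21                      # hypothetical training budget in FLOPs
n_opt, d_opt = compute_optimal(budget)
l_opt = loss(n_opt, d_opt)

# Hypothetical "smaller model, trained longer" plan at the same budget:
# 4x fewer parameters on 4x more tokens.
l_small = loss(n_opt / 4, d_opt * 4)

# Hypothetical 1.5% relative loss degradation from quantizing the large model.
l_opt_quantized = l_opt * 1.015

print(f"compute-optimal:           N={n_opt:.2e}, D={d_opt:.2e}, loss={l_opt:.3f}")
print(f"overtrained small model:   loss={l_small:.3f}")
print(f"quantized compute-optimal: loss={l_opt_quantized:.3f}")
```

Under these particular assumptions the quantized compute-optimal model still ends up with lower predicted loss than the overtrained small model; whether that ordering survives a bigger quantization penalty or a different overtraining factor is exactly the kind of calculation the post would need to do.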
Timbers continues to ignore the many scaling papers on sparsification, distillation, and quantization, which are not hard to find and have been much discussed among those who care about TCO. So his conclusion is at best unproven: it is not obvious that you should prefer training a small model far beyond compute-optimal over training a large compute-optimal model and then applying (often extremely easy) optimizations to it like quantization. If he were doing calculations on that, and even going beyond that to consider questions about continual learning per jcannell, or how well the model tolerates downstream finetuning or RLHF training (presumably larger = better, so that has to be considered too), that would be interesting. But he’s not.
https://arxiv.org/abs/2002.11794