What does the Chinchilla scaling laws paper (overtraining small models) have to do with distilling larger models? It’s about optimizing the performance of your best model, not inference costs. The compute-optimal small model would presumably be a better thing to distill, since the final quality is higher.
“Overtraining” isn’t Chinchilla; Chinchilla is just “training”. The overtraining being advocated was supra-Chinchilla, and the logic was that, sure, you were departing from compute-optimal training, but you were more than making up for it with the compute savings in the deployment phase, which the Chinchilla scaling laws do not address in any way. So there was a fad for training small models for a lot longer.
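To make that trade-off concrete, here is a rough back-of-the-envelope sketch using the parametric loss fit from the Chinchilla paper. The constants are the commonly quoted approximate values from Hoffmann et al. 2022, the 6ND training-FLOPs and 2N inference-FLOPs-per-token figures are the usual crude approximations, and the fixed budget and the 4x smaller / 4x longer split are arbitrary choices for illustration, not anything from the paper:

```python
# Rough sketch of the training-vs-inference trade-off behind "overtraining".
# Parametric loss fit from Hoffmann et al. 2022 (approximate published values).

A, B, E = 406.4, 410.7, 1.69
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Predicted pretraining loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

def train_flops(N, D):
    return 6 * N * D          # usual ~6ND approximation

def infer_flops_per_token(N):
    return 2 * N              # usual ~2N FLOPs per generated token

C = 1e23                      # fixed training compute budget (FLOPs), arbitrary

# Chinchilla-style compute-optimal split: roughly 20 tokens per parameter,
# so C = 6 * N * (20 * N) and N = sqrt(C / 120).
N_opt = (C / 120) ** 0.5
D_opt = 20 * N_opt

# "Overtrained" alternative: 4x smaller model, 4x the tokens, same training compute.
N_small = N_opt / 4
D_small = 4 * D_opt

print(f"compute-optimal: N={N_opt:.3g}, D={D_opt:.3g}, "
      f"loss={loss(N_opt, D_opt):.3f}, "
      f"{infer_flops_per_token(N_opt):.3g} FLOPs/token at inference")
print(f"overtrained:     N={N_small:.3g}, D={D_small:.3g}, "
      f"loss={loss(N_small, D_small):.3f}, "
      f"{infer_flops_per_token(N_small):.3g} FLOPs/token at inference")
```

With these particular numbers the two models come out at essentially the same predicted loss (~2.01), but the overtrained one costs roughly a quarter as much per generated token to serve, which was the whole pitch for going supra-Chinchilla.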