Yeah, I’m also surprised by it. I have two hypotheses, but it could be for other reasons I’m missing. One is that we kept temperature=1 for the KL divergence, and using a different temperature might matter for distilling faster. The second is that we undertrained the pretrained models: pretraining was relatively short, while distillation took around the same amount of time. I’m not really sure, though.
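For reference, here’s a minimal sketch of what a temperature-scaled KL distillation loss looks like (the standard Hinton-style recipe, not our exact code; I’m assuming PyTorch, and `student_logits` / `teacher_logits` are placeholders). With temperature=1 it reduces to the plain KL we used:

```python
import torch.nn.functional as F

def distillation_kl(student_logits, teacher_logits, temperature: float = 1.0):
    # Soften both distributions with the temperature before computing KL.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    # The T^2 factor keeps soft-target gradients on a comparable scale
    # as the temperature changes.
    return kl * temperature ** 2
```

Sweeping the temperature above 1 softens the teacher’s distribution and exposes more of its relative preferences over wrong classes, which is the usual argument for why it can speed up distillation.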