Yes and yes, basically. Although, to be clear: (i) “according to the teacher” should be “according to the loss used to obtain the teacher,” (ii) the theorem deals with the case of directly distilling on logits, whereas our LLM experiments involve sampling according to the teacher’s logits (which introduces noise), and (iii) the theorem only applies when you finetune on the unmodified teacher distribution—it doesn’t deal with the case where you filter the responses.
Yes and yes, basically. Although, to be clear: (i) “according to the teacher” should be “according to the loss used to obtain the teacher,” (ii) the theorem deals with the case of directly distilling on logits, whereas our LLM experiments involve sampling according to the teacher’s logits (which introduces noise), and (iii) the theorem only applies when you finetune on the unmodified teacher distribution—it doesn’t deal with the case where you filter the responses.