Making up a story for self-distillation is interesting. Does regularization alone on a selected dataset really lead the model to make the “obvious” generalization faster than it loses unrelated, unused capabilities?
E.g. suppose I train on all-caps history facts, and my optimizer is mostly saying “reduce the size of the weights while keeping the predictions the same”. Will it learn to talk in all caps faster than it forgets science facts? If so, why does picking up the all-caps generalization allow a bigger decrease in the weights, while keeping the predictions the same, than holding on to everything else would?
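For concreteness, here is a rough sketch of the setup I have in mind, assuming a PyTorch-style model whose forward pass returns logits; the dataloader (standing in for “all-caps history facts”) and the hyperparameters are placeholders, not anyone’s actual recipe:

```python
import copy
import itertools

import torch
import torch.nn.functional as F


def self_distill(model, dataloader, steps=1000, lr=1e-4, weight_decay=0.1):
    """Distill `model` into itself on a narrow dataset, with weight decay.

    The KL term says "keep the prediction the same as the frozen teacher";
    AdamW's weight_decay says "reduce the size of the weights". The question
    is whether this pressure finds the all-caps generalization before it
    erodes capabilities (science facts) that never appear in `dataloader`.
    """
    teacher = copy.deepcopy(model).eval()  # frozen copy of the model provides the targets
    for p in teacher.parameters():
        p.requires_grad_(False)

    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    batches = itertools.cycle(dataloader)  # e.g. tokenized all-caps history facts

    for _ in range(steps):
        batch = next(batches)
        with torch.no_grad():
            teacher_logits = teacher(batch)  # assumes the forward pass returns logits
        student_logits = model(batch)

        # "Keep the prediction the same": match the frozen teacher's output distribution.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        opt.zero_grad()
        loss.backward()
        opt.step()  # decoupled weight decay shrinks the weights here

    return model
```

The question is then whether, after enough of these steps, the weight-decay pressure moves the model toward “answer everything in all caps” before it degrades predictions on held-out science facts that the KL term never touches.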