Making up a story for self-distillation is interesting. Does regularization alone on a selected dataset really lead the model to make the “obvious” generalization faster than it loses unrelated, unused capabilities?
E.g. suppose I train on all-caps history facts, and my optimizer is mostly saying “reduce the size of the weights while keeping the predictions the same”. Will it learn to talk in all caps faster than it forgets science facts? If so, why does picking up the all-caps generalization allow a bigger decrease in the weights, while keeping the predictions the same, than holding on to everything else would?
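For concreteness, here is a rough sketch of the setup I have in mind, assuming a PyTorch-style model whose forward pass returns logits; the dataloader (standing in for “all-caps history facts”) and the hyperparameters are placeholders, not anyone’s actual recipe:

```python
import copy
import itertools

import torch
import torch.nn.functional as F


def self_distill(model, dataloader, steps=1000, lr=1e-4, weight_decay=0.1):
    """Distill `model` into itself on a narrow dataset, with weight decay.

    The KL term says "keep the prediction the same as the frozen teacher";
    AdamW's weight_decay says "reduce the size of the weights". The question
    is whether this pressure finds the all-caps generalization before it
    erodes capabilities (science facts) that never appear in `dataloader`.
    """
    teacher = copy.deepcopy(model).eval()  # frozen copy of the model provides the targets
    for p in teacher.parameters():
        p.requires_grad_(False)

    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    batches = itertools.cycle(dataloader)  # e.g. tokenized all-caps history facts

    for _ in range(steps):
        batch = next(batches)
        with torch.no_grad():
            teacher_logits = teacher(batch)  # assumes the forward pass returns logits
        student_logits = model(batch)

        # "Keep the prediction the same": match the frozen teacher's output distribution.
        loss = F.kl_div(
            F.log_softmax(student_logits, dim=-1),
            F.softmax(teacher_logits, dim=-1),
            reduction="batchmean",
        )
        opt.zero_grad()
        loss.backward()
        opt.step()  # decoupled weight decay shrinks the weights here

    return model
```

The question is then whether, after enough of these steps, the weight-decay pressure moves the model toward “answer everything in all caps” before it degrades predictions on held-out science facts that the KL term never touches.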