Quick thought: I wonder if you could still be overestimating parameter counts for models made by distilling a larger teacher down to a smaller student. Any OSS ones you could test this hypothesis on?
Properly done, the methodology should find that sufficiently overtrained low-parameter models ~= distilled low-parameter models, since there isn't more capacity to memorize. But yeah, that would be another good sanity check to run.
Wait, why are distilled models better than just overtraining the small model, again? My guess is it's mainly because SFT >> RL for efficiency, and cloning good CoTs is easier than sampling them via random exploration.
re: distilled > overtrained, you can distill via on-policy distillation (OPD) with the strong model as the teacher, get dense supervision (which you wouldn't get with RLVR), and get generalization gains (because of the on-policy nature of OPD vis-à-vis SFT, which is a lot more ham-handed).
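(A minimal sketch of what that dense, on-policy signal looks like, assuming an HF-style causal-LM interface; the names and hyperparameters are illustrative, not anyone's actual recipe:)

```python
# Sketch of one on-policy distillation step. The HF-style interface
# (`generate`, `.logits`) and all hyperparameters are assumptions for
# illustration, not the setup discussed in this thread.
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids):
    prompt_len = prompt_ids.size(1)

    # 1. On-policy: the *student* samples the completion (the DAgger-ish part).
    with torch.no_grad():
        seq = student.generate(prompt_ids, max_new_tokens=256, do_sample=True)

    # 2. Score the student's own tokens under both models.
    s_logp = F.log_softmax(student(seq).logits[:, :-1], dim=-1)
    with torch.no_grad():
        t_logp = F.log_softmax(teacher(seq).logits[:, :-1], dim=-1)

    # 3. Dense supervision: a full-distribution reverse KL at *every*
    #    completion position, vs. RLVR's single scalar per rollout.
    per_token_kl = (s_logp.exp() * (s_logp - t_logp)).sum(-1)  # KL(student || teacher)
    mask = (torch.arange(per_token_kl.size(1), device=seq.device)
            >= prompt_len - 1).float()
    return ((per_token_kl * mask).sum(dim=1) / mask.sum()).mean()
```

The thing to notice is that the gradient sees the whole teacher distribution at every sampled position, which is what the bits-per-forward-pass framing below is pointing at.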
Yeah, the dense supervision point is what I meant by SFT >> RL for efficiency. You get a bunch more bits per forward pass.
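(Rough back-of-envelope for that claim, with made-up but plausible numbers:)

```python
# Back-of-envelope: supervision bits per rollout under each signal.
# All numbers here are illustrative assumptions, not measurements.
import math

tokens_per_rollout = 1024   # assumed completion length
teacher_entropy_bits = 2.0  # assumed avg teacher entropy per token

# RLVR with a binary verifier: at most one bit per finished rollout.
rlvr_bits = math.log2(2)

# Distillation: a full teacher distribution at every position; even
# counting only the entropy it carries per token, that's roughly:
distill_bits = tokens_per_rollout * teacher_entropy_bits

print(f"RLVR: {rlvr_bits:.0f} bit/rollout vs distillation: ~{distill_bits:.0f} bits/rollout")
```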
The on-policy distillation/DAgger > SFT/behavioral cloning gap seems like a smaller improvement in comparison to that, but you're right that it is an improvement.
Similarly, I'd be interested in whether there's a difference between Western and Chinese models in this regard. It's been a long-held belief that Chinese models tend to rely more on distilling frontier models, so how do the estimated parameter counts for known Qwen models compare to those for, say, Gemma 4? Likewise, is there a difference between the dense and the almost-equally-sized MoE models in those respective families?