(Caveat: they initially distill from a much larger model, which I see as a little bit of a cheat.)
Another little bit of a cheat is that they only train Qwen2.5-Math-7B according to the described procedure. For the other three models (all smaller than Qwen2.5-Math-7B), they instead use the fine-tuned Qwen2.5-Math-7B to generate the training data for round 4. (Basically, they distill from DeepSeek in round 1 and then distill from fine-tuned Qwen in round 4.)
They justify:
Due to limited GPU resources, we performed 4 rounds of self-evolution exclusively on Qwen2.5-Math-7B, yielding 4 evolved policy SLMs (Table 3) and 4 PPMs (Table 4). For the other 3 policy LLMs, we fine-tune them using step-by-step verified trajectories generated from Qwen2.5-Math-7B’s 4th round. The final PPM from this round is then used as the reward model for the 3 policy SLMs.
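To make the asymmetry concrete, here's a minimal sketch of how I read the quoted passage; the function bodies, round-1 details, and the smaller-model names are placeholders of mine, not anything from the paper. The point is only who generates the fine-tuning data at each step:

def generate_verified_trajectories(generator: str, reward_model: str) -> str:
    """Stand-in for MCTS rollouts with step-by-step verification."""
    return f"data from {generator} (scored by {reward_model})"

def finetune(base: str, data: str) -> str:
    """Stand-in for supervised fine-tuning on the verified trajectories."""
    return f"{base} tuned on [{data}]"

# Qwen2.5-Math-7B gets the full 4-round "self-evolution" loop.
policy, ppm = None, None
for rnd in range(1, 5):
    # Round 1 bootstraps from DeepSeek (the "much larger model" caveat);
    # later rounds use the previous round's fine-tuned Qwen policy.
    generator = "DeepSeek" if rnd == 1 else policy
    data = generate_verified_trajectories(generator, ppm or "no PPM yet (round 1)")
    policy = finetune("Qwen2.5-Math-7B", data)
    ppm = f"PPM-r{rnd}"

qwen_r4, ppm_r4 = policy, ppm

# The other three models skip the loop entirely: one fine-tune on data generated
# by Qwen2.5-Math-7B-r4 and scored by its round-4 PPM. Their "round 4" is really
# a distillation step from the 7B model, not their own self-evolution.
for small in ["smaller_model_1", "smaller_model_2", "smaller_model_3"]:
    data = generate_verified_trajectories(qwen_r4, ppm_r4)
    print(finetune(small, data))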
TBH I’m not sure how this saves them GPU resources. Why would it be cheaper to generate a lot of big/long rollouts with Qwen2.5-Math-7B-r4 than to do it three times with [smaller model]-r3?
It doesn’t make sense to me either, but it does seem to invalidate the “bootstrapping” results for the other 3 models. Maybe it’s because they could batch all reward model requests into one instance.
When MS doesn’t have enough compute to do their evals, the rest of us may struggle!