(Caveat: they initially distill from a much larger model, which I see as a little bit of a cheat.)
Another little bit of a cheat is that they only train Qwen2.5-Math-7B according to the described procedure. For the other three models (all smaller than Qwen2.5-Math-7B), they instead use the fine-tuned Qwen2.5-Math-7B to generate the training data for round 4. (Basically, they distill from DeepSeek in round 1 and then distill from fine-tuned Qwen in round 4.)
They justify:
Due to limited GPU resources, we performed 4 rounds of self-evolution exclusively on Qwen2.5-Math-7B, yielding 4 evolved policy SLMs (Table 3) and 4 PPMs (Table 4). For the other 3 policy LLMs, we fine-tune them using step-by-step verified trajectories generated from Qwen2.5-Math-7B’s 4th round. The final PPM from this round is then used as the reward model for the 3 policy SLMs.
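To make the asymmetry concrete, here's a minimal sketch of how I read the quoted passage; the function bodies, round-1 details, and the smaller-model names are placeholders of mine, not anything from the paper. The point is only who generates the fine-tuning data at each step:

def generate_verified_trajectories(generator: str, reward_model: str) -> str:
    """Stand-in for MCTS rollouts with step-by-step verification."""
    return f"data from {generator} (scored by {reward_model})"

def finetune(base: str, data: str) -> str:
    """Stand-in for supervised fine-tuning on the verified trajectories."""
    return f"{base} tuned on [{data}]"

# Qwen2.5-Math-7B gets the full 4-round "self-evolution" loop.
policy, ppm = None, None
for rnd in range(1, 5):
    # Round 1 bootstraps from DeepSeek (the "much larger model" caveat);
    # later rounds use the previous round's fine-tuned Qwen policy.
    generator = "DeepSeek" if rnd == 1 else policy
    data = generate_verified_trajectories(generator, ppm or "no PPM yet (round 1)")
    policy = finetune("Qwen2.5-Math-7B", data)
    ppm = f"PPM-r{rnd}"

qwen_r4, ppm_r4 = policy, ppm

# The other three models skip the loop entirely: one fine-tune on data generated
# by Qwen2.5-Math-7B-r4 and scored by its round-4 PPM. Their "round 4" is really
# a distillation step from the 7B model, not their own self-evolution.
for small in ["smaller_model_1", "smaller_model_2", "smaller_model_3"]:
    data = generate_verified_trajectories(qwen_r4, ppm_r4)
    print(finetune(small, data))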
TBH I’m not sure how this saves them GPU resources. Why would it be cheaper to generate a lot of big/long rollouts with Qwen2.5-Math-7B-r4 than to do it three times with [smaller model]-r3?
It doesn’t make sense to me either, but it does seem to invalidate the “bootstrapping” results for the other 3 models. Maybe it’s because they could batch all reward model requests into one instance.
When MS doesn’t have enough compute to do their evals, the rest of us may struggle!