I don’t think this is bad for Qwen3, Kimi K-2 and GLM-4.5
I agree! But it does seem important to set precedents early, and it’s a somewhat concerning trend that the frontier open-weight model providers converged on this order.
And it could also be bad to spread your RLHF uniformly throughout RL, because that could mean the last updates are on mostly-RLVF data, which might conflict more with human values than RLHF does.
This also seems right. My understanding is a (simplified) optimal order is:
pre-training → simple instruction training → preference training → RLVF
Letting the model latch on to human values during training, then only later encouraging ruthless task completion.
Is the hope that the model will “training game for good” during RLVF? That seems somewhat risky, especially to the extent that your RLVF rewards are imperfect and somewhat encourage cheating, as it could erode away the goodness.
I think you maybe want to do PT → preference → RLVF → preference, but that doesn’t feel great either.
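To make the orderings being compared concrete, here is a minimal sketch. The stage functions are hypothetical placeholders, not anyone's actual training code or API; the only thing the sketch pins down is the order in which the stages run.

```python
# Hypothetical placeholder stages -- names and signatures are illustrative only.

def pretrain(model):           # next-token prediction on a large corpus
    return model

def instruction_tune(model):   # simple supervised instruction following
    return model

def preference_train(model):   # RLHF / preference-based training on human feedback
    return model

def rlvf(model):               # RL against verifiable rewards (e.g. tests, checkers)
    return model

# Order discussed above: let the model latch on to human values first,
# then encourage task completion last.
PROPOSED_ORDER = [pretrain, instruction_tune, preference_train, rlvf]

# The "sandwich" alternative: a final preference pass after RLVF,
# so the last updates are not on mostly-RLVF data.
SANDWICH_ORDER = [pretrain, preference_train, rlvf, preference_train]

def train(model, stages):
    for stage in stages:
        model = stage(model)
    return model
```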