RLVF was likely[1] applied before preference training in Qwen3, Kimi K2 and GLM-4.5.
This could be bad for two reasons:
1. Extensive RL training could define goals early on and lead to goal-crystallisation, which could reduce the efficacy of subsequent alignment training.
2. Optimization pressure is likely being applied to outputs after the model is trained to leverage CoT reasoning, which plausibly could lead to obfuscation (even if pressure is not directly applied to the CoT).
[1] This order was explicitly stated in Qwen3 and Kimi K2, but for GLM-4.5 this was only implied through section ordering.
> Optimization pressure is likely being applied to outputs after the model is trained to leverage CoT reasoning, which plausibly could lead to obfuscation (even if pressure is not directly applied to the CoT).

I agree this would be concerning for AIs that are good enough at single-forward-pass instrumental reasoning to prevent RLHF from making them “care” a lot about regular human values. I don’t think this is bad for Qwen3, Kimi K2 and GLM-4.5 because I don’t expect them to be good enough at instrumental reasoning. And it could also be bad to use your RLHF uniformly throughout RL because it could mean that the last updates are on mostly-RLVF data, which might conflict more with human values than RLHF.
I don’t understand this. I would have thought you’d prefer to have models learn to leverage CoT reasoning without any pressure to make it look more harmless than it would be by default. I think you are more likely to get harmless-looking but harmful-in-content scratchpads if you include harmlessness training from the beginning of training.
> I would have thought you’d prefer to have models learn to leverage CoT reasoning without any pressure to make it look more harmless than it would be by default.

Yeah, this is interesting. However, I think it’s unclear how HHH training (before any reasoning training) affects the faithfulness of CoT. It seems reasonably likely that, if you apply your HHH training before reasoning and apply no optimization pressure on the CoT as you train reasoning, you end up with scratchpads that are fairly uncensored (since you let your HHH model thoroughly explore the use of its scratchpad without penalization). On the other hand, directly intervening on the external reasoning process to make it more human-friendly seems very likely to obfuscate harmful reasoning traces.
It also looks like providers are already doing some mix of both with cold-start SFT (as noted in the comment below), so it would be nice to know whether one is more harmful than the other with respect to monitorability.
> I don’t think this is bad for Qwen3, Kimi K2 and GLM-4.5

I agree! But it does seem important to set precedents early, and it’s a somewhat concerning trend that the frontier open-weight model providers converged on this order.
> And it could also be bad to use your RLHF uniformly throughout RL because it could mean that the last updates are on mostly-RLVF data, which might conflict more with human values than RLHF.

This also seems right. My understanding is that a (simplified) optimal order is:
pre-training → simple instruction training → preference training → RLVF
That is, letting the model latch on to human values early in training, and only later encouraging ruthless task completion.
Is the hope that the model will “training game for good” during RLVF? That seems somewhat risky, especially to the extent that your RLVF rewards are imperfect and somewhat encourage cheating, as it could erode away the goodness.
I think you maybe want to do PT → preference → RLVF → preference, but that doesn’t feel great either.
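To make the two candidate orderings concrete, here is a minimal sketch in Python (purely illustrative stage names, not any provider’s actual pipeline) that writes each schedule down as data and checks whether any stage puts optimization pressure on the CoT itself:

```python
# Illustrative sketch only: hypothetical stage names, not any lab's real pipeline.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    optimizes_cot: bool  # does this stage apply optimization pressure to the CoT itself?


# "Preference before RLVF" ordering suggested above.
preference_first = [
    Stage("pre-training", optimizes_cot=False),
    Stage("simple instruction training", optimizes_cot=False),
    Stage("preference training (RLHF/RLAIF)", optimizes_cot=False),  # pressure on final outputs only
    Stage("RLVF", optimizes_cot=False),  # verifiable-reward RL, CoT left unoptimized
]

# "Preference sandwich" alternative: a second preference pass after RLVF.
preference_sandwich = preference_first + [
    Stage("preference training (second pass)", optimizes_cot=False),
]


def cot_left_alone(schedule):
    """True if no stage in the schedule directly optimizes the chain of thought."""
    return not any(stage.optimizes_cot for stage in schedule)


for label, schedule in [("preference-first", preference_first),
                        ("preference-sandwich", preference_sandwich)]:
    order = " -> ".join(stage.name for stage in schedule)
    print(f"{label}: {order} | CoT left alone: {cot_left_alone(schedule)}")
```

This only makes the orderings explicit; it doesn’t capture the data-mixture worry above (that the final updates end up mostly on RLVF data), which is about proportions rather than stage order.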
Overview of post-training pipeline for Qwen3:
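As a rough, text-only outline of that pipeline (paraphrased from the Qwen3 technical report; stage names are approximate and worth checking against the report itself):

```python
# Approximate paraphrase of the Qwen3 post-training stages from the Qwen3
# technical report; names and descriptions may be slightly off.
# The "cold-start" SFT discussed in the comment below is Stage 1, and
# preference-style RLH(AI)F, to the extent it appears, sits inside Stage 4.
QWEN3_POST_TRAINING = [
    "Stage 1: long chain-of-thought cold start (SFT)",
    "Stage 2: reasoning RL (rule-based / verifiable rewards)",
    "Stage 3: thinking-mode fusion (SFT mixing thinking and non-thinking data)",
    "Stage 4: general RL (broad tasks, including reward-model / preference signals)",
]

for stage in QWEN3_POST_TRAINING:
    print(stage)
```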
It seems likely that some amount of preference alignment via SFT occurs during the “cold-start” portion of post-training. However, it’s unclear what proportion of the cold-start data this comprises, or how it compares to the amount of preference training via RLH(AI)F during Stage 4.