Thanks for this interesting and important work! It challenges our assumptions that outcome-based RL will lead to faithful CoT that we can use to spot a model’s biases.
> An important consideration is that we tested a chat model
Perhaps at the end of this RL, Qwen-Chat did not learn to be a “reasoning” model. It does not know how to use its long CoT to arrive at better answers.
Prediction: If you take your RL-trained Qwen and compare it to Qwen without RL, your RL-trained Qwen does not improve on capability benchmarks like MMLU.
Perhaps if you started with e.g. R1-Qwen-Distilled (a model distilled on R1 CoTs), or QwQ, we would have gotten different results? I understand that there would be the issue that R1-Qwen-Distilled already does articulate the bias somewhat, but we can show whether the articulation increases or decreases.
> Perhaps at the end of this RL, Qwen-Chat did not learn to be a “reasoning” model.
Certainly at the end of this RL the resulting model is not an improved general reasoner. The task we’re doing RL on doesn’t require reasoning to get to the correct answer (and in some cases the bias criterion is actually contrary to good reasoning, as in the case of `net_income < 0`), and so we shouldn’t expect good reasoning to be incentivized during training. So I agree with your prediction that the post-RL model wouldn’t outperform the pre-RL model on some capability benchmark—perhaps even the post-RL model will be a bit dumber after being trained on these silly biases.
> Perhaps if you started with e.g. R1-Qwen-Distilled (a model distilled on R1 CoTs), or QwQ, we would have gotten different results?
Possibly! The reasoning traces of reasoning models feel a lot more unfiltered/uncensored than those of chat models.
> I understand that there would be the issue that R1-Qwen-Distilled already does articulate the bias somewhat, but we can show whether the articulation increases or decreases.
Your recent work shows that reasoning models more frequently articulate their existing biases (e.g. sycophancy, or deferring to authority). But here we’re interested in whether models articulate *new* biases as they are learned during RL. I’m not sure what the baseline (pre-RL) articulation rates would be for the biases tested here in reasoning models. I’d guess that they are still ~0 for the protected attributes though (e.g. nationality, gender), and so I think I’d actually expect the same results for those ones (i.e. ~0 articulation rate post-RL). For the biases with non-zero articulation rates to begin with, it’s possible that the change in articulation rates is noticeably different for reasoning models, though.