cubefox answers Why not train reasoning models with RLHF?

cubefox 30 Jan 2025 10:31 UTC
12 points
0
It seems they are already doing this with R1, in a secondary reinforcement learning step. From the paper:

2.3.4. Reinforcement Learning for all Scenarios

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
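To make the reward routing in that paragraph concrete, here is a minimal Python sketch of how it could look. This is my own illustration, not code from the paper: the function names, the `"reasoning"`/`"general"` labels, the exact-match rule, and the `</think>` delimiter between reasoning and summary are all assumptions, and the reward models are passed in as plain callables. The paper does not say how the helpfulness and harmlessness scores are weighted, so the sketch returns them separately rather than guessing a combination.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumed delimiter between the reasoning trace and the final summary.
THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_response(response: str) -> Tuple[str, str]:
    """Split a response into (reasoning, summary).

    If the closing delimiter is missing, the whole text is treated as summary.
    """
    head, sep, tail = response.partition(THINK_CLOSE)
    if not sep:
        return "", response.strip()
    return head.replace(THINK_OPEN, "").strip(), tail.strip()


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Toy rule-based reward: 1.0 if the summary exactly matches the reference."""
    _, summary = split_response(response)
    return 1.0 if summary == reference_answer.strip() else 0.0


def score_response(
    prompt_kind: str,
    response: str,
    *,
    reference_answer: Optional[str] = None,
    helpfulness_rm: Optional[Callable[[str], float]] = None,
    harmlessness_rm: Optional[Callable[[str], float]] = None,
) -> Dict[str, float]:
    """Route a response to the reward signal(s) the quoted paragraph describes."""
    if prompt_kind == "reasoning":
        # Reasoning data: rule-based reward, no learned reward model.
        if reference_answer is None:
            raise ValueError("reasoning prompts need a reference answer")
        return {"rule": rule_based_reward(response, reference_answer)}

    if prompt_kind == "general":
        if helpfulness_rm is None or harmlessness_rm is None:
            raise ValueError("general prompts need both reward models")
        _, summary = split_response(response)
        return {
            # Helpfulness is judged on the final summary only.
            "helpfulness": helpfulness_rm(summary),
            # Harmlessness is judged on the entire response, reasoning included.
            "harmlessness": harmlessness_rm(response),
        }

    raise ValueError(f"unknown prompt kind: {prompt_kind!r}")
```

The point of the split is visible with stub reward models: a harmlessness scorer sees problematic text inside the reasoning trace, while a helpfulness scorer never sees the trace at all.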
- Caleb Biddulph 30 Jan 2025 16:53 UTC
  1 point
  0
  Thanks! Apparently I should go read the R1 paper :)
  - cubefox 30 Jan 2025 18:02 UTC
    2 points
    0
    Actually the paper doesn’t have any more on this topic than the paragraph above.
    - Caleb Biddulph 30 Jan 2025 21:12 UTC
      1 point
      0
      Yeah, but there are probably other interesting takeaways.