wassname comments on Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro

wassname 17 Sep 2025 1:36 UTC
1 point
0
I just stumbled across this lovely plot. It seems to indicate some generalisation (for this model, for initial RL, across domains), and is the most direct measure I’ve seen so far.

source

EDIT: upon reflection, this is a combination of RLAIF and RLVF. According to the paper:

We employ a combination of Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning with Human Feedback (RLHF)
- wassname 22 Sep 2025 23:18 UTC
  1 point
  0
  Parent
  I found it! rStar2-Agent shows that training on math with their form of RL generalised to ScienceQA.