Ah, yes, I see that you are right. The paper shows that the method generalises, not the results.
I am also uncertain whether the RLVR math training generalised well outside of math. I had a look at recent benchmarks, and it was hard to tell.
I just stumbled across this lovely plot. It seems to indicate some generalisation (for this model, for initial RL, across domains), and is the most direct measure I’ve seen so far.
source
EDIT: upon reflection, this is a combination of RLVR and RLHF. According to the paper:
We employ a combination of Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning with Human Feedback (RLHF)
I found it! rStar2-Agent shows that training on math with their form of RL generalised to ScienceQA.