I just stumbled across this lovely plot. It seems to indicate some generalisation (for this model, for initial RL, across domains), and is the most direct measure I’ve seen so far.
source
EDIT: upon reflection, this is a combination of RLHF and RLVR. According to the paper:
We employ a combination of Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning with Human Feedback (RLHF)
I found it! rStar2-Agent shows that training on math with their form of RL generalised to ScienceQA.