Is the hope that the model will “training game for good” during RLVF? That seems somewhat risky: to the extent that your RLVF rewards are imperfect and somewhat encourage cheating, they could erode away the goodness.
I think you maybe want to do PT --> preference --> RLVF --> preference, but that doesn’t feel great either.