Is the hope that the model will “training game for good” during RLVF? That seems somewhat risky: to the extent that your RLVF rewards are imperfect and somewhat encourage cheating, they could erode away the goodness.
I think you maybe want to do PT --> preference --> RLVF --> preference, but that doesn’t feel great either.