This seems worth thinking more about. Basically, it is equivalent to regular RL, except that you always add an “LLM-as-a-judge” term to the reward. That judge happens to be the pre-RL checkpoint of the model you’re training, and its term is binary: either 0 or -∞.
Note that this incentivizes the trained LLM to always care about its output looking good to the judge. Maybe this is not so different from what’s already happening, though.
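To make the reward structure concrete, here is a minimal sketch of how I read it. Everything named here (`judged_reward`, `score_samples`, the toy `task_reward_fn` and `judge_fn`) is hypothetical and only for illustration. In particular, the criterion the frozen pre-RL checkpoint would actually use to accept or reject an output is not specified above, so the judge is just a stand-in callable.

```python
import math
from typing import Callable, List


def judged_reward(task_reward: float, judge_accepts: bool) -> float:
    """Task reward plus a judge term that is 0 (accept) or -inf (reject).

    Assumes task_reward is finite; +inf plus -inf would give NaN.
    """
    judge_term = 0.0 if judge_accepts else -math.inf
    return task_reward + judge_term


def score_samples(
    samples: List[str],
    task_reward_fn: Callable[[str], float],
    judge_fn: Callable[[str], bool],
) -> List[float]:
    """Score each sample with the task reward and the judge's veto."""
    return [judged_reward(task_reward_fn(s), judge_fn(s)) for s in samples]


# Toy stand-ins (assumptions, not the real setup): the task reward is the
# output length, and the "judge" rejects anything of 20 characters or more.
# In the setting described above, judge_fn would instead query the frozen
# pre-RL checkpoint of the model being trained.
samples = ["short answer", "a much longer rambling answer"]
rewards = score_samples(
    samples,
    task_reward_fn=lambda s: float(len(s)),
    judge_fn=lambda s: len(s) < 20,
)
print(rewards)  # [12.0, -inf]
```

With a hard -∞, a rejected sample loses to any accepted one no matter how high its task reward is, which is the sense in which the trained model always has to look good to the judge.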