Caleb Biddulph comments on Consent-Based RL: Letting Models Endorse Their Own Training Updates

Caleb Biddulph 18 Apr 2026 1:55 UTC
4 points
Seems worth thinking more about. Basically, this is equivalent to regular RL with an extra “LLM-as-a-judge” term always added to the reward. The judge happens to be the pre-RL checkpoint of the model you’re training, and its term is binary: either 0 or -∞.
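As a rough sketch of that reward structure (the function and parameter names here are hypothetical, and the judge's verdict is assumed to arrive as a plain boolean rather than from an actual model call):

```python
import math


def combined_reward(task_reward: float, judge_endorses: bool) -> float:
    """Task reward plus a judge term: 0 if the judge endorses, -inf if it vetoes.

    `judge_endorses` stands in for the verdict of the pre-RL checkpoint;
    how that verdict is obtained is not modeled here.
    """
    judge_term = 0.0 if judge_endorses else -math.inf
    return task_reward + judge_term


# An endorsed sample keeps its ordinary task reward.
endorsed = combined_reward(1.5, judge_endorses=True)

# A vetoed sample is -inf no matter how high the task reward was.
vetoed = combined_reward(1.5, judge_endorses=False)
```

A literal -∞ would not work in most gradient-based RL code, so a real setup would presumably use a large finite penalty or discard vetoed samples; that is an assumption, not something stated in the post.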
Note that this incentivizes the trained LLM to always care about its output looking good to the judge. Maybe this is not so different from what’s already happening, though.