In RLHF you would ideally avoid regions where the reward model assigns high reward but is not very confident (a signal that the estimated reward may be erroneous); otherwise the aligned model might exploit that faulty high reward. That is, we could consider an extended reward that also accounts for the entropy of the reward model's prediction, giving higher reward to lower-entropy predictions (see the sketch below). This, however, presumes well-calibrated probabilities. One way to get those is a post-training calibration method such as temperature scaling. However, that would shift the method from fully self-supervised to semi-supervised, i.e. you would need at least some ground-truth datapoints for the calibration set.
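To make this concrete, here is a minimal sketch assuming a pairwise Bradley-Terry reward model; the names (`extended_reward`, `beta`, the grid-search calibration) and the use of a reference completion to get a preference probability are illustrative assumptions, not a specific implementation.

```python
# Sketch: entropy-penalized reward plus post-hoc temperature scaling for a
# pairwise (Bradley-Terry) reward model. Illustrative only.
import numpy as np

def bernoulli_entropy(p, eps=1e-12):
    """Entropy (in nats) of the reward model's pairwise preference probability."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def extended_reward(r_candidate, r_reference, beta=0.1, temperature=1.0):
    """Raw reward minus an entropy penalty.

    r_candidate, r_reference: reward-model scores for the candidate and a
    reference completion on the same prompt. The preference probability is
    the Bradley-Terry sigmoid of the score gap, rescaled by a calibration
    temperature; beta trades off reward against predictive uncertainty.
    """
    p = 1.0 / (1.0 + np.exp(-(r_candidate - r_reference) / temperature))
    return r_candidate - beta * bernoulli_entropy(p)

def fit_temperature(r_chosen, r_rejected, grid=np.linspace(0.05, 10.0, 200)):
    """Post-hoc temperature scaling on a small labeled calibration set.

    r_chosen / r_rejected are reward-model scores on ground-truth preference
    pairs (this is the semi-supervised part). Returns the temperature that
    minimizes the negative log-likelihood of the observed preferences.
    """
    gaps = np.asarray(r_chosen) - np.asarray(r_rejected)

    def nll(T):
        p = 1.0 / (1.0 + np.exp(-gaps / T))
        return -np.mean(np.log(np.clip(p, 1e-12, 1.0)))

    return min(grid, key=nll)
```

Note that temperature scaling is monotone, so it never changes which of two responses the reward model prefers; it only rescales confidence, which means it can only help to the extent that miscalibrated confidence (rather than mis-ranked preferences) is the failure mode.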
Ok, I see. It seems plausible that this could be important, though much less so than avoiding mistakes of the form "our reward model strongly prefers very bad stuff to very good stuff".
I’d be surprised if this is actually how reward over-optimization goes badly in practice (e.g. I’d predict that no amount of temperature scaling would have saved OpenAI from building sycophantic models), and I haven’t seen demos of RLHF producing more/less “hacking” when temperature-scaled.