Thanks for your reply! I think that having overconfident reward predictions could make the aligned model severely prone to reward hacking. Would love to hear other opinions, of course!
What training setup are you imagining? Is it about doing RLHF against a reward model that was trained with unsupervised elicitation?
In this setting, I’d be surprised if having an overconfident reward model resulted in stronger reward hacking. Reward hacking usually comes from the reward model rating bad things more highly than good things, and I don’t understand how multiplying a reward model logit by a fixed constant (making it overconfident) would make that problem worse. Some RL-ish algorithms like DPO work with binary preference labels, and I am not aware of that causing any reward-hacking issues.
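As a toy illustration of the logit-scaling point (hypothetical reward values, just to show that a fixed multiplicative constant makes the pairwise probabilities more extreme without changing which completion is preferred):

```python
import numpy as np

def preference_prob(r_a, r_b):
    """Bradley-Terry probability that completion A is preferred to completion B."""
    return 1.0 / (1.0 + np.exp(-(r_a - r_b)))

# Hypothetical reward-model scores for two completions.
r_good, r_bad = 1.2, 0.4

for scale in [1.0, 5.0]:  # scale > 1 makes the reward model "overconfident"
    p = preference_prob(scale * r_good, scale * r_bad)
    ranking = "good > bad" if scale * r_good > scale * r_bad else "bad > good"
    print(f"scale={scale}: P(good beats bad)={p:.3f}, ranking: {ranking}")

# The preference probability becomes more extreme, but the ranking (and hence
# what best-of-n sampling or a greedy policy would pick) is unchanged.
```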
In RLHF, ideally you would like to avoid regions where the reward model gives high rewards but is not very confident (signaling that the estimated reward might be erroneous); otherwise, the policy being aligned might exploit this faulty high reward. That is, we could consider an extended reward that also accounts for the entropy of the prediction (higher reward is given to lower-entropy predictions). This, however, presumes well-calibrated probabilities. One way to get them is post-hoc calibration (temperature scaling, for example). However, this would shift the method from fully self-supervised to semi-supervised (i.e., you would need at least some ground-truth datapoints for the calibration set).
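A minimal sketch of what I have in mind, assuming a Bradley-Terry-style reward model whose pairwise logits we calibrate with temperature scaling on a small labeled set and then penalize by the entropy of the calibrated prediction. The helper names, the entropy weight `lam`, and the toy numbers are all hypothetical:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # sigmoid

def fit_temperature(logits, labels):
    """Fit a single temperature T on a small labeled calibration set by
    minimizing the negative log-likelihood of the binary preference labels."""
    def nll(T):
        p = np.clip(expit(logits / T), 1e-8, 1 - 1e-8)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

def entropy_penalized_reward(logit, T, lam=0.5):
    """Extended reward: calibrated score minus a penalty proportional to the
    binary entropy of the calibrated preference probability."""
    p = expit(logit / T)
    entropy = -(p * np.log(p + 1e-8) + (1 - p) * np.log(1 - p + 1e-8))
    return logit / T - lam * entropy

# Toy calibration data: raw pairwise logits from the reward model plus a few
# ground-truth preference labels (this is the semi-supervised part).
calib_logits = np.array([2.1, -0.3, 4.0, 1.5, -2.2, 0.7])
calib_labels = np.array([1,    0,    1,   0,    0,    1])

T = fit_temperature(calib_logits, calib_labels)
print(f"fitted temperature: {T:.2f}")
print(entropy_penalized_reward(3.0, T), entropy_penalized_reward(0.2, T))
# The uncertain prediction (logit near 0) is penalized more heavily than the
# confident one, so "high but uncertain" rewards become less attractive to the policy.
```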
Ok, I see. It seems plausible that this could matter, though it seems much less important than avoiding mistakes of the form “our reward model strongly prefers very bad stuff to very good stuff”.
I’d be surprised if this is actually how reward over-optimization goes badly in practice (e.g. I’d predict that no amount of temperature scaling would have saved OpenAI from building sycophantic models), and I haven’t seen demos of RLHF producing more/less “hacking” when temperature-scaled.