Thanks for your reply! I think that having overconfident reward predictions could make the aligned model severely prone to reward hacking. Would love to hear other opinions, of course!
Ifigeneia Apostolopoulou
Hi there,
I think that papers in this line of research should definitely report calibration metrics if models elicited this way are to be trusted in practice. I have run some experiments on the weak-to-strong setup for reward modeling, and naive fine-tuning gave poor calibration. Not sure how self-supervised elicitation behaves. Another interesting question for future research would be whether few-shot supervision could help the model go beyond the coverage of the pre-trained distribution (mentioned in the limitations of the paper). Curious to hear what you guys think.
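For concreteness, here is a minimal sketch of the kind of calibration metric I have in mind (expected calibration error over the reward model's pairwise preference predictions); the array names and bin count are just placeholders, not anything from the paper:

```python
# Minimal sketch: expected calibration error (ECE) for a reward model's
# pairwise preference predictions. probs[i] is the model's predicted
# probability that completion A beats completion B, labels[i] the ground truth.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    # Confidence of the predicted class and whether that prediction was correct.
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    bins = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if mask.any():
            # Gap between average confidence and empirical accuracy in this bin,
            # weighted by the fraction of samples landing in the bin.
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```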
Thanks!
In RLHF, ideally you would like to avoid regions where the reward model assigns high reward but is not very confident (signaling that the estimated reward might be erroneous); otherwise, the aligned model might exploit this faulty high reward. That is, we could consider an extended reward that also accounts for the entropy of the prediction (higher reward is given to lower-entropy predictions). This, however, presumes well-calibrated probabilities. One way to get there is to use post-hoc calibration methods (temperature scaling, for example). However, this would shift the method from fully self-supervised to semi-supervised (i.e., you would need at least some ground-truth datapoints for the calibration set).
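To make the two ideas concrete, here is a rough sketch assuming a Bradley-Terry style reward model that outputs a logit for "A preferred over B". The function names, the lambda_ent coefficient, and treating the calibrated logit itself as the reward signal are my own simplifications for illustration, not anything from the paper:

```python
# Sketch of (1) post-hoc temperature scaling on a small labeled calibration set
# (the step that makes the pipeline semi-supervised) and (2) an entropy-adjusted
# reward that discounts high-reward but uncertain predictions.
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T by minimizing NLL on held-out ground-truth
    preference labels (this is where some labeled data is needed)."""
    logits, labels = np.asarray(logits, float), np.asarray(labels, float)

    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))
        eps = 1e-12
        return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def entropy_adjusted_reward(logit, T, lambda_ent=0.1):
    """Calibrated reward minus an entropy penalty, so the policy is less
    tempted to exploit high but uncertain reward estimates."""
    p = 1.0 / (1.0 + np.exp(-logit / T))
    entropy = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
    return logit / T - lambda_ent * entropy
```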