Thanks for your reply! I think that having overconfident reward predictions could make the aligned model severely prone to reward hacking. Would love to hear other opinions, of course!
Ifigeneia Apostolopoulou
Hi there,
I think that papers in this line of research should definitely report calibration metrics if models elicited this way are to be trusted in practice. I have run some experiments on the weak-to-strong setup for reward modeling, and naive fine-tuning gave poor calibration. Not sure how self-supervised elicitation behaves. Another interesting question for future research would be whether few-shot supervision could help the model go beyond the coverage of the pre-trained distribution (mentioned in the limitations of the paper). Curious to hear what you guys think.
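For concreteness, here is a minimal sketch of the kind of calibration metric I have in mind (expected calibration error over the reward model's pairwise preference predictions); the array names and bin count are just placeholders, not anything from the paper:

```python
# Minimal sketch: expected calibration error (ECE) for a reward model's
# pairwise preference predictions. probs[i] is the model's predicted
# probability that completion A beats completion B, labels[i] the ground truth.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    probs, labels = np.asarray(probs, float), np.asarray(labels, int)
    # Confidence of the predicted class and whether that prediction was correct.
    preds = (probs >= 0.5).astype(int)
    conf = np.where(preds == 1, probs, 1.0 - probs)
    correct = (preds == labels).astype(float)

    bins = np.linspace(0.5, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf >= lo) & ((conf < hi) if hi < 1.0 else (conf <= hi))
        if mask.any():
            # Gap between average confidence and empirical accuracy in this bin,
            # weighted by the fraction of samples landing in the bin.
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece
```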
Thanks!
In RLHF, ideally you would like to avoid regions where the reward model assigns high reward but is not very confident (signaling that the estimated reward might be erroneous); otherwise, the aligned model might exploit this faulty high reward. That is, we could consider an extended reward that also accounts for the entropy of the prediction (higher reward is given to lower-entropy predictions). This, however, presumes well-calibrated probabilities. One way to get there is to use post-hoc calibration methods (temperature scaling, for example). However, this would shift the method from fully self-supervised to semi-supervised (i.e., you would need at least some ground-truth datapoints for the calibration set).
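To make the two ideas concrete, here is a rough sketch assuming a Bradley-Terry style reward model that outputs a logit for "A preferred over B". The function names, the lambda_ent coefficient, and treating the calibrated logit itself as the reward signal are my own simplifications for illustration, not anything from the paper:

```python
# Sketch of (1) post-hoc temperature scaling on a small labeled calibration set
# (the step that makes the pipeline semi-supervised) and (2) an entropy-adjusted
# reward that discounts high-reward but uncertain predictions.
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit a single temperature T by minimizing NLL on held-out ground-truth
    preference labels (this is where some labeled data is needed)."""
    logits, labels = np.asarray(logits, float), np.asarray(labels, float)

    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))
        eps = 1e-12
        return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def entropy_adjusted_reward(logit, T, lambda_ent=0.1):
    """Calibrated reward minus an entropy penalty, so the policy is less
    tempted to exploit high but uncertain reward estimates."""
    p = 1.0 / (1.0 + np.exp(-logit / T))
    entropy = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
    return logit / T - lambda_ent * entropy
```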