I think it would be great to have even non-calibrated elicitation!
In practice, most LLM classifiers are trained to fit the training distribution very well and are miscalibrated OOD (this is the case for all RLHF-based prompted classifiers and for classifiers like the ones in the Constitutional Classifiers paper). Developers then pick a threshold by measuring what FPR is bearable, and check whether the recall is high enough using red-teaming. I have not seen LLM developers trust model-stated probabilities.
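For concreteness, a rough sketch of that threshold-picking workflow (the function names and the 0.1% FPR target are illustrative, not any particular developer’s actual pipeline):

```python
import numpy as np

def pick_threshold_at_fpr(benign_scores, target_fpr=0.001):
    """Pick the flagging threshold so that roughly target_fpr of benign
    traffic gets flagged (e.g. 0.1% FPR)."""
    return np.quantile(np.asarray(benign_scores), 1 - target_fpr)

def recall_on_red_team(attack_scores, threshold):
    """Fraction of red-team attack examples caught at that threshold."""
    return float((np.asarray(attack_scores) >= threshold).mean())

# Usage sketch, with scores coming from whatever classifier is deployed:
# threshold = pick_threshold_at_fpr(benign_scores, target_fpr=0.001)
# print(recall_on_red_team(red_team_scores, threshold))
```

Note that nothing here uses the classifier’s scores as probabilities; only the ordering of the scores matters.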
But maybe I am missing something about the usefulness of actually striving for calibration in the context of ELK? I am not sure it is possible to be even somewhat confident that a classifier’s calibration is good, given how hard it is to generate labeled train/test sets that are IID with the classification settings we actually care about (though it’s certainly possible to end up with calibration that is worse than what you would get by just relying on being lucky with generalization).
If you use ELK for forecasting, I think you can use a non-calibrated elicitation method on questions like “the probability of X happening is between 0.4 and 0.5” and get the usefulness of calibrated classifiers.
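A minimal sketch of what I mean, assuming a hypothetical elicit_score function that returns an uncalibrated score for how strongly the method endorses a claim (only the argmax over bins is used, so the absolute values of the scores don’t matter):

```python
def forecast_probability(event, elicit_score, num_bins=10):
    """Ask the (uncalibrated) elicitor about each probability bin and return
    the midpoint of the bin it endorses most strongly."""
    best_bin, best_score = None, float("-inf")
    for i in range(num_bins):
        lo, hi = i / num_bins, (i + 1) / num_bins
        claim = f"The probability of {event} happening is between {lo:.1f} and {hi:.1f}."
        score = elicit_score(claim)
        if score > best_score:
            best_bin, best_score = (lo, hi), score
    return (best_bin[0] + best_bin[1]) / 2  # e.g. 0.45 for the 0.4-0.5 bin
```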
Do you think of this work as an ELK thing?
It’s at least related. Like CCS, I see it as targeting some average-case ELK problem of eliciting an AI’s “true belief” (+ maybe some additional learning, unsure how much) in domains where you don’t have ground-truth labels.
My excitement about it solving ELK in practice will depend on how robust it is to variations that make the setting closer to the most important elicitation problems (e.g. situations where an AI knows very well what the humans want to hear, and where this differs from what it believes to be true).
Thanks for your reply! I think that having overconfident reward predictions could make the aligned model severely prone to reward hacking. Would love to hear other opinions, of course!
What training setup are you imagining? Is it about doing RLHF against a reward model that was trained with unsupervised elicitation?
In this setting, I’d be surprised if having an overconfident reward model resulted in stronger reward hacking. Reward hacking usually comes from the reward model rating bad things more highly than good things, and I don’t understand how multiplying the reward model’s logits by a fixed constant (making it overconfident) would make that problem worse. Some RL-ish algorithms like DPO work with binary preference labels, and I am not aware of this causing any reward hacking issues.
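A toy illustration of that point (made-up scores, nothing model-specific): scaling the logits changes confidence but not the preference ordering, which is all that comparison-based training signals see.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=100)    # made-up reward-model logits for 100 completions
overconfident = 5.0 * logits     # same model with its logits scaled by a fixed constant

# The preference ordering over completions is identical, so any comparison-based
# signal ("which of these two completions is rated higher?") is unchanged.
assert np.array_equal(np.argsort(logits), np.argsort(overconfident))
```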
In RLHF, you would ideally like to avoid regions where the reward model gives high rewards but is not very confident (signaling that the estimated reward might be erroneous); otherwise, the model being aligned might exploit this faulty high reward. That is, we can consider an extended reward that also accounts for the entropy of the prediction (higher reward is given to lower-entropy predictions). This, however, presumes well-calibrated probabilities. One way to achieve that is to use post-training calibration methods (temperature scaling, for example). However, this would shift the method from fully self-supervised to semi-supervised (i.e. you would need at least some ground-truth datapoints for the calibration set).
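A rough sketch of what I have in mind, assuming a binary reward model that outputs a “good” probability; the entropy penalty weight and the grid-search temperature fit are illustrative choices, not a specific recipe:

```python
import numpy as np

def entropy_penalized_reward(p_good, beta=0.1):
    """Extended reward: the calibrated 'good' probability minus an entropy
    penalty, so uncertain high-reward regions are worth less to the policy."""
    p = np.clip(p_good, 1e-6, 1 - 1e-6)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return p - beta * entropy

def fit_temperature(logits, labels, grid=np.linspace(0.05, 20.0, 400)):
    """Standard temperature scaling: pick the temperature minimizing NLL on a
    small labeled calibration set (this is the semi-supervised part)."""
    logits, labels = np.asarray(logits), np.asarray(labels)

    def nll(t):
        p = np.clip(1 / (1 + np.exp(-logits / t)), 1e-6, 1 - 1e-6)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    return min(grid, key=nll)

# Usage sketch:
# t = fit_temperature(calib_logits, calib_labels)
# reward = entropy_penalized_reward(1 / (1 + np.exp(-sample_logits / t)))
```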
Ok, I see. It seems plausible that this could be important, though it seems much less important than avoiding mistakes of the form “our reward model strongly prefers very bad stuff to very good stuff”.
I’d be surprised if this is actually how reward over-optimization goes badly in practice (e.g. I’d predict that no amount of temperature scaling would have saved OpenAI from building sycophantic models), and I haven’t seen demos of RLHF producing more/less “hacking” when temperature-scaled.