Ok, I see. It seems plausible that this could be important, though much less important than avoiding mistakes of the form “our reward model strongly prefers very bad stuff to very good stuff”.
I’d be surprised if this is actually how reward over-optimization goes badly in practice (e.g. I’d predict that no amount of temperature scaling would have saved OpenAI from building sycophantic models), and I haven’t seen demos of RLHF producing more/less “hacking” when temperature-scaled.
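(For concreteness, here’s a rough sketch of what I understand “temperature scaling” of a reward model to mean: fit a single scalar temperature on held-out human comparisons under a Bradley-Terry model, then divide the reward model’s scores by it before RL. The names and setup below are illustrative, not anyone’s actual pipeline. Note that this is a monotone rescaling: it changes how confident the reward model is, not which of two outputs it prefers, which is part of why I don’t expect it to fix ordering errors like the one above.)

```python
# Illustrative sketch of temperature-scaling a reward model (hypothetical names throughout).
# Assumes a Bradley-Terry preference setup: P(chosen > rejected) = sigmoid((r_c - r_r) / T).
import torch

def fit_temperature(reward_gaps: torch.Tensor, n_steps: int = 200, lr: float = 0.01) -> float:
    """Fit a scalar temperature on held-out (chosen - rejected) reward gaps.

    reward_gaps: tensor of r(chosen) - r(rejected) over held-out human comparisons.
    Minimizes the Bradley-Terry negative log-likelihood with respect to log T.
    """
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        # softplus(-x) == -log(sigmoid(x)), i.e. the NLL of "chosen was preferred"
        loss = torch.nn.functional.softplus(-reward_gaps / log_t.exp()).mean()
        loss.backward()
        opt.step()
    return log_t.exp().item()

def calibrated_reward(raw_reward: torch.Tensor, temperature: float) -> torch.Tensor:
    """Rescale raw reward-model scores before handing them to the RL objective."""
    return raw_reward / temperature
```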