Ok, I see. It seems plausible that this could be important, though much less important than avoiding mistakes of the form “our reward model strongly prefers very bad stuff to very good stuff”.
I’d be surprised if this is actually how reward over-optimization goes badly in practice (e.g. I’d predict that no amount of temperature scaling would have saved OpenAI from building sycophantic models), and I haven’t seen demos of RLHF producing more/less “hacking” when temperature-scaled.
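(For concreteness, here’s a rough sketch of what I understand “temperature scaling” of a reward model to mean: fit a single scalar temperature on held-out human comparisons under a Bradley-Terry model, then divide the reward model’s scores by it before RL. The names and setup below are illustrative, not anyone’s actual pipeline. Note that this is a monotone rescaling: it changes how confident the reward model is, not which of two outputs it prefers, which is part of why I don’t expect it to fix ordering errors like the one above.)

```python
# Illustrative sketch of temperature-scaling a reward model (hypothetical names throughout).
# Assumes a Bradley-Terry preference setup: P(chosen > rejected) = sigmoid((r_c - r_r) / T).
import torch

def fit_temperature(reward_gaps: torch.Tensor, n_steps: int = 200, lr: float = 0.01) -> float:
    """Fit a scalar temperature on held-out (chosen - rejected) reward gaps.

    reward_gaps: tensor of r(chosen) - r(rejected) over held-out human comparisons.
    Minimizes the Bradley-Terry negative log-likelihood with respect to log T.
    """
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        # softplus(-x) == -log(sigmoid(x)), i.e. the NLL of "chosen was preferred"
        loss = torch.nn.functional.softplus(-reward_gaps / log_t.exp()).mean()
        loss.backward()
        opt.step()
    return log_t.exp().item()

def calibrated_reward(raw_reward: torch.Tensor, temperature: float) -> torch.Tensor:
    """Rescale raw reward-model scores before handing them to the RL objective."""
    return raw_reward / temperature
```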