Except that we have legions of other users who do provide non-random answers. Maybe you should grade the worse answer?
I think there are two problems with providing the ‘worse answer’. My first issue is that some conversations with LLMs are about topics where there is no clearly worse answer. How am I supposed to tell which one is more persuasive?
Secondly, even if I knew which answer was better, I worry about the Waluigi effect: if I optimize for the safest response, am I summoning an unsafe Waluigi? I think that’s possible. I really don’t think RL on user feedback is a good idea when we don’t know what to optimize for, and the alignment problem certainly isn’t solved. I think flipping a coin is safer.
If I shouldn’t flip a coin, what kind of answer, more specific than just ‘worse’, do you think I should pick?
Maybe so; I don’t think it would be wrong to do that. Still, it does feel like a more hostile act, and adding noise to a signal is qualitatively different from falsifying a signal. That’s why I hesitated to recommend it, even though it was actually my first instinct. It’s very possible I’m just being silly, but that’s why I didn’t suggest it.
If OpenAI were going all in on dialing up persuasiveness, I don’t think I would have hesitated. But they’ve earned a bit of goodwill from me on this very specific dimension by making the ChatGPT 5 models significantly less bad in this respect.