I think there are two problems with providing the ‘worse answer’. My first issue is that some conversations with LLMs can be about topics that don’t have a clearly worse answer. How can you tell which one is more persuasive?
Secondly, even if I knew which answer was better, I would still worry about the Waluigi effect: if I optimize for the safest response, am I summoning an unsafe Waluigi? I think that is possible. I really don’t think RL on user feedback is a good idea when we don’t know what to optimize for; the alignment problem certainly isn’t solved. I think flipping a coin is safer.
If I shouldn’t flip a coin, what kind of answer, more specific than ‘worse’, do you think I should pick?