I think there are two problems with providing the ‘worse answer’. My first issue is that some conversations with LLMs can be about topics that don’t have a clearly worse answer. How can you tell which one is more persuasive?
Secondly, even if I knew which answer was better, I worry about the Waluigi effect. If I optimize for the safest response, am I summoning an unsafe Waluigi? I think that is possible. I really don’t think RL on user feedback is a good idea when we don’t know what to optimize for. The alignment problem certainly isn’t solved. I think flipping a coin is safer.
What kind of answer, more specifically than ‘worse’, do you think I should pick, if I shouldn’t flip a coin?
I’ve discovered that generating a video of yourself with Sora 2, saying something like ‘this video was generated by AI’, is freaky enough to make people who know you well, especially those skeptical about AI capabilities, start to freak out a bit.
Thought this might be a useful idea for others trying to persuade people to tune in, rather than auto-reject the idea that very capable systems might be right around the corner.