I will bet money there will turn out to be some consideration here that is novel to LessWrong and requires changing some kind of policy or feature somewhere, when we get to the point that AIs first look like they are generating high quality alignment content.
Like, at the very least, I think we will want to have some kind of conversation about “okay but is it actually generating high quality alignment content, or, is it sychophanting us? Are misaligned or slightly-misaligned AIs subtly manipulating us?” when we hit that point.
I will bet money there will turn out to be some consideration here that is novel to LessWrong and requires changing some kind of policy or feature somewhere, when we get to the point that AIs first look like they are generating high quality alignment content.
Like, at the very least, I think we will want to have some kind of conversation about “okay but is it actually generating high quality alignment content, or, is it sychophanting us? Are misaligned or slightly-misaligned AIs subtly manipulating us?” when we hit that point.