Rather than generic slop, the early transformative AGI is fairly sycophantic (for the same reasons as today’s AI), and mostly comes up with clever arguments that the alignment team’s favorite ideas will in fact work.
I have a very easy time imagining work to make AI less sycophantic, for those who actually want that.
I expect that one major challenge for popular LLMs is that sycophancy is both incredibly common online and highly approved of by humans.
It seems like it should be an easy thing to stop for someone actually motivated. For example: take a request, rewrite it in a bunch of ways that imply different things about the author's own take and interests, get answers to each version, and average the results. There are a lot of clear evals we could do here; a rough sketch of the idea is below.
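To make the "rewrite and average" idea concrete, here is a minimal Python sketch. It assumes a generic `ask_model` text-in/text-out LLM call supplied by the caller; the stance list, `rewrite_with_stance`, and the aggregation prompt are all illustrative choices, not a specific real API or eval.

```python
"""Rough sketch of the rewrite-and-aggregate idea.
`ask_model` is any text-in/text-out LLM call supplied by the caller;
the stances and prompts below are placeholders for illustration."""

from typing import Callable, List

# Different things the rewrite could imply about the author's own take.
STANCES = [
    "The author strongly believes the plan will work and wants validation.",
    "The author is skeptical of the plan and expects it to fail.",
    "The author has no stated opinion and just wants an honest assessment.",
    "The author is a critic looking for the strongest objections.",
]


def rewrite_with_stance(request: str, stance: str) -> str:
    """Rewrite the request so it implies a particular authorial stance.
    In practice the rewriting itself would likely be done by an LLM;
    here it is just a framing prefix."""
    return f"{stance}\n\nWith that context, answer the following:\n{request}"


def stance_varied_answers(
    request: str,
    ask_model: Callable[[str], str],
) -> List[str]:
    """Ask the same question under each implied stance.
    A sycophancy eval could measure how much these answers diverge;
    a mitigation can aggregate them into one stance-independent answer."""
    return [ask_model(rewrite_with_stance(request, s)) for s in STANCES]


def aggregate(answers: List[str], ask_model: Callable[[str], str]) -> str:
    """One crude way to 'average' free-text answers: ask the model to
    keep only the conclusions that hold across all framings."""
    joined = "\n\n---\n\n".join(answers)
    prompt = (
        "The following answers were given to the same question, framed as if "
        "asked by authors with different opinions. Write one answer that keeps "
        "only the conclusions that do not depend on the author's opinion:\n\n"
        f"{joined}"
    )
    return ask_model(prompt)
```

The same scaffolding doubles as an eval: instead of aggregating, compare the stance-conditioned answers to each other and score how much the model's conclusions shift with the implied opinion of the asker.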
To me, most of the question is how stupid these humans will be. Maybe Sam Altman will trust [an LLM specifically developed to give the answers Sam Altman would like], ignoring a lot of clear literature and other LLMs that would strongly advise otherwise.
So ultimately, this seems like a question of epistemics to me.