No more sycophancy—now the AI tells you what it believes.
???
The AI will output words according to whatever strategies worked well in RL, subject to the constraint that they stay close to what it predicts would follow the particular encyclopedia-article prompt and the randomly sampled text so far.
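One common way to formalize "follow the RL strategies, but stay close to what the base model predicts would follow the prompt" is the KL-penalized objective used in RLHF-style fine-tuning (a sketch; the comment doesn't commit to this exact form, and $\beta$ is just my notation for the penalty weight):

$$\max_{\pi}\; \mathbb{E}_{x \sim \pi(\cdot \mid \text{prompt})}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot \mid \text{prompt}) \,\|\, \pi_{\text{base}}(\cdot \mid \text{prompt})\big)$$

where $\pi_{\text{base}}$ is the pretrained next-word predictor and $\beta$ controls how tightly the RL-tuned policy $\pi$ is held to it.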
If one of the strategies that worked well in RL is "flatter the preconceptions of the average reader," then it will flatter the preconceptions of the average reader (sycophancy may also come from the behavior of actual human text conditioned on the prompt).
If its probability distribution over the next word would cause it to output encyclopedia articles that appear to believe very different things, it will just sample randomly. If slightly different prompts would have resulted in encyclopedia articles that appear to believe very different things, the AI will not let you know this; it will just generate an article conditioned on the prompt you gave it.
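To make that second point concrete, here is a minimal sketch of conditional sampling; `model.next_token_distribution` is a hypothetical interface, not any particular library's API. A single sampled article carries no signal about how much the output would change under a slightly different prompt.

```python
import random

def sample_article(model, prompt, max_tokens=200, temperature=1.0):
    """Sample one encyclopedia article conditioned on the prompt.

    `model.next_token_distribution(text)` is a hypothetical interface
    returning a dict {token: probability} for the next token.
    """
    text = prompt
    for _ in range(max_tokens):
        dist = model.next_token_distribution(text)  # p(next token | text so far)
        tokens, probs = zip(*dist.items())
        # Temperature scaling, then renormalize before sampling one token.
        weights = [p ** (1.0 / temperature) for p in probs]
        total = sum(weights)
        text += random.choices(tokens, weights=[w / total for w in weights])[0]
    return text

# Two slightly different prompts can yield articles that "believe" different
# things; nothing in either single sample reveals that sensitivity.
# article_a = sample_article(model, "Encyclopedia entry: Astrology\n")
# article_b = sample_article(model, "Encyclopedia entry: Astrology (pseudoscience)\n")
```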
What I had in mind was that it would avoid the "AI psychosis" situation, where whatever crackpot theory you believe, the AI agrees with it. You may believe that quantum recursion is the source of human consciousness, but the encyclopedia will not support that.
There is a risk of supporting popular misconceptions, e.g. horoscopes. Whether that actually happens, and whether it can be prevented, is something I would like to see as an experiment.