Right now, it seems that the most likely way we’re gonna get an (intellectually) universal AI is by scaling models such as GPT. That is, models trained by self-supervised learning on massive piles of data, perhaps with a similar architecture to the transformer.
I do not see any risk due to misalignment here.
One failure mode I’ve seen discussed is that of manipulative answers, as seen in Predict-O-Matic. Maybe those AIs will learn that manipulating users to do actions with low entropy outcomes decreases the overall prediction error?
But why should a GPT-like ever output manipulative answers? I am not denying the possibility that a GPT successor develops human level intelligence. When it learns to predict the next word, it may genuinely go through an intellectual process which was created as it was forced to compress its predictions due to the ever increasing amounts of data it had to go through.
However, nowhere in the process of constructing a valid response does there seem to be an incentive to produce responses which manipulate the environment, be it to make it easier to predict, or to make it more in-line with the AI’s predictions. After all, it wasn’t trained in a responsive environment as an agent, but on a static dataset. And when it is in use, it’s just a frozen model, so there is obviously no utility function.
Am I wrong here? Are there any other failure modes I did not think of?