I don’t think language models will take actions to make future tokens easier to predict
For an analogy, look at recommender systems. Their objective is myopic in the same way as language models': predict which recommendation will most likely result in a click. Yet they have power-seeking strategies available, such as shifting a user's preferences to make their behavior easier to predict. These incentives are well documented, and simulations confirm the predictions here and here. The real-world evidence is scant: a study of YouTube's supposed radicalization spiral came up negative, though the authors didn't log in to YouTube, which may have reduced the personalization of the recommendations they saw.
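The preference-shifting incentive can be sketched with a toy simulation (my own illustrative setup, not taken from the cited studies): a user whose preferences drift toward whatever is recommended. A recommender that keeps pushing the same item makes the user's clicks easier to predict, even though its objective is the myopic "predict a click".

```python
import math

# User's current click probabilities for items A and B.
prefs = [0.5, 0.5]

def entropy(p):
    """Unpredictability of the user's clicks, in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

history = [entropy(prefs)]
for _ in range(20):
    # The recommender always shows item A; the user's preference
    # drifts toward what they are shown (the assumed feedback loop).
    prefs[0] = min(0.99, prefs[0] + 0.05 * (1 - prefs[0]))
    prefs[1] = 1 - prefs[0]
    history.append(entropy(prefs))

# Click entropy fell: the user became easier to predict, so the
# myopic click-prediction objective rewarded shifting their preferences.
assert history[-1] < history[0]
```

The drift rate and the always-recommend-A policy are assumptions chosen to make the mechanism visible; the point is only that lowering user entropy lowers prediction loss, so the incentive exists.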
The jury is out on whether current recommender systems execute power-seeking strategies to improve their supposedly myopic objective. But the incentive and the means are clearly present, and to me it seems only a matter of time before we observe this behavior in the wild. Similarly, while I don't think current language models are creative or capable enough to execute a power-seeking strategy, power seeking by a superintelligent language model would plausibly be rewarded with lower loss. If a language model could use its outputs to persuade humans to train it with more compute on more data, thereby reducing its loss, there seems to be every incentive for the model to seek power in this way.
As I understand it, GPT-3 and co. are trained via self-supervised learning with the goal of minimising predictive loss. During training, their actions/predictions do not influence their future observations in any way. The training process does not select for trying to control or alter the text input, because that is something the AI cannot accomplish during training.
As such, we shouldn't expect the AI to demonstrate such behaviour: it was not selected for power seeking.
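A minimal sketch of this point (a toy bigram model, not GPT-3's actual training code): the loss is computed against a fixed corpus, and nothing the model predicts ever feeds back into that corpus, so "altering its input" is not a behaviour gradient descent can reward.

```python
import math

VOCAB = ["the", "cat", "sat", "on", "mat"]
corpus = ["the", "cat", "sat", "on", "the", "mat"]  # fixed training data

# A toy "model": bigram counts with add-one smoothing,
# turned into next-token probabilities.
counts = {w: {v: 1.0 for v in VOCAB} for w in VOCAB}

def prob(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

def loss():
    # Negative log-likelihood of the corpus under the model.
    return -sum(math.log(prob(corpus[i], corpus[i + 1]))
                for i in range(len(corpus) - 1))

before = loss()

# One "training step": update the model toward the observed bigrams.
# Note that only `counts` (the model) changes; `corpus` never does.
for i in range(len(corpus) - 1):
    counts[corpus[i]][corpus[i + 1]] += 1.0

after = loss()
assert after < before                                  # loss went down
assert corpus == ["the", "cat", "sat", "on", "the", "mat"]  # data untouched
```

The update rule is a stand-in for gradient descent, but the structural point carries over: the optimisation pressure all flows into the model's parameters, never into the observations.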