On Preference Manipulation in Reward Learning Processes

In this post, I will write about the problem of reward learning agents influencing human preferences. My impression is that this issue is already known within the AI safety community, and is a special case of a known problem (reward tampering). However, I have not seen much writing about it, so here are my thoughts as well as a possible solution approach. I am thinking of turning this into a research project and would be happy about any feedback in that direction.

The Problem of Preference Shifts

A lot of current research on AI alignment, from the finetuning of large language models using human preferences to assistance games, features some form of reward learning involving humans. By this I mean a learning process in which the AI provides some form of data to a human, who then provides feedback that can be used to update the AI’s beliefs about the reward signal.

For example, when finetuning an assistant language model, the data can take the form of written responses to human prompts. The human provides feedback in the form of a preference ranking, which is used to update the reward model. In assistance games such as Cooperative Inverse Reinforcement Learning, the data and feedback are more varied; essentially, they consist of all the actions that the AI and the human can take.
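To make this concrete, here is a minimal sketch of the kind of reward-model update used with pairwise preference rankings, assuming a Bradley-Terry model over human comparisons; the embedding dimension, names, and random stand-in data are placeholders rather than any particular implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: maps a fixed-size response embedding to a scalar reward."""
    def __init__(self, embedding_dim: int = 768):
        super().__init__()
        self.head = nn.Linear(embedding_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.head(response_embedding).squeeze(-1)

def preference_loss(model: RewardModel,
                    preferred: torch.Tensor,
                    rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: responses the human ranked higher should get higher reward."""
    r_preferred = model(preferred)
    r_rejected = model(rejected)
    # Minimized when the reward model agrees with the human's ranking.
    return -F.logsigmoid(r_preferred - r_rejected).mean()

# One update step on a (hypothetical) batch of human comparisons.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
preferred_batch = torch.randn(16, 768)  # stand-in embeddings of preferred responses
rejected_batch = torch.randn(16, 768)   # stand-in embeddings of rejected responses

optimizer.zero_grad()
loss = preference_loss(model, preferred_batch, rejected_batch)
loss.backward()
optimizer.step()
```

The point of the sketch is only that the human’s ranking is the sole channel through which the reward model is updated, so anything that changes how the human ranks also changes the learned reward.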

Learning a reward signal using such a process is advantageous in complex environments, where it seems easier to specify goals via a model based on human interaction than via a handcrafted reward signal. Ideally, such a process will result in the AI learning the true reward function of the human and optimizing for it. Unfortunately, in practice the data provided by the AI may change the human’s reward function. If it is possible to change the human’s reward function so that the AI can get more reward, then the AI will be incentivized to do so. If a reward-learning cooking AI figures out that it can easily convince the human to order toast, it will never have to bake a cake again. In the recommender systems literature, this problem, which is sometimes called preference shifting, is already well known.
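To spell the incentive out slightly more formally (my own rough notation, not taken from any particular paper): write θ_H for the human’s current preference state, d for the data the AI shows them, T(θ_H, d) for the preference state after seeing d, and R_θ for the reward model that results from feedback given under preference state θ. The combined learning-and-optimization process then effectively selects for

```latex
\max_{\pi,\, d} \;\; \mathbb{E}\!\left[\, R_{T(\theta_H,\, d)}(\tau_\pi) \,\right]
```

where τ_π is the behavior produced by the AI’s policy π. Whenever some choice of d yields a shifted preference state that is easier to score highly under than θ_H itself, manipulating the human is the rewarded strategy.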

Good Shifts and Bad Shifts

A preference shift may be benign and the natural result of the human learning about new strategies for achieving their goal. It may be that a human would have rated a policy badly because they assumed it would not work, but change their mind after a demonstration by the AI. For example, a human may be doubtful about the exotic recipe used by the cooking AI, but end up loving the resulting dish. For a reward-learning AI to be competitive, we would like to allow this kind of shift. On the other hand, there are shifts which we would clearly consider malign. To stay with the established example of recommender systems, a human who has been presented with only radical political content may end up with different views on what constitutes moderate or centrist media. Other ways in which an AI could shift human preferences include asking suggestive questions or taking actions that change the environment in ways that alter the human’s preferences.

How big of a problem is this?

I expect that preference shifts, as a way for an AI to manipulate a human, will be more dangerous the more complicated the interaction pattern during training can be. Let us consider a reward-learning cooking AI for which an episode consists of cooking dinner. In such a setting, given that the human has a strong preference for pizza over lasagne, how might a learning process cause this preference to shift?

Firstly, consider an implementation of Reward Learning from Human Preferences, in which the cooking AI runs for multiple episodes, which are then ranked by the human. I assume that merely inspecting episodes of the AI cooking various dishes will not significantly shift the human’s preferences. If they are never shown episodes with pizza, or the AI somehow always fails when trying to bake pizza, the human may eventually be content with lasagne. However, this presumably does not change their preferences, and if they had the option to have pizza instead, they would.

As an alternative, we can imagine the learning process as a kind of assistance game in which the human and the AI cook dinner together. Crucially, such a game may include asking and reacting to clarifying questions, which gives the AI many more options for interacting with the human and hence for manipulating them. For example, remarks about the relative healthiness or popularity of the dishes may well get the human to change their preferences.

Of course, the risks posed by such manipulative behavior scale with an AI’s capabilities. AI systems that exploit human preference shifts are likely to do so because it leads to us providing them with easier-to-satisfy reward functions. I think this is likely to result in the pursuit of easy-to-measure goals. If such behavior manifests in very powerful AI systems or an AGI, it may lead to “going out with a whimper”.

So what can be done about this?

A starting point could be the research on how to reduce preference shifts caused by recommender systems. Recent work focuses on training a model of user preference dynamics (PDM). Using this model and a safe baseline policy, a “safe shift” is defined as one that is close (according to some metric) to the shift the baseline policy induces on the PDM. During training of the recommender system, a penalty is added for recommendations that would induce an unsafe shift on the PDM (a rough code sketch of this penalty follows the list below). It seems plausible that a similar approach could help reduce preference shifts during reward learning. However, the method comes with some drawbacks:

  • It requires data for training the PDM. Depending on the task, this may be hard to come by. Alternatively, the PDM could be based on hardcoded rules, but this would likely model real human preference dynamics less accurately.

  • For many problems, it is not clear how to come up with a good baseline policy.

  • Safe shifts are defined merely in terms of distance from the shift induced by the baseline policy. This also penalizes the benign shifts mentioned above. Hence, models trained this way will be somewhat bounded by humans’ ability to predict the quality of the AI’s policies. This may be beneficial from a safety point of view but certainly adds an alignment tax.
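Here is the promised minimal sketch of how such a shift penalty could be added to a recommender’s training objective, assuming a preference dynamics model (a toy one here, a learned one in practice) and the recommendations a safe baseline policy would have made; all names, shapes, and the Euclidean distance metric are placeholder assumptions.

```python
import torch

def toy_pdm(pref_state: torch.Tensor, recs: torch.Tensor) -> torch.Tensor:
    """Stand-in preference dynamics model: nudges the preference state towards
    the mean of the shown recommendations. A learned PDM would replace this."""
    return pref_state + 0.1 * recs.mean(dim=1)

def shift_penalty(pdm, pref_state, agent_recs, baseline_recs, weight: float = 1.0):
    """Penalize recommendations whose predicted preference shift strays far from
    the shift the safe baseline policy would have induced on the PDM."""
    shifted_by_agent = pdm(pref_state, agent_recs)
    shifted_by_baseline = pdm(pref_state, baseline_recs)
    # "Safe shift" = close to the baseline-induced shift; Euclidean distance
    # is just one possible choice of metric.
    distance = torch.norm(shifted_by_agent - shifted_by_baseline, dim=-1)
    return weight * distance.mean()

# Hypothetical shapes: a batch of 8 users with 16-dimensional preference states,
# each shown 5 recommendation embeddings.
pref_state = torch.randn(8, 16)
agent_recs = torch.randn(8, 5, 16)
baseline_recs = torch.randn(8, 5, 16)

penalty = shift_penalty(toy_pdm, pref_state, agent_recs, baseline_recs)
# During training: total_loss = task_loss + penalty
```

The same structure would carry over to reward learning if the data the AI shows the human takes the place of the recommendations.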

As an alternative to finding a baseline, one could have a human evaluate the PDM’s shift. Doing this for preferences about complex topics will likely require inspection methods and may not be feasible with current technology. Evaluating a PDM’s shifts is safer than directly evaluating the AI’s data because the human is not exposed to the data that causes the preference shift, only to the shift’s outcome. If a PDM has become radicalized due to extremist media provided by the AI, the human could evaluate this shift as unsafe without consuming the media themselves. Having a human evaluate the PDM will also significantly increase training time. This may not matter much for Reward Learning from Human Preferences, where training alternates between running multiple episodes and updating the reward model. For methods that require more frequent interaction, such as CIRL, it may not be feasible.
