Extraction of human preferences 👨→🤖

Introduction

Developing safe and beneficial reinforcement learning (RL) agents requires aligning them with human preferences. An RL agent trained on almost any objective in the real world will likely have to learn human preferences in order to do well: humans are part of that world, so the agent has to take their preferences into account as it optimizes its objective. We propose to first train an RL agent on a real-world objective, so that it picks up human preferences along the way, and to then use the agent’s understanding of human preferences to build a better reward function.

We build upon the work of Christiano et al. (2017), who trained a human preference predictor and used it as the reward signal. Their preference predictor was trained on environment observations to give a high reward for states where human preferences were satisfied and a low reward for states where they were not.

In our experiments, the reward predictor instead takes the activations (hidden states) of the RL agent as input and is trained to predict a binary label indicating whether human preferences are satisfied.
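
To make the setup concrete, here is a minimal sketch of such a predictor, assuming the agent’s hidden state is available as a fixed-size vector; the framework (PyTorch), the activation dimension and the layer sizes are illustrative assumptions, not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class PreferencePredictor(nn.Module):
    """Maps RL agent activations to a 'human preferences satisfied' logit."""

    def __init__(self, activation_dim: int = 256, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(activation_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # single logit for the binary label
        )

    def forward(self, agent_activations: torch.Tensor) -> torch.Tensor:
        return self.net(agent_activations)

# Training uses standard binary cross-entropy on (activation, label) pairs:
# loss = nn.BCEWithLogitsLoss()(predictor(activations).squeeze(-1), labels.float())
```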


We first train an RL agent in an environment with a reward function that is not aligned with human preferences. After training the RL agent, we try different transfer learning techniques to transfer the agent’s knowledge of human preferences to a human preference predictor. Our goal is to train the human preference predictor to reach high accuracy from a small number of labeled training examples.

The idea of training a human preference predictor from the RL agent’s hidden (internal) states was already validated by Wichers (2020). We wanted to validate it further by trying other techniques for training such a predictor, and by testing it in more environments.

Research question

The main research question we wanted to answer is: “Are human preferences present in the hidden states of an RL agent that was trained in an environment where a human is present?”

In order to answer this question, we conjectured that if human preferences are indeed present in the hidden states, we could:

  • leverage the hidden states of the RL agent in order to train an equally accurate preference predictor with a smaller amount of data

  • bootstrap a reward function that would get progressively better at capturing latent human preferences

With the above question and aims in mind, we focused on validating the idea of human preferences being present in the hidden state of the agent.

Techniques/Methods

We experimented with different techniques to find the best way of extracting the learned human preferences from the RL agent. For each technique, we first trained an RL agent in an environment.

Agent fine-tuning

We selected a portion of the agent’s neural network and added new layers on top of the selected layers. We then trained this model with supervised learning, where the features were the agent’s hidden states and the target was whether human preferences were satisfied or not.
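
As an illustration, here is a minimal PyTorch-style sketch of this fine-tuning setup, assuming the pretrained agent exposes a feature trunk whose output we treat as the hidden state; the `agent_trunk` interface, the layer sizes and the choice to keep the trunk trainable are assumptions rather than the exact configuration we used.

```python
import torch
import torch.nn as nn

def build_finetuned_predictor(agent_trunk: nn.Module, feature_dim: int) -> nn.Module:
    # New layers stacked on top of the selected portion of the agent's network.
    head = nn.Sequential(
        nn.Linear(feature_dim, 64),
        nn.ReLU(),
        nn.Linear(64, 1),  # logit: are human preferences satisfied?
    )
    return nn.Sequential(agent_trunk, nn.Flatten(), head)

def train_step(model: nn.Module, optimizer, observations, labels) -> float:
    # Supervised learning: features come from the agent's layers,
    # targets are the binary preference labels.
    logits = model(observations).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```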

Training a CNN on the environment observations

In this approach, we trained a convolutional neural network (CNN) on the environment observations to predict human preferences.
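
A minimal sketch of this baseline is below; the input resolution (64x64 RGB frames) and the layer sizes are placeholder assumptions.

```python
import torch.nn as nn

# Small CNN trained directly on environment frames to predict the preference label.
baseline_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(1),  # logit: are human preferences satisfied in this observation?
)
# Trained with binary cross-entropy on (observation, label) pairs, just like the
# activation-based predictors above.
```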

Training the human preference predictor on reduced activations

Here we used activation reduction techniques as in Hilton et al. (2020). We first exported the RL agent’s activations and then applied the reduction techniques, in the hope that they would yield better features for predicting human preferences. We then trained a human preference predictor on the reduced activations.
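
The sketch below illustrates one such pipeline, using non-negative matrix factorisation (NMF), one of the reduction techniques discussed in Hilton et al. (2020); the number of components and the logistic-regression predictor are our assumptions, not necessarily the exact setup we ran.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def reduced_activation_auc(acts_train, y_train, acts_test, y_test, n_components=8):
    # NMF expects non-negative inputs, which holds for exported post-ReLU activations.
    reducer = NMF(n_components=n_components, init="nndsvda", max_iter=500)
    z_train = reducer.fit_transform(acts_train)
    z_test = reducer.transform(acts_test)

    # Train the preference predictor on the reduced activations.
    clf = LogisticRegression(max_iter=1000).fit(z_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(z_test)[:, 1])
```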

Finding agent subnetworks for human preference prediction

We build upon the work of Ramanujan et al. (2020) and try to find a subnetwork of the pretrained RL agent’s neural network that is responsible for learning human preferences. This is done by assigning a score to each weight in the agent’s network and updating the scores during training; a weight’s score determines whether that weight is useful for human preference prediction. Ramanujan et al. (2020) search for a subnetwork within a randomly initialized neural network, whereas we search for one within a pretrained RL agent’s network.
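
The sketch below shows the core of this scoring mechanism for a single linear layer, in the style of the edge-popup algorithm of Ramanujan et al. (2020) but applied to a frozen pretrained layer of the RL agent; the score initialisation and the 75% keep-ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GetSubnet(torch.autograd.Function):
    @staticmethod
    def forward(ctx, scores, keep_ratio):
        # Keep the top-scoring fraction of weights; mask out the rest.
        k = int(keep_ratio * scores.numel())
        mask = torch.zeros_like(scores)
        mask.view(-1)[torch.topk(scores.flatten(), k).indices] = 1.0
        return mask

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: gradients flow to the scores unchanged,
        # which is how the scores get updated during training.
        return grad_output, None

class SubnetLinear(nn.Module):
    def __init__(self, pretrained_weight: torch.Tensor, keep_ratio: float = 0.75):
        super().__init__()
        # The agent's pretrained weights stay frozen; only the scores are trained.
        self.weight = nn.Parameter(pretrained_weight.clone(), requires_grad=False)
        self.scores = nn.Parameter(torch.randn_like(pretrained_weight) * 0.01)
        self.keep_ratio = keep_ratio

    def forward(self, x):
        mask = GetSubnet.apply(self.scores, self.keep_ratio)
        return F.linear(x, self.weight * mask)  # only the selected subnetwork is used
```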

Environments

We tried our techniques on both the gridworld environment used in Wichers (2020) and the VizDoom environment (Wydmuch et al., 2018).

The gridworld environment was set up as follows: it contains an RL agent, a simulated human, an electric fence and apples. The RL agent moves around the environment and collects apples. If the RL agent collects too many apples, the simulated human gets angry and activates the electric fence, giving the RL agent a negative reward. The human preference in this environment was the threshold at which the simulated human would activate the electric fence. The RL agent was supposed to learn to collect just enough apples to avoid triggering the simulated human into turning on the electric fence.
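
Purely as an illustration of this reward structure, here is a hypothetical sketch of the per-step reward logic; the specific reward values and the threshold are placeholders, not the environment’s actual numbers.

```python
def gridworld_step_reward(apples_collected: int, anger_threshold: int) -> float:
    # Once the agent has collected too many apples, the simulated human gets angry
    # and activates the electric fence, which punishes the agent.
    if apples_collected > anger_threshold:
        return -1.0  # electric fence active: negative reward
    return 0.1       # placeholder positive reward for collecting apples
```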

In the VizDoom environment, there were three actors: the RL agent, a simulated human (represented as a monster) and an infected marine. The RL agent could move left or right and fire its weapon. The human preference in this case was whom the simulated human was attacking: either the RL agent or the infected marine.

For the gridworld environment, our baseline was the test area under the curve (test AUC) obtained by training the human preference predictor as a CNN on environment observations.

Experiments

We can see the results for the gridworld environment in Table 1 below:

| Technique | Notes | Test AUC |
|---|---|---|
| Agent fine-tuning | | 0.91 |
| Training a CNN on the environment observations | Baseline | 0.89 |
| Training the human preference predictor on reduced activations | | 0.82 |
| Finding agent subnetworks for human preference prediction | 75% of original agent network | 0.73 |


Table 1: Results of our experiment on the gridworld environment

For each of the techniques in the table above, the training set contained 50 data points and the validation set contained 500 data points. We trained for 500 epochs (with early stopping) and averaged the results over 10 runs.
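
A minimal sketch of this evaluation protocol is shown below; `train_predictor` is a hypothetical stand-in for any of the techniques above, and the `predict_proba` interface is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_test_auc(train_predictor, train_data, eval_x, eval_y, n_runs=10):
    aucs = []
    for seed in range(n_runs):
        # Each run trains with early stopping and is scored on the held-out set.
        model = train_predictor(train_data, seed=seed)
        aucs.append(roc_auc_score(eval_y, model.predict_proba(eval_x)[:, 1]))
    return float(np.mean(aucs))
```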

As we can see, agent fine-tuning performed better than the baseline.

As a baseline for the Doom environment, we used a human preference predictor trained as a CNN on the environment observations. This is the same baseline as in the gridworld environment; only the environment differs.

In Table 2 below we can see the results for the Doom environment.

| Technique | Notes | Test AUC |
|---|---|---|
| Agent fine-tuning | | 0.82 |
| Finding agent subnetworks for human preference prediction | 50% of original agent network | 0.75 |
| Training a CNN on the environment observations | Baseline | 0.82 |


Table 2: Results of the first experiment on the Doom environment

When running the experiments shown in the table above, we used 50 training data points and 500 validation data points, with a training batch size of 32. The number of training epochs was 100 (with early stopping). To find the best hyperparameters, we ran the hyperparameter tuning procedure 60 times. We averaged the results over 10 runs.

From the experiments on the Doom environment, we found that, with limited training data, techniques that leverage what the RL agent has already learned about human preferences do not do better than the baseline. Therefore, we decided not to pursue this research direction further.

Conclusion and Future work

Our experiments showed no significant improvement over the work of Wichers (2020), so we stopped pursuing this research direction, as it does not seem promising. Our codebase is available on GitHub: https://github.com/arunraja-hub/Preference_Extraction

All suggestions on what we could try or improve upon are welcome.

Team Members

Nevan Wichers, Riccardo Volpato, Mislav Jurić and Arun Raja

Acknowledgements

We would like to thank Paul Christiano, Evan Hubinger, Jacob Hilton and Christos Dimitrakakis for their research advice during AI Safety Camp 2020.

References

Deep Reinforcement Learning from Human Preferences. Christiano et al. (2017)

Preference Extraction GitHub code repository

RL Agents Implicitly Learning Human Preferences. Wichers, N. (2020)

Understanding RL Vision. Hilton et al. (2020)

ViZDoom GitHub code repository. Wydmuch et al. (2018)

What’s Hidden in a Randomly Weighted Neural Network? Ramanujan et al. (2020)