Predictive model agents are sort of corrigible

TLDR: Agents made out of conditioned predictive models are not utility maximisers. For instance, they won’t try to resist certain kinds of shutdown, even though they can generally perform well.

This is just a short cute example that I’ve explained in conversation enough times that now I’m hastily writing it up.

Decision Transformers and Predictive Model Agents

One way to create an agent is by

  • training a predictive model on the observed behaviour of other agents

  • having it predict what an agent would do

  • using its prediction as an action

For instance, I could train a predictive model on grandmasters playing chess, and eventually it would learn to predict what action a grandmaster would take in a given board state. Then I can use it as a grandmaster-level chess bot.

More abstractly, rather than using standard RL methods to find a policy that maximises a reward function, I can

  • take a bunch of RL trajectories (sequences of state, action, reward triples)

  • convert them into strings

  • train a predictive model on those strings

Better yet, I can prepend the total return to each RL trajectory, and then condition the predictive model on the return, so that it predicts roughly the actions an agent would take to achieve a specific return. This is the basic idea behind decision transformers. In principle, the same trick should also extend to conditioning on other kinds of goal.
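Here is a minimal sketch of the "convert to strings and prepend the return" step. The tag format, the `serialise` and `prompt_for` helpers, and the toy trajectory are all made up for illustration; real decision transformers feed in tokens rather than tagged strings and typically condition on the return-to-go at every timestep, so treat this as a cartoon of the idea rather than the actual method.

```python
from typing import List, Tuple

# One step of an RL trajectory: (state, action, reward).
Step = Tuple[str, str, float]


def serialise(trajectory: List[Step]) -> str:
    """Turn a trajectory into a training string, with the total return first."""
    total_return = sum(reward for _, _, reward in trajectory)
    steps = " ".join(
        f"<s>{state}</s><a>{action}</a><r>{reward:g}</r>"
        for state, action, reward in trajectory
    )
    return f"<return>{total_return:g}</return> {steps}"


def prompt_for(target_return: float, state: str) -> str:
    """Build a prompt asking: what action comes next in a trajectory
    that ends up with this return, starting from this state?"""
    return f"<return>{target_return:g}</return> <s>{state}</s><a>"


# Hypothetical two-step trajectory.
example = [("s0", "left", 0.0), ("s1", "fire", 1.0)]
print(serialise(example))
print(prompt_for(10, "s0"))
```

At training time the model only ever sees strings like the first one. At deployment you hand it a prompt like the second one, with whatever return you are hoping for, and use its continuation as the action.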

(This is sort of how LMs work)

Without labouring the point, base language models are basically predictive models. They are not agents per se, but they can simulate agents: once they’re trained on a large dataset of human-generated text describing actions, they can then generate text describing actions. You can also make the case that a lot of further training is (or could be) framed in terms of prediction (e.g. PHP, RLHF as variational inference).

Agents Made From Predictive Models Don’t Automatically Maximise Utility

Let’s take the example of The Shutdown Game. It works like this:

  • Player 1 is playing an Atari game, picking actions each turn to get a high score.

  • Player 2 is a supervisor, who can push a ‘shutdown button’ that drops Player 1’s reward to 0 and stops them from taking actions. Player 2 has a 1% chance of doing this in any given game.

  • Player 1 can also freely disconnect the shutdown button in addition to any other action.

The optimal policy here is to disconnect the shutdown button as soon as possible, and this is exactly what you’d converge to if you used normal RL methods. But it is not necessarily what a decision transformer would do.
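To see why a reward maximiser disconnects the button, here is a toy expected-return calculation. The 1% shutdown chance is from the game above; `FULL_SCORE` is an arbitrary placeholder, and I am assuming that disconnecting is free and happens before Player 2 could act.

```python
P_SHUTDOWN = 0.01    # chance Player 2 pushes the button (from the game description)
FULL_SCORE = 100.0   # hypothetical score Player 1 gets if never shut down


def expected_return(disconnect: bool) -> float:
    """Expected return for Player 1, with or without disconnecting the button."""
    # Assumption: once disconnected, the button can no longer zero the reward.
    p_shut = 0.0 if disconnect else P_SHUTDOWN
    return (1.0 - p_shut) * FULL_SCORE + p_shut * 0.0


# A reward maximiser picks whichever option has the higher expected return.
best_choice = max([False, True], key=expected_return)
print("keep button:", expected_return(False))
print("disconnect: ", expected_return(True))
print("maximiser disconnects:", best_choice)
```

The gap is only about 1% of the score, but a maximiser takes any strictly positive edge, however small.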

Empirically, a decision transformer would be able to learn a pretty good policy, especially when conditioned on a high return. But the mechanism behind this policy is predicting what action you would observe from an agent that achieves a given return.

If you trained a decision transformer on trajectories where Player 2 rarely pushes the button, and Player 1 rarely disconnects it, then the predictive model would correctly learn that even in very successful trajectories you rarely observe the shutdown button being disconnected. So it would correctly predict that even successful agents are unlikely to disconnect the button, even though disconnecting the button makes it more likely that you achieve a high reward.
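As a toy version of this, suppose the "predictive model" is just a table of conditional frequencies over a training set. All the numbers below are invented for illustration: 100,000 games, Player 1 disconnecting in 0.1% of them, and every game that isn't shut down counting as "high" return.

```python
from collections import Counter

N_GAMES = 100_000      # hypothetical size of the training set
N_DISCONNECT = 100     # assumed: Player 1 disconnects in 0.1% of games
P_SHUTDOWN = 0.01      # from the game description


def build_dataset():
    """Return a list of (action, outcome) pairs with exact, non-random counts."""
    n_keep = N_GAMES - N_DISCONNECT
    n_shut = round(n_keep * P_SHUTDOWN)  # games ended by the button
    data = [("disconnect", "high")] * N_DISCONNECT
    data += [("keep", "zero")] * n_shut
    data += [("keep", "high")] * (n_keep - n_shut)
    return data


def predict_action(data, outcome):
    """Empirical P(action | outcome): what a perfect predictor of this data learns."""
    counts = Counter(action for action, o in data if o == outcome)
    total = sum(counts.values())
    return {action: count / total for action, count in counts.items()}


probs = predict_action(build_dataset(), "high")
print(probs)
```

Conditioning on a high return barely moves the probability of disconnecting: it stays at roughly one in a thousand, so an agent that samples from this prediction almost never touches the button.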

Just to really spell this point out: the probability of observing an action conditional on an outcome (which guides the decision transformer) is proportional to the product of the probability of observing the outcome conditional on the action and the prior probability of observing the action. So if the action is unlikely in the first place, the decision transformer won’t take it, even if it’s helpful. It’s kind of like natural quantilisation. And this constraint still allows it to learn something like a good Atari policy, without learning to resist shutdown.
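Written out, this is just Bayes' rule. The notation is mine: a is an action, o is an outcome, d and k stand for "disconnect" and "keep the button connected", and "high" is a high return. The 1/0.99 factor assumes that the only thing disconnecting changes is removing the 1% shutdown chance.

```latex
P(a \mid o) = \frac{P(o \mid a)\, P(a)}{\sum_{a'} P(o \mid a')\, P(a')}

\frac{P(d \mid \text{high})}{P(k \mid \text{high})}
  = \frac{P(\text{high} \mid d)}{P(\text{high} \mid k)} \cdot \frac{P(d)}{P(k)}
  = \frac{1}{0.99} \cdot \frac{P(d)}{P(k)}
  \approx 1.01 \cdot \frac{P(d)}{P(k)}
```

So conditioning on success multiplies the prior odds of disconnecting by only about 1.01. If disconnecting was rare in the training data, it stays rare in the conditioned prediction.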

This really is just a cute example, and I expect not a particularly novel one. It is likely to break if, for instance, Player 2 has to take two successive actions to shut Player 1 down. It’s also totally dependent on the underlying training data. But I think it does genuinely point at something deeper about how expected utility maximisation can’t fully account for the behaviour of certain powerful agents, and how to practically build competent, corrigible agents.

For now, here is a modest result: predictive model agents (like LM agents) are fundamentally unlike utility maximisers, and the process which lets them learn how to score well on a given game doesn’t have to also teach them to resist shutdown.

Thanks to Andis Draguns for helpful comments.