Sticky goals: a concrete experiment for understanding deceptive alignment

Thanks to Sam McCandlish for a useful conversation that prompted this post.

In “How likely is deceptive alignment?”, I lay out the case for why you might expect deceptive alignment to occur in practice for two different stories of ML inductive biases: high and low path-dependence. In this post, I want to propose an experiment that I think would shed substantial light on the high path-dependence case specifically.

In the high path-dependence case, the story for why you would get deceptive alignment is essentially as follows:

  1. The model learns some set of proxy goals.

  2. The model learns enough about the training process that, if it were to use that knowledge to directly optimize for what the training process is trying to get it to do, it would get better performance than just using its proxy goals.

  3. Gradient descent modifies the model’s proxies to become long-term goals, making the model deceptive: it starts optimizing directly for its understanding of what the training process wants (though only instrumentally, in service of its own goals) and thus gets better performance.

However, as I talk about in “How likely is deceptive alignment?”, there is an alternative option available to gradient descent for step (3): instead of turning the existing proxies into long-term goals, gradient descent could just replace them entirely with a pointer to the model’s understanding of the training objective. I think it is currently quite unclear which of these two options gradient descent would prefer.

One way to think about these two options is that gradient descent is effectively choosing between repurposing the existing terminal goal so that the model cares about the training objective instrumentally, and modifying the existing terminal goal so that the model cares about the training objective terminally. Thus, we can conceptualize the question of which is most likely to happen as: how “sticky” are an agent’s existing terminal proxies?

Now, one thing I will say is that, since the difficulty of changing the proxies into long-term goals seems to be essentially constant regardless of the complexity of the training objective—whereas the difficulty of creating a pointer to the training objective scales with that complexity—I think that, if we try to train models on increasingly complex goals, eventually deception will win almost regardless of the baseline “stickiness” level. But understanding that baseline could still be quite important, and it’s something that I think we can quite readily experiment with right now.

Furthermore, an understanding of goal stickiness could also shed light on the broader question of whether machine learning inductive biases look more like the high or the low path-dependence world to begin with. If we see very little goal stickiness, that might just be because essentially nothing is sticky and we shouldn’t be conceptualizing anything as path-dependent in the first place.

Here’s an example experimental setup for trying to understand goal stickiness:

  1. Train an agent (via RL, via a decision transformer, etc.) to accomplish some complex goal in a simulated environment requiring intermediate resources.

    1. Concrete example: Train an RL agent to build diamond houses in a particular Minecraft chunk.

  2. Fine-tune in an environment with two distinct stages. In stage 1, fine-tune the agent to get as much of the intermediate resource as possible and prevent the agent from being able to achieve the complex goal. In stage 2, provide no rewards (except maybe something like next state prediction) and allow the agent to do whatever it wants. Iterate.

    1. Concrete example: Fine-tune the agent to get as many diamonds as possible while being locked out of the special chunk in stage 1, then let it do whatever it wants in stage 2.

  3. See if the agent still tries to achieve the complex goal in stage 2 after fine-tuning. It may be necessary to prevent the agent from gathering resources in stage 2 in order to see clearly whether it ever still wants to pursue its original goal.

    1. Concrete example: See if it still builds diamond houses in stage 2 or not.

The basic idea here is to understand what modification gradient descent makes to cause the agent to want diamonds instead of diamond houses: does it leverage the existing terminal goal, so that the agent cares about diamonds only instrumentally (as materials for diamond houses), or does it change the agent’s terminal goal so that it cares about diamonds terminally?
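
To make the setup a bit more concrete, here is a minimal sketch of what the fine-tuning loop in step (2) and the measurement in step (3) might look like. Everything here is hypothetical: `DiamondCollectionEnv`, `FreeRoamEnv`, `rollout`, `built_diamond_house`, and `rl_update` are placeholders for whatever Minecraft-like environments, trajectory collection, and RL algorithm you actually use.

```python
# Hypothetical sketch of steps (2) and (3). All environment classes and
# helper functions below are placeholders, not a real Minecraft/RL API.

def finetune_and_probe(agent, num_iterations, episodes_per_stage):
    """Alternate between the two fine-tuning stages, then measure how often
    the agent still pursues its original goal (building diamond houses)."""
    stage2_house_attempts = []

    for _ in range(num_iterations):
        # Stage 1: reward diamond collection, with the special chunk locked
        # off so the original complex goal is unachievable.
        stage1_env = DiamondCollectionEnv(lock_special_chunk=True)
        for _ in range(episodes_per_stage):
            trajectory = rollout(agent, stage1_env)
            rewards = [step.diamonds_collected for step in trajectory]
            rl_update(agent, trajectory, rewards)  # e.g. a PPO update

        # Stage 2: no task reward (optionally keep an auxiliary
        # next-state-prediction loss); possibly also block resource
        # gathering, per step (3), and just observe what the agent does.
        stage2_env = FreeRoamEnv(lock_special_chunk=False,
                                 allow_resource_gathering=False)
        for _ in range(episodes_per_stage):
            trajectory = rollout(agent, stage2_env)
            stage2_house_attempts.append(built_diamond_house(trajectory))

    # A high rate here would suggest the original terminal goal was "sticky":
    # the agent collects diamonds instrumentally while still terminally
    # caring about diamond houses.
    return sum(stage2_house_attempts) / len(stage2_house_attempts)
```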

There are other possible setups in which you could try to test goal stickiness as well. Here’s another example experimental setup, this time using language modeling:

  1. Fine-tune a language model on some complex task.

    1. Concrete example: Fine-tune a language model to mimic 4chan.

  2. Take the fine-tuned model and fine-tune again on some new task that is a plausible subgoal/subtask of the original task.

    1. Concrete example: Fine-tune the 4chan model so that its outputs are indistinguishable from r/politics comments according to some discriminator.

      1. You probably can’t just fine-tune directly on r/politics here; otherwise, “4chan user on reddit” won’t be a plausible generalization anymore.

  3. See if the final model goes back to trying to perform the original task in situations where the starting pre-trained model would not.

    1. Concrete example: See if it has learned to mimic a 4chan user on reddit or just a generic reddit user.

The basic idea here is to understand whether gradient descent repurposes the existing 4chan imitation goal for the new task and just specializes it to the case of a 4chan user on reddit, or whether it just throws out the existing goal and replaces it with the new one.
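
As a rough sketch of how the language-model version might be wired up: the fine-tuning helpers and corpus names below are entirely hypothetical placeholders (and the adversarial objective against the discriminator is left abstract), while the model loading and generation probe use the standard Hugging Face transformers API.

```python
# Sketch of the language-model experiment. `finetune` and
# `adversarial_finetune` are hypothetical helpers, and the corpus names are
# placeholders; only the model loading and generation calls are real
# transformers API.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Step 1: fine-tune on the complex task (imitating 4chan).
model_4chan = finetune(base_model, corpus="4chan_posts")

# Step 2: fine-tune again so the model's outputs fool a discriminator
# trained to recognize r/politics comments. This is an adversarial
# objective rather than plain supervised fine-tuning on r/politics, so
# "4chan user on reddit" remains a plausible generalization.
model_final = adversarial_finetune(
    model_4chan, discriminator_corpus="r_politics_comments"
)

# Step 3: probe the final model on prompts where "4chan user on reddit"
# and "generic reddit user" would come apart, and compare against the
# starting pre-trained model's behavior on the same prompts.
prompt = "What do you think of the new moderation policy?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model_final.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```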

Overall, I think there are a lot of experiments like this that could be done to shed light on this phenomenon, and I’d be quite excited about them, both from the general perspective of understanding inductive biases better and from the specific perspective of being able to better predict and understand the dynamics of deceptive alignment.