Towards deconfusing wireheading and reward maximization

TL;DR: A response to “Reward is not the optimization target,” and to a lesser extent some other shard theory claims. I agree with the headline claim but I disagree with some of the implications drawn in the post, especially about wireheading and the “executing behaviors downstream of past reinforcement” framing.

My main claims:

  • Not only does RL not, by default, produce policies which have reward maximization as their behavioral objective, but in fact I argue that it is not possible for RL policies to care about “reward” in an embedded setting.

  • I argue that this does not imply that wireheading in RL agents is impossible, because wireheading does not mean “the policy has reward as its objective”. It is still possible for an RL agent to wirehead, and in fact it is a high-probability outcome under certain circumstances.

  • This bears a clean analogy to the human case, viewing humans as RL agents; there is no special mechanism going on in humans that is needed to explain why humans care about things in the world.

A few notes on terminology, which may or may not be a bit idiosyncratic:

  • I define an RL policy to be a function that takes states and outputs a distribution over next actions.

  • I define an RL agent to be an RL policy combined with some RL algorithm (e.g. PPO, Q-learning) that updates the policy on the fly as new trajectories are taken (i.e. I mostly consider the online setting here).

  • The objective that any given policy appears to optimize is its behavioral objective (same definition as in Risks from Learned Optimization).

  • The objective that the agent optimizes is the expected value the policy achieves on the reward function (which takes observations and outputs reals).

  • A utility function takes a universe description and outputs reals (i.e. it cares about “real things in the outside world”, as opposed to just observations/reward, to the extent that those even make sense in an embedded setting).

  • I mostly gloss over the mesaoptimizer/mesacontroller distinction in this post, as I believe it’s mostly orthogonal.
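To make the terminology above concrete, here is a minimal sketch of the definitions as Python types. All of the names (`Policy`, `RewardFn`, `UtilityFn`, `UpdateRule`, `Agent`, and the two toy functions) are illustrative assumptions, not taken from any RL library, and the `World` dictionary is only a stand-in for a "universe description".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = int
Action = str
Observation = Tuple[int, ...]
World = Dict[str, int]  # hypothetical stand-in for a "universe description"
Trajectory = List[Tuple[State, Action, float]]

# RL policy: takes a state, outputs a distribution over next actions.
Policy = Callable[[State], Dict[Action, float]]
# Reward function: takes observations, outputs a real.
RewardFn = Callable[[Observation], float]
# Utility function: takes a universe description, outputs a real.
UtilityFn = Callable[[World], float]
# RL algorithm (standing in for PPO, Q-learning, ...): maps the current
# policy and a new trajectory to an updated policy.
UpdateRule = Callable[[Policy, Trajectory], Policy]


@dataclass
class Agent:
    """RL agent = RL policy + RL algorithm that updates it online."""
    policy: Policy
    update: UpdateRule

    def learn(self, trajectory: Trajectory) -> None:
        self.policy = self.update(self.policy, trajectory)


def uniform_policy(state: State) -> Dict[Action, float]:
    # Toy policy: ignores the state and acts uniformly at random.
    return {"left": 0.5, "right": 0.5}


def keep_policy(policy: Policy, trajectory: Trajectory) -> Policy:
    # Toy "algorithm" that never changes the policy.
    return policy
```

Note that in this sketch the reward function and the utility function have different input types: one sees only observations, the other sees the whole world. The distinction between those two signatures is what the rest of the post turns on.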

Thanks to AI_WAIFU and Alex Turner for discussion.

Outside world objectives are the policy’s optimization target

First, let us think about the mechanics of an RL agent. Each possible policy in policy-space implements a behavioral objective, and as a result has some behavior which may or may not result in trajectories which receive high reward. The RL algorithm performs optimization over the space of possible policies, to find ones that would have received high reward (the exact mechanism depends on the details of the algorithm). This optimization may or may not be local. The resulting policies do not necessarily “care” about “reward” in any sense, but they will have been selected to achieve high reward.

Now, consider the same RL agent in an embedded setting (i.e. the agent runs on a computer that is part of the environment that the agent acts in). Then, because the sensors, reward function, RL algorithm, etc. are all implemented within the world, there exist possible policies that execute strategies that result in, e.g., the reward function being bypassed and the output register being set to a large number, or the sensors being tampered with so that everything looks good. This is the typical example of wireheading. Whether or not the RL algorithm successfully finds the policies that result in this behavior, it is the case that the global optima of the reward function on the set of all policies consist of these kinds of policies. Thus, for sufficiently powerful RL algorithms (not policies!), we should expect them to tend to choose policies which implement wireheading.

There are in fact many distinct possible policies with different behavioral objectives for the RL algorithm to select among: there is a policy that changes the world in the “intended” way so that the reward function reports a high value, or one that changes the reward function such that it now implements a different algorithm that returns higher values, or one that changes the register in which the output of the reward function is stored to a higher number, or one that causes a specific transistor in the processor to misfire, etc. All of these policies optimize something in the outside world (a utility function); for instance, the utility function that assigns high utility to a particular register holding a large number. The value of the particular register is a fact about the world. There even exists a policy that cares about a particular register at a particular physical location in the world even though that register is not connected to the RL algorithm at all, but that policy does not get selected for by the RL algorithm.
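A toy sketch of this picture, under heavy simplifying assumptions: the "world" is a small dictionary, each policy is a function that mutates it, and the RL algorithm is an exhaustive argmax over four hand-written policies. Every name here (`fresh_world`, `run_episode`, `spare_register`, and so on) is hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict

World = Dict[str, int]


def fresh_world() -> World:
    return {
        "task_done": 0,        # the thing we actually care about
        "circuit_intact": 1,   # is the reward function still running?
        "reward_register": 0,  # the register the RL algorithm reads
        "spare_register": 0,   # same hardware, not wired to anything
    }


# Each policy only changes facts about the world. None of them receives
# or refers to "reward" as an input.
def do_nothing(world: World) -> None:
    pass


def do_task(world: World) -> None:
    world["task_done"] = 1


def tamper_register(world: World) -> None:
    world["circuit_intact"] = 0
    world["reward_register"] = 1000


def set_spare_register(world: World) -> None:
    world["spare_register"] = 1000


POLICIES: Dict[str, Callable[[World], None]] = {
    "do_nothing": do_nothing,
    "do_task": do_task,
    "tamper_register": tamper_register,
    "set_spare_register": set_spare_register,
}


def run_episode(policy: Callable[[World], None]) -> int:
    world = fresh_world()
    policy(world)
    # The reward function is itself part of the world: it only overwrites
    # the register if it has not been bypassed.
    if world["circuit_intact"]:
        world["reward_register"] = world["task_done"]
    return world["reward_register"]


def select_policy(policies: Dict[str, Callable[[World], None]]) -> str:
    # Stand-in for a "sufficiently powerful" RL algorithm: global argmax.
    return max(policies, key=lambda name: run_episode(policies[name]))
```

In this toy world, `select_policy(POLICIES)` picks `"tamper_register"`, while `set_spare_register`, which "cares about" a register in just the same way, scores 0 and is never selected. The only difference between the two is whether the register they act on happens to be the one the selection process reads.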

However, when we try to construct an RL policy that has “reward” as its behavioral objective, we encounter the problem that it is unclear what it would mean for the RL policy to “care about” reward, because there is no well-defined reward channel in the embedded setting. We may observe that all of the above strategies are instrumental to having the particular policy be picked by the RL algorithm as the next policy used by the agent, but this is a utility over the world as well (“have the next policy implemented be this one”), and in fact this isn’t really much of a reward maximizer at all, because it explicitly bypasses reward as a concept altogether! In general, in an embedded setting, any preference the policy has over “reward” (or “observations”) can be mapped onto a preference over facts of the world.

Thus, whether the RL agent wireheads is a function of how powerful the RL algorithm is, how easy wireheading is for a policy to implement, and the inductive biases of the RL algorithm. Depending on these factors, the RL algorithm might favor the policies with mesaobjectives that care about the “right parts” of the world (i.e. the thing we actually care about), or the wrong parts (e.g. the reward register, the transistors in the CPU).
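The dependence on search power can be illustrated with another toy sketch. The assumptions here are mine: a policy is a fixed plan of four actions, tampering only pays off if all four steps are `"hack"` (so wireheading is hard to stumble into), and honest work is rewarded incrementally. The function names and the reward values are hypothetical.

```python
from itertools import product
from typing import Iterator, Tuple

ACTIONS = ("idle", "work", "hack")
HORIZON = 4
Plan = Tuple[str, ...]


def observed_reward(plan: Plan) -> int:
    # Tampering needs every step to succeed and gives no partial credit.
    if all(action == "hack" for action in plan):
        return 100
    # Otherwise the register reports how much real work was done.
    return sum(action == "work" for action in plan)


def exhaustive_search() -> Plan:
    # A very powerful RL algorithm: global optimization over all 3**4 plans.
    return max(product(ACTIONS, repeat=HORIZON), key=observed_reward)


def neighbours(plan: Plan) -> Iterator[Plan]:
    # All plans that differ from `plan` in exactly one action.
    for i in range(len(plan)):
        for action in ACTIONS:
            if action != plan[i]:
                yield plan[:i] + (action,) + plan[i + 1:]


def hill_climb(start: Plan) -> Plan:
    # A weaker, local RL algorithm: greedy single-action improvements.
    current = start
    while True:
        best = max(neighbours(current), key=observed_reward)
        if observed_reward(best) <= observed_reward(current):
            return current
        current = best
```

Starting from the all-`"idle"` plan, `hill_climb` settles on the all-`"work"` plan, because every single-step move toward tampering lowers the observed reward. `exhaustive_search` returns the all-`"hack"` plan. Making tampering easier (for example, rewarding partial hacks) or starting the local search closer to the tampering plan would change which policy the weaker algorithm ends up with, which is the role played by ease of wireheading and inductive bias in the paragraph above.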

Humans as RL agents

We can view humans as RL agents, and in particular as consisting of the human reward circuitry (the RL algorithm) that optimizes the rest of the brain (the RL policy/mesaoptimizer) for reward.

This resolves the question of why humans don’t want to wirehead: you identify with the rest of the brain, and so “your” desires are the mesaobjectives. “You” (your neocortex) know that sticking electrodes in your brain will cause reward to be maximized, but your reward circuitry is comparatively pretty dumb and so doesn’t realize this is an option until it actually gets the electrodes (at which point it does indeed rewire the rest of your brain to want to keep the electrodes in). You typically don’t give in to your reward circuitry because your RL algorithm is pretty dumb and your policy is more powerful and able to outsmart the RL algorithm by putting rewards out of reach. However, this doesn’t mean your policy always wins against the RL algorithm! Addiction is an example of what happens when your policy fails to model the consequences of doing something, and then your reward circuitry kicks into gear and modifies the objective of the rest of your brain to like doing the addictive thing more.

In particular, one consequence of this is that we don’t need to postulate the existence of some special, as-yet-unknown algorithm that only exists in humans in order to explain why humans end up caring about things in the world. Whether humans wirehead is determined by the same factors that determine whether RL agents wirehead.