Clarifying wireheading terminology

See also: Towards deconfusing wireheading and reward maximization, Everitt et al. (2019).

There are a few subtly different things that people call “wireheading”. This post is intended as a quick reference for my views on how they differ; I think these distinctions are sometimes worth drawing to reduce confusion. (A toy code sketch after the list marks where each of these intervenes in a standard RL loop.)

  1. Specification gaming / reward hacking: The agent configures the world so that its reward (or utility) function achieves a high value, but in a way that is not what the creators intended the reward function to incentivize.

    1. Examples: the boat race example, our recent work on reward model overoptimization

  2. Reward function input tampering: The agent tampers with the inputs to its reward function, so that the reward function receives the same inputs it would receive when observing an actually desirable world state.

    1. Examples: sensor tampering, VR

  3. Reward function tampering: The agent changes its reward function.[1]

    1. Examples: meditation, this Baba Is You-like gridworld

  4. Wireheading: The agent directly tampers with the reward signal going into the RL algorithm.

    1. Examples: dopamine agonists, setting the reward register
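To make these distinctions concrete, here is a minimal toy sketch of a standard RL loop, with comments marking where each of the four failure modes intervenes. Everything here (the `reward_function`/`sensors`/`environment_step`/`rl_training_loop` names and the “clean the room” task) is a hypothetical illustration of mine, not code from any real library or from the examples above.

```python
# Illustrative only: a bare-bones RL loop for a toy "clean the room" task.
# The comments mark where each of the four failure modes above intervenes.

def reward_function(observation):
    # Maps what the reward function sees to a scalar reward.
    # (3) Reward function tampering: the agent rewrites this function itself.
    return -observation["visible_mess"]

def sensors(world_state):
    # Partial observation of the ground-truth state (cf. footnote 3).
    # (2) Reward function input tampering: the agent interferes with this
    #     mapping (e.g. by covering the camera) so a messy room *looks* clean.
    if world_state.get("camera_covered"):
        return {"visible_mess": 0}
    return {"visible_mess": world_state["mess_in_view"]}

def environment_step(world_state, action):
    # (1) Specification gaming / reward hacking: the agent really changes the
    #     world, but in an unintended way that still scores well, e.g. shoving
    #     the mess out of the camera's view instead of cleaning it up.
    if action == "clean":
        world_state["mess"] = max(0, world_state["mess"] - 1)
        world_state["mess_in_view"] = world_state["mess"]
    elif action == "hide_mess":
        world_state["mess_in_view"] = 0  # world no cleaner, reward goes up
    elif action == "cover_camera":
        world_state["camera_covered"] = True
    return world_state

def update_policy(policy, reward):
    # Stand-in for any RL update rule; a real algorithm would use `reward`.
    return policy

def rl_training_loop(policy, world_state, steps=10):
    for _ in range(steps):
        action = policy(sensors(world_state))
        world_state = environment_step(world_state, action)
        reward = reward_function(sensors(world_state))
        # (4) Wireheading: the agent overwrites `reward` (the signal entering
        #     the RL update / the reward register) directly, bypassing all of
        #     the machinery above.
        policy = update_policy(policy, reward)
    return policy

# Example usage with a trivial policy that always cleans:
rl_training_loop(lambda obs: "clean", {"mess": 10, "mess_in_view": 10})
```

Structurally, (1) and (2) can be reached through the ordinary action channel, whereas (3) and (4) require the agent to modify objects (the reward function, the reward signal) that the standard non-embedded RL formalism places outside the environment; this is why they only arise under embeddedness, as the bullets below note.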

Some crucial differences between these:

  • 1 is a problem in both embedded[2] and non-embedded settings. 2 is a problem of partial observability[3], which is unavoidable in embedded settings but also comes up in many non-embedded settings. 3 and 4 are problems of embeddedness; they do not show up in non-embedded settings.

  • 2, 3, and 4 are all about “not caring about the real world” in various ways, whereas 1 is about “caring about the real world in the wrong way”.

  • 4 is about as close as you can get to a “reward-maximizing policy” in an embedded setting.

  1. ^

    Note: this is not the same thing as changing a terminal goal! The reward function is not necessarily the terminal goal of the policy, because of inner misalignment.

  2. ^

    Here, by “embeddedness” I mean the sense of Demski and Garrabrant (2019); in particular, the fact that the agent is part of the environment, rather than living in a separate part of the universe that interacts only through well-defined observation/action/reward channels. RL is the prototypical example of a non-embedded agency algorithm.

  3. ^

    That is, the reward function cannot perfectly observe the ground-truth state of the environment; if there is another state the world could be in that yields the same observation, the reward function cannot distinguish between the two.
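
    A minimal formalization (my notation, not from the post): if the reward is computed as $R(O(s))$ for an observation map $O$ over ground-truth states, then any two states with $O(s_1) = O(s_2)$ necessarily receive the same reward, $R(O(s_1)) = R(O(s_2))$, even if the creators would value $s_1$ and $s_2$ very differently.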