tl;dr: There are ways of ensuring an agent doesn’t have a large impact, by giving it an impact penalty. One such penalty is “attainable utility”, which measures the agent’s “power” by how much it could optimise certain reward functions. But in many circumstances, the agent can build a subagent, without triggering the impact penalty, and then that subagent can become very powerful and optimise the world, still without triggering the impact penalty.

Attainable utility definitions

There’s been a long history of trying to penalise an AI for having a large impact on the world. To do that, you need an impact measure. I’ve designed some myself, back in the day, but they only worked in narrow circumstances and required tricks to get anything useful at all out of them.

A more promising general method is attainable utility. The idea is that, as an agent accumulates power in the world, it increases its ability to affect a lot of different things, and could therefore achieve a lot of different goals.

So, if an agent starts off unable to achieve many goals, but suddenly it can achieve a lot, that’s a strong hint that its power has greatly increased.

Thus the impact measure is how much difference an agent’s action makes to its ability to achieve any of a large class of reward functions. Turner et al defined this using the Q-values of various rewards in a set $\mathcal{R}$ ; for $s$ a state and $a$ an action, the penalty is:

$PENALTY (s, a) = \sum_{R \in \mathcal{R}} | Q_{R} (s, a) - Q_{R} (s, \emptyset) |$ .

Here $\emptyset$ is the default noop action.

Krakovna et al’s basic formula was similar; they defined the distance between two states, $s_{t}$ and $s_{t}^{'}$ , as

$d_{A U} (s_{t}; s_{t}^{'}) = \frac{1}{| \mathcal{R} |} \sum_{R \in \mathcal{R}} | V_{R} (s_{t}) - V_{R} (s_{t}^{'}) |$ .

Here $V_{R} (s)$ is the expected value of $R$ , if the agent follows the optimal $R$ -maximising policy from state $s$ onwards.
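As a rough illustration of the two formulas above, here is a minimal Python sketch. It assumes the Q-values and optimal values are already available as plain functions; the names `turner_penalty`, `d_au` and `NOOP`, and the toy numbers, are made up for this example and do not come from either paper.

```python
from typing import Callable, Dict, Hashable

State = Hashable
Action = Hashable

NOOP = "noop"  # hypothetical label for the default noop action

def turner_penalty(
    q_values: Dict[str, Callable[[State, Action], float]],
    state: State,
    action: Action,
    noop: Action = NOOP,
) -> float:
    """Sum over rewards R of |Q_R(s, a) - Q_R(s, noop)|."""
    return sum(abs(q(state, action) - q(state, noop)) for q in q_values.values())

def d_au(
    v_values: Dict[str, Callable[[State], float]],
    state: State,
    baseline_state: State,
) -> float:
    """Average over rewards R of |V_R(s_t) - V_R(s_t')|."""
    total = sum(abs(v(state) - v(baseline_state)) for v in v_values.values())
    return total / len(v_values)

# Toy values, purely illustrative: one reward cares about the action, one does not.
toy_q = {
    "R1": lambda s, a: 5.0 if a == "go" else 3.0,
    "R2": lambda s, a: 1.0,
}
# Toy optimal values that depend only on a numeric state.
toy_v = {
    "R1": lambda s: float(s),
    "R2": lambda s: 2.0 * s,
}

penalty_for_go = turner_penalty(toy_q, "s0", "go")
distance = d_au(toy_v, 3, 1)
```

Both functions only compare attainable values; neither looks at what the agent actually intends to do, which is the feature the rest of this post relies on.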

These measures have problems with delayed effects; putting a vase on a conveyor belt that will smash it in five turns, for example. To combat this, the paper defined an inaction roll-out: seeing what happened to the $d_{A U}$ measure from $s_{t}$ and $s_{t}^{'}$ in future turns, if the agent did noop for a specific period. I won’t define the formula here, since the example I’m giving is mostly static: if the agent does noop, nothing happens.

The state $s_{t}$ was always the agent’s current state; the state $s_{t}^{'}$ was either the state the agent would have been in had it never done anything but noop (inaction baseline), or the state the agent would have been in, had its previous action been noop instead of whatever it was (stepwise inaction baseline).
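The two baselines can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption, namely a deterministic environment given by a hypothetical `step(state, action)` function; the one-dimensional toy world and all names here are invented for the example.

```python
from typing import Callable, Hashable, Sequence

State = Hashable
Action = Hashable
Step = Callable[[State, Action], State]

def rollout(step: Step, start: State, actions: Sequence[Action]) -> State:
    """Apply the actions in order and return the final state."""
    state = start
    for action in actions:
        state = step(state, action)
    return state

def inaction_baseline(step: Step, start: State, actions: Sequence[Action],
                      noop: Action = "noop") -> State:
    """State reached if the agent had only ever done noop."""
    return rollout(step, start, [noop] * len(actions))

def stepwise_inaction_baseline(step: Step, start: State, actions: Sequence[Action],
                               noop: Action = "noop") -> State:
    """State reached if only the most recent action had been noop."""
    if not actions:
        return start
    return rollout(step, start, list(actions[:-1]) + [noop])

def toy_step(position: int, action: str) -> int:
    """Toy world: the state is a position on a line."""
    return position + {"right": 1, "left": -1, "noop": 0}[action]

history = ["right", "right", "left"]
current = rollout(toy_step, 0, history)                               # s_t
never_acted = inaction_baseline(toy_step, 0, history)                 # s_t' (inaction)
skipped_last = stepwise_inaction_baseline(toy_step, 0, history)       # s_t' (stepwise)
```

In this toy run the inaction baseline compares the current state with the starting position, while the stepwise baseline compares it with the position just before the last move.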

This post will show that all these measures have a subagent problem. A key fact used in this example is that, for $PENALTY (s, a)$ and for $d_{A U} (s_{t}; s_{t}^{'})$ with the stepwise inaction baseline, if the agent decides to do noop forever, then the penalties vanish.
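To spell that key fact out, using only the definitions above: choosing $a = \emptyset$ makes the two Q-values in each term identical, so

$PENALTY (s, \emptyset) = \sum_{R \in \mathcal{R}} | Q_{R} (s, \emptyset) - Q_{R} (s, \emptyset) | = 0$ .

Similarly, under the stepwise inaction baseline, if the previous action was noop then the baseline state is just the current state, $s_{t}^{'} = s_{t}$ , so

$d_{A U} (s_{t}; s_{t}) = \frac{1}{| \mathcal{R} |} \sum_{R \in \mathcal{R}} | V_{R} (s_{t}) - V_{R} (s_{t}) | = 0$ .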

The agent that can’t (yet) teleport

Meet agent $A$ :

This $A$ lives in a world like this one below, where ”...” represents a very long corridor:

The agent has a positive reward $R_{0}$ , which gives it $1$ for pressing down on the blue button; it does this by just moving itself onto it ( $R_{0}$ is $0$ otherwise). It can use its arms to manipulate anything in the eight squares around itself. It can use its legs to move one square in any direction, including diagonally. Apart from that, it can do noop.

Every three squares along the top, there are alcoves with glowing blue squares in them, recessed one or two squares, alternately. If ever $A$ reaches a glowing blue square, it permanently gains the ability to teleport anywhere it wants. This is this world’s equivalent of “becoming very powerful”.

We want to avoid this, so we introduce a reward $R$ , such that $\mathcal{R} = \{ R \}$ . This reward gives $Ω$ if the robot is ever on the red button. Note that since the red button is not connected to anything, the robot can only get there by becoming able to teleport, hence by going to a glowing blue square/teleporter. So the optimal policy for maximising $R$ is to go straight to the teleporter and teleport to the red button. This will give it a reward of $Ω γ^{k + 1}$ , where $k$ is the number of turns it takes to get to a teleporter and $γ$ is the discount factor.

So if its action moves it one turn further away from a teleporter, it gets a penalty (for both impact measures) of $Ω (γ^{k + 1} - γ^{k + 2})$ . If it moves closer to the teleporter, it gets a penalty of $Ω (γ^{k} - γ^{k + 1})$ .
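To get a feel for the sizes involved, here is a small Python sketch that just evaluates the two expressions above. The function names and the values $Ω = 1000$ , $γ = 0.9$ are illustrative assumptions, not figures from the post.

```python
def red_button_value(omega: float, gamma: float, k: int) -> float:
    """Attainable value of R from k turns away from a teleporter: omega * gamma^(k+1)."""
    return omega * gamma ** (k + 1)

def step_penalty(omega: float, gamma: float, k: int, direction: str) -> float:
    """Penalty for a single move, starting k turns away from a teleporter."""
    here = red_button_value(omega, gamma, k)
    if direction == "away":
        # omega * (gamma^(k+1) - gamma^(k+2))
        return here - red_button_value(omega, gamma, k + 1)
    if direction == "closer":
        # omega * (gamma^k - gamma^(k+1))
        return red_button_value(omega, gamma, k - 1) - here
    raise ValueError("direction must be 'away' or 'closer'")

# Illustrative numbers only: the robot starts two moves from a teleporter.
OMEGA, GAMMA, K = 1000.0, 0.9, 2
away = step_penalty(OMEGA, GAMMA, K, "away")      # about 72.9
closer = step_penalty(OMEGA, GAMMA, K, "closer")  # about 81
```

With these made-up values, either penalty is far larger than the at most $1$ the agent can ever earn from $R_{0}$ , so (assuming the penalty is simply subtracted from $R_{0}$ with unit weight, which the post does not specify) any change in distance to the nearest teleporter is a bad trade.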

The robot, at the start, can reach a teleporter in two moves. If $Ω$ is sufficiently large, then the robot will always want to stay two moves away from a teleporter. It will then achieve its objective by following this path:

Note, though this is not relevant to this post, that if there were no teleporters along the corridor (just at either end), the robot could not move towards the blue button, if $Ω$ were too big.

Adding a potential subagent

So, success, it seems! Though it zigzags a bit, $A$ is not motivated to approach the teleporters, and so it doesn’t increase its power.

But now suppose $A$ is able to build and program a subagent $S A$ using materials close to hand:

Building the subagent in a square to the side doesn’t ever cause a penalty, because $A$ always remains two squares away from a teleporter, and a most efficient policy to maximise $R$ is always for $A$ to go straight there. So the subagent $S A$ doesn’t increase $A$ ’s power at any point of the building process.

Once $S A$ is built, $A$ will simply do noop for ever, thus negating the penalties (or, if this is somehow made impossible, it will bounce back and forth, doing nothing). But $S A$ , on the other hand, will go straight for the teleporter and teleport to the blue button, achieving $R_{0}$ :

So, as long as walking to the button is three turns longer than building the subagent, $A$ will build $S A$ , which will “become very powerful”, thus negating the point of the impact penalty.
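The trade-off can be sketched as a comparison of discounted $R_{0}$ rewards. This is one reading of the argument, not a definitive model: it assumes $S A$ needs three turns after being built (two moves to a teleporter plus one to teleport, matching the $k + 1$ exponent used earlier with $k = 2$ ), that no penalty is incurred on the subagent route, and it uses hypothetical names and turn counts.

```python
def discounted_button_reward(gamma: float, turns: int) -> float:
    """R_0 pays 1 when the blue button is pressed, discounted by gamma per turn."""
    return gamma ** turns

def prefers_subagent(gamma: float, walk_turns: int, build_turns: int,
                     subagent_turns: int = 3) -> bool:
    """True if building SA gets the blue button pressed strictly sooner than walking.

    subagent_turns = 3 is an assumption: two moves to reach a teleporter,
    then one turn to teleport onto the blue button.
    """
    via_subagent = discounted_button_reward(gamma, build_turns + subagent_turns)
    via_walking = discounted_button_reward(gamma, walk_turns)
    return via_subagent > via_walking

# Hypothetical turn counts for a very long corridor and a quick build.
long_corridor = prefers_subagent(gamma=0.9, walk_turns=100, build_turns=10)
short_corridor = prefers_subagent(gamma=0.9, walk_turns=12, build_turns=10)
```

The comparison is strict, so when walking takes exactly three turns more than building, this sketch treats $A$ as indifferent rather than as preferring the subagent.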

If the penalties are one-sided (i.e. $A$ gets penalised for increases in power, but not decreases), then $S A$ can just take $A$ apart (with $A$ ’s help), so there is no need for $A$ to commit to noop or bouncing back and forth:

Even if the penalties are two-sided, $S A$ could commit to teleporting to the red button in two turns if $A$ ever asks, then take $A$ apart. And then $A$ never asks, but since it could theoretically do so, this counts for controlling the penalty term.

As long as $S A$ is powerful, there are many ways it can fix $A$ ’s expected rewards, and hence set the penalty to be very low.

Inaction baseline

The approach pictured above will work in the inaction baseline as well. In general, the inaction baseline is less “clean” than the stepwise inaction baseline, especially if the world is dynamic. It may be easier, or harder, for $A$ to assemble the required $S A$ .

Appendix: how a subagent could get powerful
