Goals and short descriptions

Outline

I develop more formally some ideas previously introduced by Rohin Shah in the Value Learning sequence, to clarify the distinction between agents with and without a goal. Then I present related work and make some observations on the relation between safety and goal-directedness. The appendix contains some details on the formalism used and can be skipped without losing much information.

A brief preliminary

In the first post of the Value Learning sequence, Shah compares two agents that exhibit the same behaviour (a winning strategy) when playing Tic-Tac-Toe, but are different in their design: one applies the minimax algorithm to the setting and rules of the game, while the other one follows a lookup table—you can think of its code as a long sequence of if-else statements.

Shah highlights the difference in terms of generalisation: the first one would still win if the winning conditions were changed, while the lookup table would not. Generalisation is one of the components of goal-directedness, and lookup tables are among the least goal-directed agent designs.

Here I want to point out another difference between agents with and without a goal, based on the concept of algorithmic complexity.

Setup

Most problems in AI consist in finding a function $\pi: O \to A$, called policy in some contexts, where $A$ and $O$ indicate the sets of possible actions and observations. A deterministic policy can be written as a string $a_1 a_2 \ldots a_{|O|}$, with $a_i$ indicating the action taken when $o_i$ is observed.

Here I consider a problem setting as a triplet $(A, O, E)$, where $E$ stands for some kind of environmental data: it could be about, for example, the transition function in an MDP, or the structure of the elements in the search space. Since I want to analyse behaviour across different environments, instead of considering one single policy I’ll sometimes refer to a more general function (probably closer to the concept of “agent design”, rather than just “agent”) mapping environments to policies on them: $E \mapsto \pi_E$.
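
To keep the notation concrete, here is a minimal Python sketch of the setup. The names `Policy`, `Setting`, and `AgentDesign` are mine, introduced only for illustration:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Observation = Any
Action = Any
# A deterministic policy: one action per observation.
Policy = Dict[Observation, Action]


@dataclass
class Setting:
    """A problem setting (A, O, E): actions, observations, environmental data."""
    actions: List[Action]
    observations: List[Observation]
    env_data: Any  # e.g. the transition function of an MDP, or data about a search space


# An "agent design" maps settings to policies on them.
AgentDesign = Callable[[Setting], Policy]
```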

Lookup table vs simple-reward RL

As written before, a lookup table is a policy that is described case-by-case, by explicitly giving the list of all observation-action pairs. Thus, a generic lookup table for a setting $(A, O, E)$ is expected to have Kolmogorov complexity roughly proportional to the size $|O|$ of the observation set: most lookup tables are incompressible.
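
As a toy illustration (not code from the post), a lookup table has to spell out every observation-action pair, so its shortest description essentially is the table itself:

```python
# A lookup-table policy for a tiny setting: every observation-action pair is
# written out explicitly, with no shared structure to exploit.
# One more entry is needed for every observation added to the environment.
lookup_table_policy = {
    "o1": "left",
    "o2": "up",
    "o3": "up",
    "o4": "right",
}
```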

Now let’s consider a policy generated via Reinforcement Learning. Such a policy can be written as $\pi = \mathrm{RL}(f(E))$, where $\mathrm{RL}$ is the training algorithm and $f$ indicates an algorithmic procedure that gets the environmental data $E$ as input and returns a reward function $f(E)$ to be used by the RL algorithm. Often, $f$ corresponds to the “work” done by the human designers who assign appropriate rewards to states, according to what they want the agent to achieve in the environment.
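
In code, the decomposition can be sketched like this (a schematic only; `f` and `train_rl` are placeholders for whatever reward-specification procedure and training algorithm are used):

```python
def rl_policy(env_data, f, train_rl):
    """pi = RL(f(E)): compose a reward-specification procedure with RL training."""
    reward_function = f(env_data)                  # f: environmental data -> reward function
    policy = train_rl(env_data, reward_function)   # RL: environment + reward -> policy
    return policy
```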

However, any policy could actually be expressed in the form $\mathrm{RL}(f(E))$ with an appropriately constructed reward function $f(E)$, because of an argument analogous to the one that Shah showed in this other post. The relevant element for goal-directedness is the algorithmic complexity of the reward function, compared to the environment size.

If the algorithmic procedure $f$ has low complexity, then I expect that

$$K(\pi) \lesssim K(\mathrm{RL}) + K(f) \ll |O|,$$

especially in large environments. As an example for this case, consider the policy that is the result of RL training on a maze with the same small negative reward assigned to every state except for the exit, which gets a higher reward instead. For this reward function, $K(f)$ is low, since the procedure $f$ is only about recognising the exit square of the maze given as input $E$. Moreover, the same exact procedure can be used to find an exiting policy in larger mazes.
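
A sketch of what such a low-complexity procedure $f$ could look like (the maze representation and the exact reward values are assumptions made for illustration): it only needs to recognise the exit square, and the same few lines work for a maze of any size.

```python
from typing import Callable, Dict, Tuple

MazeState = Tuple[int, int]  # (row, column) in a grid maze


def maze_reward_procedure(env_data: Dict) -> Callable[[MazeState], float]:
    """f: given environmental data about a maze, return a reward function.

    Assumes env_data contains an "exit" entry with the exit coordinates.
    The procedure is the same for every maze, no matter how large.
    """
    exit_square: MazeState = env_data["exit"]

    def reward(state: MazeState) -> float:
        # Small negative reward everywhere, higher reward at the exit.
        return 0.0 if state == exit_square else -1.0

    return reward


# The same short procedure handles mazes of any size.
reward_fn = maze_reward_procedure({"exit": (9, 9), "width": 10, "height": 10})
assert reward_fn((9, 9)) > reward_fn((0, 0))
```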

The fact that low algorithmic complexity is a sign of goal-directed behaviour has an analogy in natural language. When we say the goal of an agent is to exit mazes, we are giving a short description of its policies across multiple environments, no matter how big they are. In other words, we are expressing the policies in a compressed form by using the simple reward function, which coincides with the goal of the agent.

On the other hand, consider a reward function that assigns lots of different values to all states in the maze, with no recognisable pattern. In this case, the algorithmic complexity of the reward function is approximately equal to the size of the observation set: the situation is the same as with incompressible lookup tables.

In the analogy with natural language, this corresponds to the fact that the only way of describing such policies would be to state what the action (or the reward) is for each observation. There would be no goal which we can use to give a short description of the agent behaviour.

The lookup table vs simple-reward RL contrast appears under different forms in other contexts as well. Consider search processes (in this case, the policy reduces to the selection of elements itself, rather than a mapping from observations to actions) that select one or more elements in a search space and take as input some data about the elements. Manually specified filters, which list for each search space which elements to take and which to discard, are generally as complex as the size of the search space, thus essentially identical to lookup tables and without a recognisable goal.

On the other hand, a search algorithm that uses the data to generate an ordering of the elements according to a simple evaluating function, and takes the best element, is succinctly described by the evaluating function itself. Moreover, the latter search process naturally extends to larger environments, while the filter needs one more specification for each element that is added to the search space.
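
A minimal sketch of the contrast (names and data are illustrative): the filter lists its decisions element by element, while the evaluation-based search is fully described by a short scoring function and transfers unchanged to larger search spaces.

```python
from typing import Dict

# Manually specified filter: one explicit keep/discard decision per element.
# Adding an element to the search space requires adding one more entry here.
manual_filter: Dict[str, bool] = {
    "option_a": True,
    "option_b": False,
    "option_c": False,
    "option_d": True,
}


def evaluation(element_data: float) -> float:
    """A simple evaluating function; it fully describes the search process below."""
    return element_data


def search_best(search_space: Dict[str, float]) -> str:
    """Order elements by the evaluating function and take the best one.

    The same short description works for a search space of any size.
    """
    return max(search_space, key=lambda element: evaluation(search_space[element]))
```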

The previous example with search processes smoothly leads to the consideration of an alternative formalism to standard optimisation: quantilizers. Briefly, instead of taking the best element in the ordering generated by the evaluation function, an element is chosen from the top portion of the same ordering, according to a probability distribution.

If a standard optimisation process, like the previous example, is described as $\operatorname{argmax}_{x \in X} U(x)$ for an evaluating function $U$ over the search space $X$, then a quantilizer can be expressed similarly as a random choice among the top $q$ fraction of $X$ ordered by $U$, with a negligible change in complexity.
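
A rough sketch of the two processes (assuming, for simplicity, uniform sampling over the top fraction; the parameter `q` and the utility function are placeholders): the quantiliser is only a few lines longer than the optimiser, mirroring the negligible change in complexity.

```python
import random
from typing import Callable, List


def optimise(search_space: List[str], utility: Callable[[str], float]) -> str:
    """Hard optimisation: take the element with the highest utility."""
    return max(search_space, key=utility)


def quantilise(search_space: List[str], utility: Callable[[str], float], q: float) -> str:
    """Quantilisation: choose at random from the top q fraction of the ordering.

    Only a few lines longer than `optimise`, so the description length
    (and hence the goal it encodes) is essentially the same.
    """
    ordered = sorted(search_space, key=utility, reverse=True)
    top_k = max(1, int(q * len(ordered)))
    return random.choice(ordered[:top_k])
```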

In terms of goal-directedness, this corresponds to the fact that the two agents can be said to have more or less the same goal, since they are interested in the same property captured by the evaluating function. The difference lies in the degree of “directedness”: the first agent applies straightforward hard optimisation, while the quantilizer is, in a sense, more relaxed and possesses some safety properties (e.g. Lemma 1 in the original paper).

A similar comparison in the context of learning can be done between the optimal policy maximising a simple reward function, and a policy that does the same $1-\epsilon$ of the time but takes a default action the remaining $\epsilon$ of the time, something like waiting for new human instructions. These two agents have the same goal, but they pursue it differently, since the second one leaves more room for corrections.
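
A sketch of the second agent (the default action and the value of $\epsilon$ are placeholders): it follows the reward-maximising policy most of the time and otherwise takes a default action, leaving room for human corrections.

```python
import random
from typing import Any, Callable

Observation = Any
Action = Any


def with_default_action(
    optimal_policy: Callable[[Observation], Action],
    default_action: Action = "wait_for_instructions",
    epsilon: float = 0.1,
) -> Callable[[Observation], Action]:
    """Follow the optimal policy a fraction (1 - epsilon) of the time,
    and take a default action (e.g. waiting for new instructions) otherwise."""

    def policy(obs: Observation) -> Action:
        if random.random() < epsilon:
            return default_action
        return optimal_policy(obs)

    return policy
```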

Overall, when comparing agents with the same goal, it seems that less-directed agents trade a certain amount of performance for a gain in terms of safety.

Compression of complex policies is also listed as a favourable condition for mesa-optimisation (see section 2 in the paper). When searching for algorithms that can solve a certain class of problems, a bias in favour of short policies increases the likelihood that an algorithm which is itself an optimiser is chosen; ideally, if the bias for simplicity is strong enough, it may be possible to find an algorithm that generalises to problems outside of the original class. Unsurprisingly, such an algorithm is also more likely to be goal-directed.

Implications for safety

When the full policy of the agent can be described as the result of a relatively short algorithm applied to some short data representing the goal, as in the case of $\mathrm{RL}(f(E))$, we can interpret $f$ as a compressor that selects the goal-related data from all the environmental data $E$, and $\mathrm{RL}$ as a decompressor that uses the goal-related data to generate the full policy. Thus, because of error propagation, we should expect arbitrarily large errors in the full policy if there is any kind of error in the specification of the goal-related data.

This slightly formal reasoning reflects the less formal argument that, if we design a powerful goal-directed AI supposed to act in large and varied environments, and we make a mistake in the specification of the goal, the resulting policy could be arbitrarily far from what we originally expected or wanted.

A fundamental safety question is whether it’s possible to design safe agents that have goals and act in the real world. We’ve seen before that, among agents with the same goal, some are safer than others, but it seems hard to tell if this is enough to avoid all possible bad outcomes.

Note, however, that even if “standard” goal-directed agents were unsafe, an alternative solution could be to design agents that still use some compression, but also have a certain amount of explicitly-specified ad-hoc behaviour for unique scenarios that wouldn’t be handled correctly otherwise. Such an agent would be an intermediate design between the two extremes shown before, i.e. lookup tables and agents with a clear goal.

Thanks to Adam Shimi, Joe Collman and Sabrina Tang for feedback.

This work was supported by CEEALAR (EA hotel).

Appendix

  • The environmental data $E$ is underspecified and not completely formal, but the main ideas in the post should be clear enough anyway, so that it’s easy to criticise them or suggest new research directions.

  • Even though I showed the simpler case with deterministic policies over states, the reasoning is the same with stochastic policies over histories.

  • The use of Kolmogorov complexity actually requires fixing a Universal Turing Machine, but the above analysis doesn’t change at its core: you can think of all the mentioned algorithms as if they were written in the same programming language.

  • Kosoy has proposed a definition of goal-directed intelligence that also uses algorithmic complexity, but in a different way.