Locality of goals
Studying goal-directedness produces two kinds of questions: questions about goals, and questions about being directed towards a goal. Most of my previous posts focused on the second kind; this one shifts to the first kind.
Assume some goal-directed system with a known goal. The nature of this goal will influence which safety issues the system might have. If the goal focuses on the input, the system might wirehead itself and/or game its specification. On the other hand, if the goal lies firmly in the environment, the system might have convergent instrumental subgoals and/or destroy any unspecified value.
Locality aims at capturing this distinction.
Intuitively, the locality of the system’s goal captures how far away from the system one must look to check the accomplishment of the goal.
Let’s give some examples:
The goal of “My sensor reaches the number 23” is very local, probably maximally local.
The goal of “Maintain the temperature of the room at 23 °C” is less local, but still focused on a close neighborhood of the system.
The goal of “No death from cancer in the whole world” is even less local.
Locality isn’t about how the system extracts a model of the world from its input, but about whether and how much it cares about the world beyond it.
This intuition about locality came from the collision of two different classifications of goals: the first from Daniel Dennett and the second from Evan Hubinger.
Thermostats and Goals
In “The Intentional Stance”, Dennett explains, extends and defends… the intentional stance. One point he discusses is his liberalism: he is completely comfortable with admitting ridiculously simple systems like thermostats into the club of intentional systems, granting them meaningful mental states such as beliefs, desires and goals.
Lest we readers feel insulted at the comparison, Dennett nonetheless admits that the goals of a thermostat differ from ours.
Going along with the gag, we might agree to grant [the thermostat] the capacity for about half a dozen different beliefs and fewer desires—it can believe the room is too cold or too hot, that the boiler is on or off, and that if it wants the room warmer it should turn on the boiler, and so forth. But surely this is imputing too much to the thermostat; it has no concept of heat or of a boiler, for instance. So suppose we de-interpret its beliefs and desires: it can believe the A is too F or G, and if it wants the A to be more F it should do K, and so forth. After all, by attaching the thermostatic control mechanism to different input and output devices, it could be made to regulate the amount of water in a tank, or the speed of a train, for instance.
The goals and beliefs of a thermostat are thus not about heat and the room it is in, as our anthropomorphic bias might suggest, but about the binary state of its sensor.
Now, if the thermostat had more information about the world (a camera, GPS position, general reasoning ability to infer information about the actual temperature from all its inputs), then Dennett argues its beliefs and goals would be much more related to heat in the room.
The more of this we add, the less amenable our device becomes to serving as the control structure of anything other than a room-temperature maintenance system. A more formal way of saying this is that the class of indistinguishably satisfactory models of the formal system embodied in its internal states gets smaller and smaller as we add such complexities; the more we add, the richer or more demanding or specific the semantics of the system, until eventually we reach systems for which a unique semantic interpretation is practically (but never in principle) dictated (cf. Hayes 1979). At that point we say this device (or animal or person) has beliefs about heat and about this very room, and so forth, not only because of the system’s actual location in, and operations on, the world, but because we cannot imagine another niche in which it could be placed where it would work.
Humans, Dennett argues, are more like this enhanced thermostat, in that our beliefs and goals intertwine with the state of the world. Put differently, when the world around us changes, it will almost always influence our mental states, whereas a basic thermostat might react the exact same way in vastly different environments.
But as systems become perceptually richer and behaviorally more versatile, it becomes harder and harder to make substitutions in the actual links of the system to the world without changing the organization of the system itself. If you change its environment, it will notice, in effect, and make a change in its internal state in response. There comes to be a two-way constraint of growing specificity between the device and the environment. Fix the device in any one state and it demands a very specific environment in which to operate properly (you can no longer switch it easily from regulating temperature to regulating speed or anything else); but at the same time, if you do not fix the state it is in, but just plonk it down in a changed environment, its sensory attachments will be sensitive and discriminative enough to respond appropriately to the change, driving the system into a new state, in which it will operate effectively in the new environment.
Part of this distinction between goals comes from generalization, a property considered necessary for goal-directedness since Rohin’s initial post on the subject. But the two goals also differ in their “groundedness”: the thermostat’s goal lies completely in its sensors’ inputs, whereas the goals of humans depend on things farther away, on the environment itself.
That is, these two goals have different locality.
Goals Across Cartesian Boundaries
The other classification of goals comes from Evan Hubinger, in a personal discussion. Assuming a Cartesian Boundary outlining the system and its inputs and outputs, goals can be functions of:
The environment. This includes most human goals, since we tend to refuse wireheading. Hence the goal depends on something other than our brain state.
The input. A typical goal as a function of the input is the one ascribed to the simple thermostat: maintaining the number given by its sensor above some threshold. If we look at the thermostat without assuming that its goal is a proxy for something else, then this system would happily wirehead itself, as the goal IS the input.
The output. This one is a bit weirder, but captures goals about actions: for example, the goal of twitching. If there is a robot that only twitches, not even trying to keep twitching, just twitching, its goal seems about its output only.
The internals. Lastly, goals can depend on what happens inside the system. For example, a very depressed person might have the goal of “Feeling good”. If that is the only thing that matters, then it is a goal about their internal state, and nothing else.
Of course, many goals are functions of multiple parts of this quartet. Yet separating them allows a characterization of a given goal through their proportions.
Going back to Dennett’s example, the basic thermostat’s goal is a function of its input, while human goals tend to be functions of the environment. And once again, an important aspect of the difference appears to lie in how far from the system the information relevant to the goal lies: locality.
What Is Locality Anyway?
Assuming some model of the world (possibly a causal DAG) containing the system, the locality of the goal is inversely proportional to the minimum radius of a ball, centered at the system, which suffices to evaluate the goal. Basically, one needs to look a certain distance away to check whether one’s goal is accomplished; locality is a measure of this distance. The more local a goal, the less grounded it is in the environment, and the more susceptible it is to wireheading or to a change of environment without a change of internal state.
Running with this attempt at formalization, a couple of interesting points follow:
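To make the ball-radius picture concrete, here is a minimal sketch in Python. It is only one possible reading, and several things in it are assumptions of mine rather than part of the proposal: the world model is a small hand-written graph, edges are treated as undirected, distance is counted in hops, the graph is assumed connected, and "inversely proportional" is instantiated as 1 / (1 + radius) so that a radius of zero stays finite. The names `toy_world`, `goal_radius` and `locality` are hypothetical.

```python
from collections import deque


def hop_distances_from(graph, source):
    """Breadth-first hop distances from `source` in an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def goal_radius(graph, system, goal_variables):
    """Smallest radius of a ball centered at `system` that contains
    every variable the goal depends on (assumes all are reachable)."""
    dist = hop_distances_from(graph, system)
    return max(dist[v] for v in goal_variables)


def locality(graph, system, goal_variables):
    """Illustrative choice only: 1 / (1 + radius)."""
    return 1.0 / (1 + goal_radius(graph, system, goal_variables))


# Hypothetical toy world: a chain leading outward from the sensor.
toy_world = {
    "sensor": ["room"],
    "room": ["sensor", "building"],
    "building": ["room", "city"],
    "city": ["building", "world"],
    "world": ["city"],
}

for name, variables in [
    ("sensor reads 23", {"sensor"}),
    ("room stays at 23 °C", {"room"}),
    ("no cancer deaths anywhere", {"world"}),
]:
    print(name, goal_radius(toy_world, "sensor", variables),
          locality(toy_world, "sensor", variables))
```

In this toy world the three example goals from the introduction get radii 0, 1 and 4, so their locality decreases in the order the intuition suggests. The exact numbers depend entirely on how the graph is drawn, which is why only the ordering should be read as meaningful.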
If the model of the world includes time, then locality also captures how far in the future and in the past one must go to evaluate the goal. This is basically the short-sightedness of a goal, as exemplified by variants of twitching robots: the robot that simply twitches; the one that wants to maximize its twitching in the next second; the one that wants to maximize its twitching in the next 2 seconds; and so on, up to the robot that wants to maximize the time it twitches in the future.
Despite the previous point, locality differs from the short term/long term split. An example of a short-term goal (or one-shot goal) is wanting an ice cream: after its accomplishment, the goal simply dissolves. An example of a long-term goal (or continuous goal), by contrast, is to bring about and maintain world peace, something that is never over but instead constrains the shape of the whole future. Short-sightedness differs from short-term, as a short-sighted goal can be long-term: “for all times t (in hours to simplify), I need to eat an ice cream in the interval [t-4,t+4]”.
Where we put the center of the ball inside the system is probably irrelevant, as the classes of locality should matter more than the exact distance.
An alternative definition would be to allow the center of the ball to be anywhere in the world, and make locality inversely proportional to the sum of the distance of the center to the system plus the radius. This captures goals that do not depend on the state of the system, but would give numbers similar to those of the initial definition.
In summary, locality is a measure of the distance at which information about the world matters for a system’s goal. It appears in various guises in different classifications of goals, and underlies multiple safety issues. What I give is far from a formalization; it is instead a first exploration of the concept, with open directions to boot. Yet I believe that the concept can be put into more formal terms, and that such a measure of locality captures a fundamental aspect of goal-directedness.
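The alternative definition can be sketched in the same toy style, to see how close its numbers are to those of the system-centered ball. The sketch assumes one particular reading that the text does not pin down: the center is the node whose ball around the goal's variables is smallest (ties broken toward the system), and the score is that radius plus the hop distance from the center to the system. The graph, the hop metric, and the names `radius_from_system` and `offset_plus_radius` are all hypothetical.

```python
from collections import deque


def hop_distances(graph, source):
    """Breadth-first hop distances from `source` in an adjacency dict."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def radius_from_system(graph, system, goal_variables):
    """Initial definition: radius of the ball centered at the system."""
    dist = hop_distances(graph, system)
    return max(dist[v] for v in goal_variables)


def offset_plus_radius(graph, system, goal_variables):
    """Alternative definition under the reading stated above:
    (distance from system to center) + (radius around that center)."""
    from_system = hop_distances(graph, system)
    best = None
    for center in graph:
        from_center = hop_distances(graph, center)
        radius = max(from_center[v] for v in goal_variables)
        # Prefer the smallest radius, then the center closest to the system.
        candidate = (radius, from_system[center])
        if best is None or candidate < best:
            best = candidate
    radius, offset = best
    return offset + radius


# Hypothetical toy world: a chain leading outward from the sensor.
chain = {
    "sensor": ["room"],
    "room": ["sensor", "building"],
    "building": ["room", "city"],
    "city": ["building", "world"],
    "world": ["city"],
}

goal = {"city", "world"}
print(radius_from_system(chain, "sensor", goal))
print(offset_plus_radius(chain, "sensor", goal))
```

On this chain the two measures coincide for the goal shown. More generally, under the reading assumed here and with a hop metric on a connected undirected graph, the triangle inequality keeps the alternative score between R and 3R, where R is the system-centered radius, which is one way to cash out "similar numbers". A different way of choosing the center could behave differently.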
Thanks to Victoria Krakovna, Evan Hubinger and Michele Campolo for discussions on this idea.