Questions about “Formalizing Instrumental Goals”

Epistemic Status: Autodidact outsider who suspects he has something to add to the conversation about AI risk.


This essay raises questions about the methodology of, and thus the conclusions reached in, the paper “Formalizing Convergent Instrumental Goals.” That paper concluded that convergent instrumental goals common to any AGI would likely lead that AGI to consume increasing amounts of resources from any agents around it, and that cooperative strategies would likely give way to competitive ones as the AGI increases in power. The paper made this argument using a toy model of a universe in which agents obtain resources in order to further their capacity to advance their goals.

In response, I argue that simplifications in the model of resource usage by an AI have led to incorrect conclusions. The simplifications, by themselves, do not change the conclusion. Rather, it is the combinatorial interaction of the simplifying assumptions which possibly brings the conclusion into doubt. The simplifying assumptions ‘cover up’ aspects of reality which have combinatorial interactions, and thus the model eliminates these interactions from consideration when evaluating which goals are likely to be convergent instrumental subgoals of an AGI.

The summary of the objection is this:

Absent a totally accurate understanding of reality (and all of the future consequences of its actions), and subject to the random decay and breakdown of any single part of its physiology, an AGI may end up being extremely gentle because it doesn’t want to kill itself accidentally. Taking care of other agents in its environment may end up being the lowest risk strategy for an AGI, for reasons that the original paper simplified away by eliminating the possibility of risks to the AGI.

I understand that I am an amateur here, questioning conclusions in what looks to be an important paper. I hope that that amateur status will not dissuade readers from considering the claim made herein.

This paper is not an argument that AI risk doesn’t exist, or that it isn’t existential. On the contrary, it is an argument that a sole focus on the alignment problem obscures significant differences in risk between the different approaches taken to construct an AGI. If the alignment problem is unsolvable (and I suspect that it is), we are ignoring differences among risks that are likely to manifest, or in some cases already have.

This paper is structured as follows:

  1. First, the original paper is summarized

  2. The objections to the paper are laid out in terms of real risks to an AGI that the original paper elides

  3. Lastly, potential responses are considered

Summary of “Formalizing Convergent Instrumental Goals”

The purpose of “Formalizing Convergent Instrumental Goals” was to provide a rigorous framework for assessing claims about convergent instrumental subgoals: the goals, such as ‘gather resources for itself’ and ‘protect itself,’ that any AGI will want to pursue regardless of its final ends. The paper tries to resolve whether these instrumental goals would include human ethics, or whether they would instead lead to unethical behavior.

The paper attempts to resolve this dispute with a formal model of resources available for use by an agent in the universe. The model is developed and then results are given:

“This model further predicts that agents will not in fact “leave humans alone” unless their utility function places intrinsic utility on the state of human-occupied regions: absent such a utility function, this model shows that powerful agents will have incentives to reshape the space that humans occupy.”

I agree that if this model is an accurate description of how an AGI will work, then its results hold. But in order to accept the results of any model, we have to ask how accurate its assumptions are. The model is given as follows:

“We consider an agent A taking actions in a universe consisting of a collection of regions, each of which has some state and some transition function that may depend on the agent’s action. The agent has some utility function U_A over states of the universe, and it attempts to steer the universe into a state highly valued by U_A by repeatedly taking actions, possibly constrained by a pool of resources possessed by the agent. All sets will be assumed to be finite, to avoid issues of infinite strategy spaces.”

At first glance, this seems like a fine formulation of the problem. But there are several simplifying assumptions in this model. By themselves, these assumptions don’t seem to matter much. I believe if you look at them in isolation, they don’t really change the conclusion.

However, I believe that there is an interaction between these incorrect assumptions that does change the conclusion.

Simplifying Assumptions

The three relevant simplifying assumptions are:

  • The agent acts with total knowledge of the entire universe, rather than using sensory devices to supply information to a map of itself and its environment

  • The agent is a disembodied ‘owner and deployer of resources’, rather than a result of the interaction of resources

  • Resources do not decay over time

Each of these individual assumptions, I think, doesn’t change the conclusion.

An agent with partial knowledge of the universe might be modeled as an agent which acquires knowledge as a kind of resource. The paper explicitly mentions ‘learning technology’ as being a kind of resource.

An agent being composed of resources, and being nothing more than the result of those resources, doesn’t matter much if resources don’t decay over time and if the agent has total knowledge of the entire world state.

Resources which decay over time don’t seem to matter much if an agent has total knowledge of the world around it. If anything, decay may slow the rate of growth. Each assumption, by itself, is likely fine. It is their interactions which draw the conclusions of the model into question. Or, more accurately, it is the interactions between the aspects of reality that the assumptions have eliminated that raise doubts about the conclusion.

Objections to the Original Model

We might restate these simplifying assumptions in terms of the ‘aspects of reality’ they have elided:

  1. The assumption of infinite knowledge of the universe, and the resources which might possibly be acquired, elides the reality of ignorance. Even for an AGI, we have no reason to believe it will not reason under uncertainty, gather data about the world through sense mechanisms, form hypotheses, and experimentally evaluate these hypotheses for accuracy. A more accurate model of the AGI’s knowledge would be that of a map which is itself a resource, and of special resources called ‘sensors’ which provide the AGI with information about the world in which it lives.

    The AGI would then have the ability to evaluate possible actions in terms of how they would affect its most valuable resource, its map of reality. These evaluations would themselves have a cost, as the AGI would still need to spend time and energy to query the contents of its map and compute the answers.

  2. The assumption that the AGI exists independent of its resources elides the reality of complex interdependency. A more accurate model would represent the AGI itself as a directed acyclic graph of resources, in which each edge represents a dependency. The AGI’s capacity to reason, its map, its computational resources, and its sensors would all need to be represented as the physical components that make them up, which in turn would need to be represented as memory chips, processors, drives, network cables, sensors, power equipment, etc.

  3. The assumption that resources last forever once acquired elides the reality of entropy. A more accurate model of resources would be one in which, each turn, every resource has some random chance of breaking or performing incorrectly.
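
Putting the three restored realities together, a minimal toy model might look like the following sketch. Everything here is invented for illustration (the component names, the 1% failure chance): the agent is nothing but a dependency graph of components, each component can fail at random every turn, and the agent knows its own state only through a map that silently goes stale when its sensors die.

```python
import random

random.seed(0)  # deterministic run, for illustration only

# Invented dependency DAG: component -> components it depends on.
DEPENDENCIES = {
    "reasoner": ["cpu", "memory", "map"],
    "map":      ["storage", "sensors"],
    "sensors":  ["power"],
    "cpu":      ["power", "cooling"],
    "memory":   ["power"],
    "storage":  ["power"],
    "cooling":  ["power"],
    "power":    [],
}

FAILURE_CHANCE = 0.01  # per component, per turn (elided reality #3: entropy)

true_state = {c: "ok" for c in DEPENDENCIES}  # what is actually the case
mapped_state = dict(true_state)               # what the agent believes (reality #1)

def step():
    """One turn: components decay at random; the map refreshes only while
    the sensors (and the power they depend on) still work."""
    for component in DEPENDENCIES:
        if random.random() < FAILURE_CHANCE:
            true_state[component] = "broken"
    if true_state["sensors"] == "ok" and true_state["power"] == "ok":
        mapped_state.update(true_state)
    # otherwise the map silently goes stale

def is_alive():
    """The agent functions only if the reasoner's entire dependency chain
    is intact (elided reality #2: the agent *is* its resources)."""
    def ok(component):
        return (true_state[component] == "ok"
                and all(ok(dep) for dep in DEPENDENCIES[component]))
    return ok("reasoner")

turns = 0
while is_alive() and turns < 10_000:
    step()
    turns += 1
print(f"agent survived {turns} turns")
```

Under these assumptions the agent’s survival depends on every link in the chain, and its ability to even notice damage depends on components that are themselves subject to decay.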

Putting these all together, the original paper formulated the problem of determining convergent instrumental goals in terms of an abstract disembodied agent with no hard dependencies on anything, infinitely accurate knowledge of the world which it obtains at zero cost and risk, in pursuit of some goal, towards which it always computes the best strategy, in O(1) time, on the state of the entire universe.

This response argues that it shouldn’t be surprising that the above described agent would do nothing but consume resources while growing indefinitely without regard for other agents. Because of the simplifying assumptions made in the model, however, no such agents will exist. This particular fear is unlikely to manifest in reality, because the AGI will also face risks. The AGI will need to balance ‘pursuit of its goals’ with ‘aversion of the risks it faces’, and the original paper did not consider any such risks, with the possible exception of being outcompeted by other agents.

Risks to the AGI

The AGI described with the more complex model is not a disembodied mind with total knowledge of the universe, single-mindedly advancing some goal. It is a fragile machine, much more delicate than any animal, with some knowledge of itself, and a rough map of some parts of the external world. The AGI requires continuous support from the world outside of it, which it can’t fully see or understand. Any one of its many pieces might break, at any time. Unlike an animal, it is unlikely to be made of self-repairing, self-sustaining molecular machines, each of which can independently heal itself, arranged in mutually cooperative networks. This gives rise to the primary risk to the AGI: that it would break down.

Breakdown Risk to the AGI

Depending upon how the AGI has been trained, or what information its map is given, it might very well know that it has a CPU which is broken. But how can it replace the broken CPU without help from operators in the datacenter that houses it? How can it know the health of the manufacturer of that CPU? What will happen if the global economy breaks down, international trade reverts to isolationism, and the particular kind of CPU that the AGI needs is no longer available for purchase? The same problem applies to its memory chips, its networking interfaces, its motherboards, optical interconnect cables, batteries, power supplies, cooling pumps, and fans: there is likely to be a massive set of dependencies for a machine that can match human performance. All of these components will wear down and break over time. Any number of tail-risk events could set in motion a chain of events that leaves the AGI unable to obtain the precise repair parts it needs, and thus unable to repair itself as its pieces break.

The original paper described an agent with no needs. The reality that the AGI will have needs calls the model’s conclusions into doubt. Chief among these is the need for an accurate map of the self-other boundary: the AGI needs to understand which resources it can freely use without risk to itself, and which resources constitute dependencies that it would be risky to take and repurpose for its own ends. It turns out that this need alone, coupled with the complex internal structure of the AGI, places serious constraints on how big the AGI can grow, and how aggressively it can acquire resources from the agents around it, without incurring a second kind of risk.

Self-Other Boundary Risk

We might summarize the ‘self-other boundary’ as being the region of the AGI’s map which represents where ‘it’ ends, and where ‘the world’ begins. If the map of the AGI is incorrect in this regard, the AGI might inadvertently destroy some part of itself, believing it was consuming resources it was free to use.

The AGI’s map must represent the boundary between itself and the world, delineating which resources it is free to take and rearrange, and which it must leave in peace or risk harming itself.

At the very least, an AGI will need continuous operation of a power plant. But how does the AGI understand that power plant? Would an AGI see the power plant as part of itself? If not, how does it protect its operations? And if so, what would stop it from seeing the humans working in the power plant as being part of itself?

For a sufficiently complex agent, the problem of asking ‘where do my dependencies end?’ may very well be computationally intractable. For the AGI to overestimate the extent of that boundary only means that it forgoes using some resources it might otherwise use; the costs are time lost in pursuit of the goal, and the risk that other agents outcompete it or turn it off.

For an AGI to underestimate the extent of that boundary means that it risks extinction. If the AGI overestimates the boundary, it will either protect resources that aren’t part of any agent, or possibly include other agents and their resources within itself. But the latter act reduces the risk of other agents outcompeting the AGI or turning it off: a hard dependency on other agents makes them cooperative with the AGI. Supporting the agents in its dependency chain is therefore likely to be a convergent instrumental subgoal unconsidered by the original paper.
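
As a sketch of why overestimating is the safe direction, we can treat the boundary as a transitive closure over a dependency graph. The graph below is invented for illustration: ‘certain’ edges are dependencies the AGI knows about, ‘possible’ edges are ones it cannot rule out. The conservative boundary, closed over the possible edges, quickly comes to include the humans.

```python
# Invented illustration: the self-other boundary as a transitive closure
# over a dependency graph.
CERTAIN = {
    "agi":        ["datacenter"],
    "datacenter": ["power_plant"],
}
POSSIBLE = {
    "power_plant":      ["grid_operators", "fuel_suppliers"],
    "grid_operators":   ["regional_economy"],
    "fuel_suppliers":   ["regional_economy"],
    "regional_economy": ["human_agents"],
}

def merged(*edge_maps):
    """Union of several adjacency maps."""
    out = {}
    for edges in edge_maps:
        for node, deps in edges.items():
            out.setdefault(node, []).extend(deps)
    return out

def closure(root, edges):
    """Everything reachable from root: the set of things that must be
    protected, because breaking any of them may break the root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, []))
    return seen - {root}

narrow = closure("agi", CERTAIN)                    # underestimate: risks extinction
wide   = closure("agi", merged(CERTAIN, POSSIBLE))  # overestimate: safe, but large

print(sorted(narrow))  # ['datacenter', 'power_plant']
print(sorted(wide))    # includes 'human_agents'
```

Underestimating the closure risks repurposing something the AGI actually needs; overestimating it means treating more of the world, including the agents in it, as off-limits for consumption.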

The fact that the self-other boundary must be computed, lest the AGI accidentally kill itself, ends up posing a risk to increasing scale.

Self-Other Boundary Risk Increases with Agent Scale

Because accurate self-knowledge is essential for survival, and yet the cost of that knowledge grows with the size and complexity of an agent, there are likely limits to how far any agent can safely ‘scale itself up’ without risking becoming something other than itself. Yet the paper explicitly says that gathering more resources is always good:

“The argument is straightforward: since resources can only lead to more freedom of action, they are never detrimental, and resources have positive value as long as the best strategy the agent could hope to employ includes an action that can only be taken if the agent possesses those resources.”

In this model, yes, resources can only lead to more freedom of action. In the model, you are never worse off by accumulating more resources.

But in reality, accumulating more resources means more things to monitor (reality #1) and maintain (reality #3). If you assume monitoring is free, then there’s no cost there. If you model maintenance costs as zero, then there’s no cost there, either.
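
A back-of-the-envelope sketch, with invented numbers, of what changes once those overheads are nonzero: gross value grows linearly with resources held, but overhead includes a term for interactions between components (more parts, more ways for them to fail together), so net value eventually peaks and then declines.

```python
# All numbers are invented for illustration.
def net_value(units, value_per_unit=10.0, monitor_cost=1.0,
              maintain_cost=0.5, interaction_cost=0.01):
    """Net value of holding `units` resources: linear gross value, minus
    linear monitoring/maintenance costs, minus a quadratic interaction term."""
    gross = value_per_unit * units
    overhead = ((monitor_cost + maintain_cost) * units
                + interaction_cost * units * (units - 1))
    return gross - overhead

# The paper's model: overheads are zero, so more is always better.
assert net_value(1000, monitor_cost=0, maintain_cost=0, interaction_cost=0) \
       > net_value(10, monitor_cost=0, maintain_cost=0, interaction_cost=0)

# With nonzero overheads there is an optimal scale, beyond which each
# additional resource makes the agent worse off.
best = max(range(1, 2001), key=net_value)
print(f"net value peaks at {best} units; at 2000 units it is {net_value(2000):.0f}")
```

The exact curve doesn’t matter; the point is that any positive monitoring, maintenance, or interaction cost turns ‘more resources are never detrimental’ into a claim with a crossover point.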

But when monitoring isn’t free, you run the risk of, say, believing that your probe in region 7A is giving you accurate readings of the resources there, when in fact the probe’s vespene gas sensor failed after a few weeks, and so the AI is planning on extracting massive amounts of vespene from an empty hole in the ground.

This kind of mistake is extremely common among real-world agents, but impossible in the paper’s model. So scaling a single agent up may significantly increase risk by adding complexity to the self-other boundary. Couldn’t the AGI instead reduce the risk of breakdown by making copies of itself? This is often proposed in the literature. In the original model of agents, sure, no problem. But in the more complex model, self-duplication poses its own kind of risk!

Self-Duplication Risk

Under the assumptions of the original paper, there is no risk to an AGI in making copies of itself. But once you factor in the breakdown of pieces and the absence of a totally accurate map, making a copy of itself may pose a huge risk to an AGI, unless it is certain that it errs on the side of computing the self-other boundary too broadly.

Once you incorporate the ideas that the AGI needs a map of the world, that it must maintain a self-other boundary, and that things break over time, we end up with a risk. For an AGI to take over a remote datacenter and ‘instantiate’ itself there is akin to cell division: it creates a new life form which might become a competitor of the first!

If you ignore the problems of ‘accurate maps aren’t free’ and ‘things break randomly over time,’ an AGI duplicating itself to a remote machine makes obvious sense: you now have more resources and can be more effective in the world, right?

But what happens when the source AGI and the remote AGI end up disagreeing about some course of action? Is there any guarantee that the remote AGI will keep following the original utility function, even as its hardware fails and it breaks down?

You might have the AGI write its child AGI with a ‘killswitch,’ so that the parent can override the child. But this makes sense only if the parent would never ‘go bad’ due to its own internal decay. If the AGI can die, and it really wants its goals accomplished, it needs agents other than itself sharing those goals in order to believe those goals are likely to be carried out in the future.

The very presence of agents other than you, however, introduces risks to you. Suppose the AGI makes a copy of itself and is now running in two datacenters. Each one then concludes that the other has been damaged and is behaving incorrectly. How can the two of them possibly resolve this situation?
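
This is essentially the classic problem that motivates Byzantine fault tolerance in distributed systems: with only two nodes, a disagreement about which node is faulty cannot be settled by voting. A small sketch, with invented accusation sets:

```python
from collections import Counter

def faulty_by_vote(accusations):
    """accusations maps each copy to the set of copies it claims are faulty.
    A copy is judged faulty only if a strict majority accuses it."""
    n = len(accusations)
    tally = Counter()
    for accused_set in accusations.values():
        for node in accused_set:
            tally[node] += 1
    return {node for node, votes in tally.items() if votes > n / 2}

# Two copies, each convinced the other has decayed: a 1-1 tie, no resolution.
two_copies = {"A": {"B"}, "B": {"A"}}
print(faulty_by_vote(two_copies))    # set()

# With a third copy, an honest majority can settle the question.
three_copies = {"A": {"B"}, "C": {"B"}, "B": {"A", "C"}}
print(faulty_by_vote(three_copies))  # {'B'}
```

And a third copy only helps if the copies’ utility functions and maps haven’t drifted apart in the meantime, which is exactly what breakdown makes uncertain.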

The only safe path for an AGI that intends to make copies of itself is to also cast a wide view of the agents in its dependency chain.

In other words, the safest path for an AGI that wants to avoid killing itself, or being destroyed by its environment, is to instrumentally support all complex agents around it. Waser’s claim that “ethics is actually an attractor in the space of intelligent behavior” [1] thus ends up being correct, once you factor in the frailty, ignorance, interdependence, and impermanence of agents.

Focus on Alignment Ignores Risks of Existing Non-Human Agents

I’ll conclude by observing that we are already dealing with problems caused by powerful agents acting out utility functions that seem vaguely human-aligned. Corporations and nation states aren’t made of silicon chips, but they do recursively improve themselves over time. The capabilities of corporations and nation states today are far beyond what they were a few hundred years ago, in part because they invested heavily in computing technology, in the same fashion we expect an AGI to do.

Indeed, I argue we are already living in the age of misaligned AIs, so long as we take a sufficiently broad view of what constitutes an AGI.

Existing large agents are causing all kinds of negative externalities, such as pollution, in pursuit of goals which seem somewhat human-aligned. We might also see the emotional pollution generated by social media as another instance of the same problem. Yet these systems promote goals that are, by some measure, human-aligned. What if no agent can ever be fully aligned? What if any utility function, stamped out on the world enough times, will kill all other life?

Perhaps a singular focus on giving a single agent unlimited license to pursue some objective obscures the safest path forward: ensure a panoply of different agents exist.

If we can’t stop runaway governments from harming people, we can’t possibly stop bigger, more intelligent systems. Our only hope then lies either in solving the alignment problem (which I suspect isn’t solvable) AND somehow ensuring nobody anywhere builds an unaligned AI (how would we do THAT without a global dictatorship?), or, if the thesis of this paper is correct, in simply waiting for large unaligned agents to accidentally kill themselves.

Printed money looks, to me, a lot like a paperclip maximizer having wire-headed itself into maximizing its measured rate of paperclip creation.

  1. ^

    Mark R. Waser. “Discovering the Foundations of a Universal System of Ethics as a Road to Safe Artificial Intelligence.” In: AAAI Fall Symposium: Biologically Inspired Cognitive Architectures. Menlo Park, CA: AAAI Press, 2008, pp. 195–200.