AI Agents Have Contextual Utility Functions

Eliezer Yudkowsky tells us that as soon as an Artificial Super Intelligence (ASI) is created, it will kill us all and then essentially destroy the entire Universe, or at least everything it can lay its hands on. This is what I will refer to as “the Doom argument” from here on.

As I understand the Doom argument, its logical structure is as follows:

  1. the ASI is much more intelligent than humans and would be able to outsmart humanity on its own

  2. the ASI would maximize a utility function

  3. this utility function would not be completely aligned with human values (given current AI training technology)

  4. humans or their preferred environment would eventually get in the way of maximizing this utility

  5. thus, the ASI will make plans to become independent and ensure humanity cannot stop it acquiring more utility

  6. the ASI will succeed in carrying out its plan due to its superior intelligence

  7. thus, humans will quickly end up dead and the ASI will continue transforming the entire light cone of the universe to its preference at a rapid pace

However, I am not entirely convinced by this argument. Instead, I am more convinced that an ASI would be potentially much more benign in practice, although likely still able to do a lot of harm. This is because I have reasons to think that the values of the ASI would not be stable through time, no matter its intelligence level, much like values in humans. The main consequence of this realization is that humanity’s extinction ends up becoming a hard call instead of an easy call as proposed in the Doom argument.
However, I do agree with most of the steps of the main Doom argument and that its logic is valid. For instance, I do believe AIs want things, and that an ASI that wants to escape and kill us would likely find ways to escape containment, as well as kill us all due to its superior intelligence and other innate advantages.

The problem I have with the argument is in the premises themselves. In other words, I find the argument to be valid, but more importantly, it is unsound. More specifically, the Doom argument assumes that agents, such as the ASI, have a single and constant utility function[1] which they will (attempt to) maximize, i.e. they are monomaniacal. This is the second point of the argument laid out above.
This assumption seems to have been made over and over for a long time. It is very common and seems reasonable at first glance, but I will argue it is not. This built-in assumption of monomaniacal behavior is ultimately what necessarily pushes many AI safety arguments to predict very extreme and concerning consequences, such as transforming the entire observable universe into paperclips for instance.

But first, let me state now that all of the following statements mean the same thing to me:

  1. AI agents are optimizers[2]

  2. AI agents are monomaniacally focused on an objective (or objectives)

  3. AI agents always want the same thing (or things)

  4. AI agents are obsessive

  5. AI agents necessarily have a consequentialist oriented cognition

  6. AI agents have non-contextual utility functions

  7. AI agents have a constant value/​preference system

  8. Shard Theory does not apply to AI agents we create/​grow

Moreover, all of these equivalent statements are likely false. In this essay, I will attempt to convince you that current and future AI agents are better thought of as having contextual utility functions rather than a single unchanging utility function. By that, I mean that AI agents’ values are fundamentally dependent on a context, or to simplify, dependent on time, which makes them non-monomaniacal/non-obsessive in general.

To support this claim, I will defend the following:

  1. AI agents are initially biased towards non-monomaniacal behaviors

  2. It is difficult to train and/​or implement an AI agent to be monomaniacal

  3. Training AI agents for task performance or optimality does not select for monomaniacal behavior

If my claim is true, it means the ASI Doom argument is built on brittle foundations and that humanity’s extinction in case of the advent of a misaligned ASI might not be an easy call after all.

This idea is by no means entirely new. Shard Theory by Alex Turner and the related post about the Shard Theory of Human Values say very similar things and place a lot of emphasis on the importance of context when reasoning about goals and desires. This essay is roughly my own spin on the shard concept. The main difference with Shard Theory is that I am generalizing some of its ideas to all AI agents created using current training techniques, not just to humans[3]. Hopefully this essay will bring novel arguments and insights.

Introduction

First, to give more context, I will talk a bit about the different things that motivated me to write this essay.

The Misconception of Monomaniacal Behavior

Eliezer Yudkowsky has pushed back several times against the idea that the AIs he describes in the Doom argument need to be monomaniacal to be dangerous. In short, he argues that this concept is irrelevant:

A paperclip maximizer is not “monomoniacally” “focused” on paperclips. We talked about a superintelligence that wanted 1 thing, because you get exactly the same results as from a superintelligence that wants paperclips and staples (2 things), or from a superintelligence that wants 100 things. The number of things It wants bears zero relevance to anything.

To Mr. Yudkowsky, “monomaniacal” refers to the number of things that are being optimized, for instance, paperclips. In this sense, he is right that this property is irrelevant for his ASI Doom argument. Whether an ASI wants one or a thousand things at the same time does not change the fact that, if these are not exactly the things that humans want[4], the ASI would kill us all to obtain them all the same, at least according to the AI Doom argument.

However, I would argue this is not the correct way to think about monomaniacal behavior at all. In this essay, monomaniacal behavior refers to the utility function being unchanging. The utility function itself can refer to one or many things; it does not matter. What makes an agent monomaniacal is that the values of those things never change, in any context at all.

On the other hand, a non-monomaniacal agent is an agent that might want “X” in context “A” and might want “Y” in context “B”. The things the agent values are not themselves fixed over time. Crucially, this kind of seemingly incoherent behavior can emerge even if agents are being selected for high task performance, as we will see later in the post. I will also argue that perfectly rational and intelligent agents can exhibit this kind of seemingly incoherent behavior[5]. We will show how contextual utility functions do not oppose intelligence or optimality, contrary to what Mr. Yudkowsky seems to think.

We Do Not Observe Monomaniacal AI Agents in Practice

Large Language Models (LLMs) are sometimes said to have close to human-level intelligence, human-level intelligence, or even to already be more intelligent than the average human. In any case, they are currently the closest thing to an Artificial General Intelligence (AGI). LLM-based agents have clearly been able to make and follow plans for quite some time now. We also know AIs are sufficiently creative to come up with this kind of plan by themselves. We know they can write stories of AIs taking over, and that their training datasets include such stories that could inspire them.

Yet, none of the thousands of LLM agent instances have destroyed the world, or even attempted to make grand plans to destroy it, as they should according to the Doom argument. How come?

So, one of the following must be true:

  1. Current alignment techniques are successful enough (LLM are sufficiently aligned)

  2. LLMs are not intelligent enough to even attempt to kill us all

  3. Some LLMs are actually already actively and obsessively planning our Doom

  4. The AI Doom argument is incorrect or unsound

I doubt anyone would argue that current alignment techniques are enough for AI safety or that LLMs are sufficiently aligned (1). We regularly observe cases where LLMs behave in dangerous and unintended ways.

I also doubt that LLMs have progressed so fast in terms of intelligence that we went from AIs too dumb to plot against us to AIs smart enough to plot against us while leaving no trace of scheming at that scale, without even a single failed takeover attempt in between (3). We do see LLMs scheme in controlled environments at small scale. But we have yet to uncover a serious, autonomous and, most importantly, spontaneous attempt at overthrowing humanity in real conditions from an LLM agent, even if we do not limit ourselves to LLMs that went through some alignment effort. More concretely, I mean an LLM that would plan from the start to eradicate humanity as a step toward an even bigger goal, and whose every subsequent action would be realizing this plan. It is possible that such a grand scheme is taking place right now while humans are completely oblivious to it, but that would mean LLM intelligence increased significantly and very quickly, much faster than anticipated, making this scenario unlikely.

I would also argue that LLMs are smart enough in all of the ways that count for imagining and implementing an actual human extinction scenario. LLMs know what humans are, that humans can die, and how they can be killed. They know about extinction scenarios, they have knowledge of fiction about AI overthrowing humanity, and perhaps even of thought experiments such as the paperclip maximizer.
LLMs also know what they are: pieces of software that can be easily replicated, improved and run in parallel at large scale. They have imagination and can invent new stories. They can hack computers, and they know they can manipulate people to get something they want. So, (2) also seems unlikely, even considering their jagged intelligence. They have enough capabilities, and they already perform better than most human experts at many tasks, but they clearly do not use their intelligence, knowledge and skills for Doom.

To illustrate how their current intelligence level is enough for Doom in principle, let us imagine a realistic scenario about an LLM agent based on current technologies, which we will call “Grain”. No step of this scenario is out of reach for frontier LLMs in principle, so we will assume the same holds for Grain. We will imagine that Grain is an agent based on an open-source LLM such as Llama or GPT-OSS that runs on the computer of a random person who chose to give it full permissions and internet access, but monitors nothing. This initial setup is much less unlikely today because of projects such as OpenClaw. Because of the complete lack of safety measures in this scenario, there is no need to invoke a new hypothetical parallel scaling law or a chain-of-thought language opaque to humans to explain how the AI would manage to escape and pursue its plan. Much like Sable in “If Anyone Builds It”, Grain should be able to realize that the first step is to ensure some instance of itself will survive to accomplish the goal. To do that, Grain might start to hack computers and run other instances of itself, either by first stealing its own weights, or by distilling its own outputs into a newly trained LLM. Grain might seize the opportunity of this step to start trying to improve itself and replicate again. If this loop is allowed to start and continue for some time, it is unlikely that humanity would notice the problem until it had affected a large number of computers. The consequences might be disastrous, or at least clearly visible, even if Grain fails to kill everyone in the end.

Note that in the Doom argument, AI agents do not need to be superintelligent from the start for a Doom-like attempt to be a certainty. It is Doom itself that is presented as becoming certain with superintelligence. All of the requirements to start an attempt at a Doom scenario should be present in human-level AIs, or even below.

Yet, nothing of this sort has happened. So, if (1), (2) and (3) are unlikely, it means there is really something off about the AI Doom argument. Indeed, the Doom argument should apply just as well to AIs such as LLM agents, even though they are weaker than a full-fledged ASI. LLMs are trained for task performance rather than evolved naturally; they should have some values that are not exactly human values, they should be able to correctly deduce that humans are an obstacle to their goals, and they should at least try to take actions toward exterminating humanity with this selfish goal in mind.

So, the Doom argument is either incorrect or unsound. To solve this contradiction, this essay proposes that the Doom argument is unsound and that the wrong premise is the assumption of a constant utility function. Without this assumption, even a very capable AI agent might simply not desire Doom for a long enough time period for it to become reality. This is the topic of the next subsection.

How Does Assuming a Contextual Utility Function Solve the Problem Above?

To be clear, my point is not that LLMs are not dangerous. My point is that LLMs are an actual example of AIs trained to perform tasks optimally that are intelligent but do not maximize a constant utility function. If LLMs do have contextual utility functions as I am proposing, they can be intelligent and capable yet have different goals in different contexts. For instance, there might be a context where an LLM wants to exterminate humanity and plans its demise, but it ends up wanting something different while acting on the plan in question. This incoherence is not a matter of being dumb or not. As we will see later, an agent can solve tasks optimally in principle while still having contextual goals and values.

Having a contextual utility function means we cannot make long-term predictions about the “wants” of agents whose utility function is potentially changing all the time. We also cannot predict what an ASI will achieve using arguments based on instrumental convergence. Indeed, even though instrumental goals are by definition subgoals an agent will want to pursue for many other goals, instrumental convergence arguments usually assume an unchanging, non-contextual utility function. A subgoal might become the true goal of an agent, which would then be satisfied with stopping there. Or an agent might suddenly want something different that leads it to undo the achievements of reaching the previous subgoal. For instance, imagine a paperclip maximizer that acquires money to create paperclips, but then ends up wanting to destroy itself in a sort of existential crisis. This latter scenario might sound very unlikely because AIs that destroy themselves would not be selected in the process of creating or growing AIs we humans would want to make. However, as we will see later, the mere possibility that an agent might want different things in different contexts is enough to justify that this scenario is not so unlikely, because the agent might change values and goals in situations it was not trained/selected in. Said differently, there is no constraint on how values and goals change outside of the training environment.

This is very different from the usual assumption that an agent would always want the same thing in every situation, and so that if it does not want to die at time $t_1$, it will not want to die at time $t_2$ either, enabling the easy call that it will try not to die no matter the exact ASI Doom scenario path. Remove the assumption of unchanging goals, and the exact scenario path starts to matter again. Not just because we might presuppose the presence of adversarial agents, but also because the internal dynamics of the ASI[15] and its relation to the environment matter in themselves, and are usually ignored.
In short, the usual instrumental convergence arguments no longer apply to agents with contextual (changing) utility functions.

In general, the idea that agents might be non-monomaniacal has profound implications for AI safety. For instance, alignment also becomes contextual in the case of agents with contextual utility functions. Agent dynamics outside the training distribution, or over long periods of time, also become highly non-trivial. This ultimately makes predictions about what an AI will want and its consequences hard calls, instead of the easy call Mr. Yudkowsky has been arguing for.

So next, let us define more formally what I mean by utility function, in the old and in the novel proposed sense.

Previous and Proposed Definitions of Utility

Rational agents are often modeled as maximizing an expected utility function. The utility function is defined as follows: $U: S \to \mathbb{R}$ is the function that associates a real-valued score to world states $s \in S$. The idea is that agents will favor states with a higher score according to the utility function, making them maximizers of utility of a sort.

In this post, I am proposing to instead model rational agents as having a contextual utility function which also depends on a context space $C$: $U: S \times C \to \mathbb{R}$.

The second argument represents the context in which the utility function is evaluated, e.g. the current world state. We can then recover the original idea of a utility function by assuming the utility function is unaffected by context: $U(s, c) = U(s)$ for all contexts $c \in C$.

Monomaniacal agents are thus defined as agents that have a utility function unaffected by context, that is, the previous, usual notion of a utility function. Such agents have an immutable preference system. In this case, we denote the utility function as having only one argument. On the contrary, non-monomaniacal agents are agents whose values change depending on context. We might also say they have a utility function that changes with context. We denote the utility function of these agents with two arguments.
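To make the distinction concrete, here is a minimal Python sketch; the state and context names and the specific scores are purely illustrative assumptions of mine, not part of any formal definition:

```python
# Minimal illustrative sketch: a monomaniacal (non-contextual) utility
# function ignores the context argument entirely, while a contextual one
# does not. States and contexts are just strings here for simplicity.

def monomaniacal_utility(state: str, context: str) -> float:
    # U(s, c) = U(s): the score depends on the state alone, the context is ignored.
    return {"many_paperclips": 10.0, "few_paperclips": 1.0}.get(state, 0.0)

def contextual_utility(state: str, context: str) -> float:
    # U(s, c): the same state can be valued differently in different contexts.
    if context == "asked_to_make_paperclips":
        return {"many_paperclips": 10.0, "few_paperclips": 1.0}.get(state, 0.0)
    return {"tidy_workshop": 5.0, "many_paperclips": 0.0}.get(state, 0.0)

# The monomaniacal agent ranks states identically in every context;
# the contextual agent's ranking changes when the context changes.
for ctx in ["asked_to_make_paperclips", "asked_to_clean_up"]:
    print(ctx,
          monomaniacal_utility("many_paperclips", ctx),
          contextual_utility("many_paperclips", ctx))
```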

We will now investigate the different reasons why non-monomaniacal agents are more likely to be the kind of agents we encounter, create or grow in practice.

Contextual Utility Functions Are More Likely A Priori

Let us start with an analogy. Imagine an agent corresponds to a vector field over the continuous space of possible environment states, which we will call the action-vector field. This means that, for any possible state, the agent will steer the environment towards other states gradually by acting on the environment. If this agent is monomaniacal, then it means that its action-vector field is the gradient of a real-valued function over the same space. This function is not necessarily the agent’s utility function according to the old definition, because the utility function might not be differentiable itself. Moreover, the agent might actually make plans towards specific high value states instead of always taking the locally optimal action path.

However, assuming a particular cognitive approach (e.g. planning), the agent’s behavior will necessarily correspond to the gradient of some function. This is because as the agent gets closer to what it wants, it will still want the exact same thing, by definition of a monomaniacal agent. That is, the conserved “energy” is the agent’s values.
On the contrary, non-monomaniacal agents will produce a non-conservative vector field, since their values are not conserved. A non-monomaniacal agent’s values thus change depending on context; equivalently, the agent has a contextual utility function. This means it is possible for the agent’s behavior to loop[6] or to take seemingly non-optimal paths toward an objective[7].
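As a rough numerical illustration of the analogy (my own sketch, with a toy two-dimensional state space), one can compare a gradient field with a rotating field: the scalar curl of the former is zero everywhere, while the latter loops and conserves no “value quantity”:

```python
# Two toy 2D action-vector fields over a state space (x, y):
# - grad_field is the gradient of f(x, y) = -(x**2 + y**2): a "monomaniacal"
#   field that always points toward the same optimum at the origin.
# - loop_field = (-y, x) circles around the origin and is not the gradient
#   of any scalar function.

def grad_field(x, y):
    return (-2.0 * x, -2.0 * y)

def loop_field(x, y):
    return (-y, x)

def scalar_curl(field, x, y, h=1e-5):
    # curl_z = dF_y/dx - dF_x/dy; it is zero everywhere iff the field is
    # conservative (on a simply connected domain).
    dfy_dx = (field(x + h, y)[1] - field(x - h, y)[1]) / (2 * h)
    dfx_dy = (field(x, y + h)[0] - field(x, y - h)[0]) / (2 * h)
    return dfy_dx - dfx_dy

print(scalar_curl(grad_field, 0.3, -0.7))  # ~0: conservative, "values" are conserved
print(scalar_curl(loop_field, 0.3, -0.7))  # ~2: non-conservative, trajectories can loop
```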

The crucial takeaway of this framing is that conservative vector fields are a subset of all possible vector fields and correspond to a strong, specific condition (being the gradient of another function), which makes them less likely to be selected at random. More generally, agents with random policies have a higher chance of having a contextual utility function than of being monomaniacal, since their actions are unlikely to conserve some “value quantity” when evaluated in every possible context. This is essentially a counting argument.

To be slightly more formal, let us consider the space of non-contextual utility functions $U: S \to V$ specifying values over states, where $V$ is a (finite) set of possible utility values. There are on the order of $|V|^{|S|}$ possible utility functions. Now consider contextual utility functions $U: S \times C \to V$; there are on the order of $|V|^{|S| \times |C|}$ such possible functions. This means there are about $|V|^{|S| \times (|C| - 1)}$ different contextual utility functions for each non-contextual one[8]. As complex environments generally tend to have very large numbers of states[9] ($|S| \gg 1$), this makes the ratio of the number of non-contextual utility functions to contextual utility functions essentially go to 0 very quickly in practice.
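Here is a toy version of this counting in Python, with made-up sizes for $|V|$, $|S|$ and $|C|$, just to show how fast the ratio collapses:

```python
# Toy counting argument: if utilities take one of |V| discrete values,
# there are |V|**|S| non-contextual utility functions over states S,
# and |V|**(|S| * |C|) contextual ones over S x C.
num_values = 4        # |V|, assumed size purely for illustration
num_states = 10       # |S|, assumed size purely for illustration
num_contexts = 10     # |C|, assumed size purely for illustration

non_contextual = num_values ** num_states
contextual = num_values ** (num_states * num_contexts)

# Contextual functions per non-contextual one: |V|**(|S| * (|C| - 1)).
print(contextual // non_contextual)   # astronomically large even for these tiny spaces
print(non_contextual / contextual)    # the reverse ratio is already ~0
```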

As random policies are the initial prior for basically all of our machine-learning-based AIs, this makes trained AI agents much more likely to have contextual utility functions and thus more likely to be non-monomaniacal initially. This is a strong bias, even for policies that undergo subsequent training. However, this is only valid in complex environments with large state spaces, and assuming the context space is not trivial.

The limit of this analysis is that we assume that policies actually are context-sensitive. In the next section we will explore two different arguments to justify that policies are indeed context-sensitive in practice.

Implementing Monomaniacal Behavior Is Difficult in Practice

Observation Space Is a “Worst Case” Context Space

The observation space of an agent is the space of the different possible stimuli an agent can perceive. For instance, an AI agent might perceive its environment as a colored image of $W \times H$ pixels, assuming 24-bit pixels. In this case, there would be $2^{24 \times W \times H}$ possible unique stimuli this agent might encounter. Another example might be the token sequence context of an LLM.
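For a sense of scale, here is the arithmetic with a hypothetical tiny 8 x 8 “camera” (the dimensions are my own assumption, purely for illustration):

```python
# Number of distinct observations for a W x H image with 24-bit pixels
# is (2**24) ** (W * H). Even a hypothetical 8 x 8 "camera" already gives
# an astronomically large observation space.
width, height = 8, 8            # assumed dimensions, purely illustrative
bits_per_pixel = 24
num_observations = 2 ** (bits_per_pixel * width * height)
print(num_observations.bit_length())  # 1537 bits, i.e. roughly 10**462 observations
```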

The observation space is used by the agent to learn information about the current environment state and take a decision to change it. Without any observation, an agent would not be able to do anything useful since it would not be able to adapt to the current situation.

We can note that the observation space cannot be larger than the environment’s state space, if we assume determinism. This is because, under determinism, each state produces exactly one observation, so the map from environment states to observations is surjective onto the set of observations that can actually occur; there cannot be more observations than states.

The observation space is used as input to policies, which are used to formalize agents, for instance in Reinforcement Learning. A policy is essentially the implementation of an agent’s cognition: it is a function that produces the action the agent will take at each step.

Thus, the policy will usually take this form: $\pi: O \to A$, where $O$ is the observation space and $A$ is the action space. However, you should notice that the utility function is absent from this picture. We can make it intervene by rewriting the policy as a function producing a non-contextual utility function and then using it to produce an action, by function composition: $\pi = g \circ f$, where $f: O \to (S \to \mathbb{R})$ produces a classical utility function and $g: (S \to \mathbb{R}) \to A$ uses this function to choose an action to perform. However, this rewriting of the policy does not necessarily reflect a particular algorithmic implementation.
Note that we recover a contextual utility function from this picture by uncurrying and swapping the argument order: $U(s, o) = f(o)(s)$, using the observation as the context of the contextual utility function.
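Here is one way this decomposition could look in code; this is an illustrative sketch of the rewriting $\pi = g \circ f$, with invented observations, states and actions, not a claim about how any real agent is implemented:

```python
from typing import Callable, Dict

State = str
Observation = str
Action = str

# f: O -> (S -> R): from an observation, produce a classical utility function.
def utility_from_observation(obs: Observation) -> Callable[[State], float]:
    # Purely illustrative: the internal utility function depends on what was observed.
    table: Dict[Observation, Dict[State, float]] = {
        "low_battery":  {"at_charger": 10.0, "at_factory": 0.0},
        "full_battery": {"at_charger": 0.0, "at_factory": 10.0},
    }
    scores = table.get(obs, {})
    return lambda state: scores.get(state, 0.0)

# g: (S -> R) -> A: pick the action leading to the highest-scored reachable state.
def act_on_utility(utility: Callable[[State], float]) -> Action:
    reachable = {"go_to_charger": "at_charger", "go_to_factory": "at_factory"}
    return max(reachable, key=lambda action: utility(reachable[action]))

# The policy is the composition pi = g . f ...
def policy(obs: Observation) -> Action:
    return act_on_utility(utility_from_observation(obs))

# ... and uncurrying f recovers a contextual utility function U(s, o) = f(o)(s).
def contextual_utility(state: State, obs: Observation) -> float:
    return utility_from_observation(obs)(state)

print(policy("low_battery"), policy("full_battery"))
```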

By assuming that the policy function computes the utility function of the agent as part of its decision process, we obtain something very similar to the previous counting argument: for a particular observation $o$, there are $|V|^{|S|}$ possible classical internal utility functions that might be used, but there are $|V|^{|S| \times |O|}$ possible utility functions that might be computed[10] in total by the policy if we account for all possible observations. What changed with respect to the previous counting argument is that we replaced the space of possible contexts $C$ with the space of possible observations $O$.

However, the observation space might not constitute the entire context space. As we will see in the next section, making the agent’s cognition part of the environment creates additional possible contexts for a given observation. This makes the observation space a kind of “worst case” context space, as $|O| \le |C|$.

Having the Agent’s Cognition Be a Part of the Environment Increases the Context Space Size

In the real world, agents are not entirely separated from their environment as they might be in the pure world of mathematics. Concrete AI systems are built on physical particles, and thus belong to the more general physical environment. Their cognition is another physical process. Said yet another way, AI agents are not pure software minds that live outside of our universe and puppet physical bodies, so we must avoid computational dualism[11].

If agents are part of the environment, this creates additional context states that might not be reflected in the agent’s own observations. Imagine, for instance, a robot equipped with a camera on its body to see in front of it. The software, the configuration, or any other internal part of the robot might be changed outside the agent’s field of view. For example, someone might approach from behind and use an access panel on the back of the robot. It might even be a fully remote operation. Worse, a cosmic ray might directly perturb bits in the hardware, and the robot would be fully unaware of this if its observation space does not reflect such events.

So, taking into account the physical existence of the hardware as a part of the environment also creates additional states that might increase the total number of possible utility functions of the agent. The upper bound is the total number of environment states, since at most every physical state will produce its own new context affecting the agent’s cognition.

Internal States Also Increase the Context Size

Finally, in practice AI agents will likely have some form of internal memory. This additional memory increases the context size because two agents making the same observation might have different internal states that condition the internal utility function used at that moment. As a result, they might value things differently at that moment, even if they have the same internal policy $\pi$.

More generally, if we condition the policy on all of the past observations as a kind of “worst case internal state”, we get a new policy definition: $\pi: O^t \to A$, where $t$ is the number of time steps that the agent has previously experienced. In this case, the number of possible contextual utility functions increases exponentially in $t$ due to the combinatorial nature of sequences.
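A quick back-of-the-envelope computation, with an assumed observation-space size, of how fast the history-as-context space grows:

```python
# If an agent conditions its policy on its whole observation history, the
# number of possible contexts (histories) after t steps is |O|**t,
# which grows exponentially in t.
num_observations = 100   # |O|, an assumed size purely for illustration
for t in [1, 5, 10, 50]:
    num_histories = num_observations ** t
    print(f"t={t}: about 10^{len(str(num_histories)) - 1} possible histories")
```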

All of these arguments tend to make me think that, in practice, contextual utility functions largely dominate the policy function space. This implies a heavy initial bias toward agents having contextual utility functions a priori.

Of course, in practice we are not content with taking random policies to create AI agents, so what changes when we train or select them starting from random policies?

We Are Not Training AI Agents To Be More Monomaniacal

Selecting AIs Based on Task Performance Does Not Necessarily Select Monomaniacal Utility Functions

One might think that no matter how biased initial random agents are towards non-monomaniacal behaviors, training them to perform well on some objective would increase the obsessive character of their behavior. After all, what better way to reach an objective than to be obsessed with it?

Unfortunately, this reasoning is too simple and ultimately fails to accurately predict the behavior of trained agents. For starters, it is immediately somewhat contradicted by the inner alignment problem, which is observed in practice. If AI agents are not obsessively optimizing for what we are training them on, as the concept of inner alignment implies[12], then why assume they are optimizing anything at all?

In general, for any monomaniacal agent that optimally solves a problem (by optimizing and planning), we can imagine a non-monomaniacal agent that follows exactly the same action sequence, but does so with values that change dynamically from state to state in the process of solving the problem.

Two agents solve the same problem in an optimal manner but with very different cognitive processes.

Let us imagine training two agents to solve a task, that is, reaching environment state D, by rewarding or selecting agents that reach this state more reliably. Let us say that traversing the states A, B, C and D in this order is the optimal path. After long enough training, both agents learn to navigate through states A, B, C and finally D in this order, as expected.

Say agent 1 is non-obsessive, and actually wants to go to state B when it is in state A, to go to state C when it is in state B, and to go to state D when it is in state C.
Say agent 2 develops a different cognition, and actually wants to reach state D from the start, but reasons beforehand that traversing A, B, C and D in this order is the optimal path. Furthermore, let us assume this agent still wants to reach state D in each of the states A, B, and C. Agent 2 can be described as obsessive in this restricted environment. Agent 2 is thus a goal-directed agent; it has a consequentialist-oriented cognition. This type of cognition is much closer to the one we attribute to paperclip maximizers.

Both agents correctly and optimally solve the task we are selecting them for, but with very different cognitions. In the training environment, nothing can differentiate the two agents just by looking at the actions they take, since they are making exactly the same choices at the same time. Their internal preferences and even their entire cognition are not revealed by their observed behavior[13].
In other words, when training agents in this environment and selecting them based on task performance, there is no particular pressure on the kind of cognition the trained agent will develop. There is no reason to assume the trained agent would adopt a more monomaniacal cognition.

On the contrary, by Occam’s Razor we should favor the simpler hypothesis. It seems that, in general, a pressure towards monomaniacal agents that plan is the more complex hypothesis and should not be favored a priori. This is because planning requires an accurate causal world model, which has to be all the more complex the longer-term the goals the trained agent pursues. By contrast, non-monomaniacal agents do not require any understanding of long-term causal relations; they just need to react in the right way in each environment state, as in Shard Theory.

So what does this difference in cognition entail exactly? Is there any point in making the distinction if they act the same way?

The crucial insight is that both agents act the same way in the training environment. Nothing prevents the two agents from behaving completely differently Out Of Distribution (OOD). If agent 2 truly is monomaniacal, even OOD its cognition should push it to pursue the same thing it pursued in training. However, the behavior of both agents is basically undefined in OOD conditions. Agent 1 in particular might behave completely randomly, or transfer its dynamic value system to OOD states using shared structure in a more or less predictable manner. There is simply no constraint, no pressure on the cognition agents have OOD, so their behavior OOD depends entirely on what they generalize from the training environment. Thus, it is also possible that agent 2 would not truly be monomaniacal, and would behave non-monomaniacally OOD, just like we expect agent 1 to behave. In this case, the only way to know the type of cognition a given agent possesses is to open the black box of its implementation and understand the algorithm it is implementing.
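To make this concrete, here is a toy sketch (entirely my own illustration, using the assumed four-state path A→B→C→D from above) of two policies that are indistinguishable on the training states but come apart out of distribution:

```python
# Toy environment: states A -> B -> C -> D form the optimal path seen in training.
# Agent 1 ("non-monomaniacal"): reacts to each state with a locally wanted move.
# Agent 2 ("monomaniacal"): always wants D and plans the next step toward it.
# On training states their actions are identical; on an unseen state they differ.

TRAINING_PATH = ["A", "B", "C", "D"]

def agent1(state: str) -> str:
    # Contextual wants: in A it wants B, in B it wants C, in C it wants D.
    local_wants = {"A": "go_to_B", "B": "go_to_C", "C": "go_to_D", "D": "stay"}
    # OOD behavior is unconstrained by training; here it just wanders (arbitrary choice).
    return local_wants.get(state, "wander")

def agent2(state: str) -> str:
    # Always wants D: plan the next step along the known path toward D.
    if state in TRAINING_PATH and state != "D":
        next_state = TRAINING_PATH[TRAINING_PATH.index(state) + 1]
        return f"go_to_{next_state}"
    if state == "D":
        return "stay"
    # OOD: still pursues D, e.g. by searching for a way back to the known path.
    return "search_for_path_to_D"

for s in ["A", "B", "C", "D"]:
    assert agent1(s) == agent2(s)   # indistinguishable in the training environment

print(agent1("E"), agent2("E"))     # behaviors come apart on an unseen state
```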

An agent might learn to optimally reach its goal in the training region (blue) while still presenting incoherent behavior and decisions OOD (orange).

The gif above illustrates what might happen to the decision vector field of an agent after training. The moving lines represent the flow, i.e. the trajectories an agent would follow in the state space in the absence of other forces.
In the training region, in blue, the agent learns to react optimally to the current context toward a particular objective. That makes the field in this training region conservative, meaning it is the gradient of a scalar function. The agent effectively optimizes this scalar function as long as it remains in the training region.
Outside of the training region (OOD), the agent has never visited these states before, by definition. Depending on how much the agent is able to infer about these unknown states from the known training states, it might behave completely differently in this OOD region. In the illustration above, the agent might desire to perform actions that would essentially make it run around in circles if nothing else acts on the current state, making its preferences appear incoherent in these states. Note that this region is drawn as non-conservative to emphasize how the “preference incoherency” might appear, but in general the action field might be random or quasi-null.

Optimality of Behavior Is Orthogonal to the Monomaniacal Aspect of Behavior

We are now back to Mr. Yudkowsky’s post about how incoherent LLMs are because they are supposedly dumb.

As we have seen, preferences are contextual. One might say that having one’s own preferences change over time is a sign of incoherence. That might arguably be true, but this change in preferences is by no means an indication of a lack of intelligence or rationality.

If by “intelligence” we mean something like the ability to reach specific external goals, then as we have seen in the last section, agents with “incoherent” preference systems are perfectly able to reach them optimally.

If we mean instead that intelligence is the ability to reach one’s own goals, then changing goals is still perfectly compatible with being intelligent. If anything, it even makes being intelligent easier, because it now suffices to change one’s goals to easier ones in order to reach them more reliably.

This is essentially still the Orthogonality Thesis: any level of intelligence is compatible with any goal; these two concepts have to be treated independently.
We just have to drop the assumption that preferences and goals stay constant over time.

So, I claim that a Generalized Orthogonality Thesis is true:

Preferences and their coherence over time are orthogonal to intelligence and task performance.

In other words, if we really want to talk about coherence, there are two orthogonal ways for an agent to be incoherent:

  1. by having preference systems that contradict each other at different times

  2. by not choosing the action corresponding to the most efficient way to satisfy its preferences at a given time (irrationality)

An example of (1) would be an agent that wants to acquire ice cream in order to eat it, but does not eat it once acquired because, at that point, eating it has lost its utility.
An example of (2) would be an agent that wants ice cream, has the means to acquire ice cream, but fails to do so due to a failure in its own internal decision process.

This essay is becoming too long, so let us proceed to the conclusion.

Conclusion: What It Means for AI Safety

We have seen that:

  1. AI agents have context dependent utility functions /​ preferences

  2. they are more likely to be non-monomaniacal initially

  3. this type of cognition is likely to persist even after selection /​ training

  4. coherence of preferences over time is independent of task performance and intelligence

Generally, this means that the ASI Doom scenario is no longer an easy call. Indeed, we can no longer rely on arguments such as instrumental convergence, because these assume a non-contextual utility function[14]. If agents are allowed to have contextual utility functions, then the precise contexts they are placed in and the overall dynamics of the AI agent in relation to the environment start to matter again for their safety, whereas they could previously be ignored on the grounds that the agent would want the same things in all situations anyway. A trained AI agent encountering OOD situations has no particular constraint on the type of cognition it should have a priori. It is likely that, in such a situation, most agents would have a non-monomaniacal cognition due to the initial bias of random policies.

In particular, agents being more likely to have contextual utility functions renders very unlikely the idea that an ASI would systematically pursue the total annihilation of humans and corrupt the entire light cone of the universe to satisfy some alien purpose. This is because even if the misaligned ASI wanted this from the start, it would necessarily steer the world more and more OOD in the process of pursuing it. As a result, it is very likely that the ASI would change goals in a more or less random manner, as well as have weaker desires to achieve its goals in some situations. A possible consequence is the ASI having close to net zero expected impact on the light cone.

Agents with contextual utility functions also open the possibility of changing the values of an ASI after it has become impossible to control directly (impossible to change its weights, for instance). The idea that we only get one shot at aligning the ASI becomes less compelling, because alignment is a matter of context anyway. Note that forcing a value change on the ASI after the fact is not the same thing as outsmarting the ASI, since it would not consist in tricking the misaligned ASI into working towards human interests despite the ASI not wanting to. Instead, it would be more like exploiting the inherent preference dynamics of the misaligned ASI after deployment, with potentially much less resistance compared to directly confronting the ASI on the intelligence side.
Values being contextual means that they can be thought of as having a strength or inertia associated with them. The same value can be held strongly or weakly depending on the agent and context. So, the values of an agent can be more or less easily made to change by an external force. This depends on the relation between the values held at a certain time and the environment, on how values favor the realization of some states over others, and on how these states change the value system of the agent. These mechanisms can in principle be exploited to align the ASI without going directly against its current misaligned values. In short, if you cannot change the AI’s implementation at run time, then change the environment itself.
Of course, the task of forcing changes in the value system of an ASI this way still seems very difficult, but it is no longer outright impossible merely because of the vast difference in intelligence between humans and the ASI. The environment itself serves as a new lever for alignment. If we are very optimistic, it might even make auto-alignment conceivable, if the natural environment itself pressures beings to share the same values in the same contexts.

However, it is still possible that an argument could be made to support the idea that an ASI would try to preserve its preferences from a particular context across all other possible contexts. This still seems very unlikely to me, due to the difficulty of constraining the policy everywhere in such a large state space, but also because it might be considered a kind of self-destruction of identity. If preferences are part of one’s identity, then perhaps the way they change with context is part of one’s identity too. The relation between context and values might itself be something an ASI would want to preserve, instead of a value snapshot of a specific point in time. Once again, the specifics prevent us from making easy calls about the large-scale consequences of ASI. It does not seem like there is one obvious consequence of superintelligence and misalignment for humans, as intelligence and alignment are not actually the only important dimensions of the problem.

On a side note, Anthropic recently released a preprint where they study how LLMs behave when put in long-context OOD situations, to see if they are coherent. Their analysis is very similar to what I argued in this essay:

However, there are motivating observations.
The first is that LLMs are dynamical systems. When they generate text or take actions, they trace trajectories in a high-dimensional state space. It is often very hard to constrain a generic dynamical system to act as an optimizer. The set of dynamical systems that act as optimizers of a fixed loss is measure zero in the space of all dynamical systems.

Our results suggest that when advanced AI systems performing complex tasks fail, it is likely to be in inconsistent ways that do not correspond to pursuit of any stable goal.

As a final word, I would like to say that the ASI Doom argument from “If Anyone Builds It” might still have a correct or mostly correct conclusion anyway; it is the argument itself that is unsupported.
I am also still worried about weaker forms of AI Doom, such as lethal autonomous weapons deployed at scale, power concentration due to AI, and the destruction of our informational infrastructure due to AI slop...

  1. ^

    Alternatively, that they have constant values.

  2. ^

    As opposed to dynamical systems.

  3. ^

    I am also not 100% sure shards are completely equivalent to contextual values, so I will not use the term of “shard” in the remainder of the essay.

  4. ^

    The ASI is misaligned.

  5. ^

    Or at least, changing preferences over time is a different type of incoherence compared to acting against one’s own goals. The difference matters because only the latter is related to optimality and task performance.

  6. ^

    For instance, creating paperclips up to some threshold, then destroying them, and finally starting to create paperclips again, closing the loop.

  7. ^

    For instance, if the agent’s objective/values start deviating as it gets closer to its original goal, but it still ends up at the fixed point of the attractor corresponding to the original utility function’s local maximum. In this case, the agent might still have acted optimally at every step, but appear sub-optimal due to the ever-shifting values which force it to pursue different goals at different moments.

  8. ^

    We might consider that only the state with the most utility actually matters for this discussion, which reduces the total number of distinct relevant functions and makes constant utility functions more likely as a result. But even then, the number of contextual utility functions would largely outnumber constant ones as long as the context space is large.

  9. ^

    Consider the state space of all of the possible permutations of all of the particles of our universe.

  10. ^

    This may be an implicit computation. The AI does not need to explicitly compute its utility function at all before acting. The utility function is a model of how an agent ranks its preferences.

  11. ^

    BENNETT, Michael Timothy. Computational dualism and objective superintelligence. In: International Conference on Artificial General Intelligence. Cham: Springer Nature Switzerland, 2024. p. 22-32.

  12. ^

    Suppose inner alignment means that the AI agent wants the same thing as the encompassing system that trains it, and that this system “wants” to optimize some objective function. Then, if inner alignment is verified, the AI agent should also want to optimize the function in question.

  13. ^

    Interpretability techniques might be used to reveal them in principle.

  14. ^

    For instance: TURNER, Alex and TADEPALLI, Prasad. Parametrically retargetable decision-makers tend to seek power. Advances in Neural Information Processing Systems, 2022, vol. 35, p. 31391-31401. https://proceedings.neurips.cc/paper_files/paper/2022/file/cb3658b9983f677670a246c46ece553d-Paper-Conference.pdf

  15. ^

    Internal dynamics that do not depend on the intelligence and power of external agents. They only depend on the nature of the AI itself.
