*(edit: discussions in the comments section have led me to realize there have been several conversations on LessWrong related to this topic that I did not mention in my original question post. *

*Since ensuring their visibility is important, I am listing them here: Rohin Shah **has explained how consequentialist agents optimizing for universe-histories rather than world-states can display any external behavior whatsoever**, Steven Byrnes **has explored corrigibility in the framework of consequentialism by arguing powerful agents will optimize for future world-states at least to some extent**, Said Achmiz has explained what incomplete preferences look like (**1**, **2**, **3**), EJT **has formally defined preferential gaps and argued incomplete preferences can be an alignment strategy**, John Wentworth **has analyzed incomplete preferences through the lens of subagents** but **has then argued that incomplete preferences imply the existence of dominated strategies**, and Sami Petersen **has argued Wentworth was wrong by showing how incomplete preferences need not be vulnerable**.)*

In his first discussion with Richard Ngo during the 2021 MIRI Conversations, Eliezer looked back and lamented:

In the end, a lot of what people got out of all that writing I did, was not the deep object-level principles I was trying to point to—they did not really get Bayesianism as thermodynamics, say, they did not become able to see Bayesian structures any time somebody sees a thing and changes their belief. What they got instead was something much more meta and general, a vague spirit of how to reason and argue, because that was what they’d spent a lot of time being exposed to over and over and over again in lots of blog posts.

Maybe there’s no way to make somebody understand why corrigibility is “unnatural” except to repeatedly walk them through the task of trying to invent an agent structure that lets you press the shutdown button (without it trying to force you to press the shutdown button), and showing them how each of their attempts fails; and then also walking them through why Stuart Russell’s attempt at moral uncertainty produces the problem of fully updated (non-)deference; and hope they can start to see the informal general pattern of why corrigibility is in general contrary to the structure of things that are good at optimization.

Except that to do the exercises at all, you need them to work within an expected utility framework. And then they just go, “Oh, well, I’ll just build an agent that’s good at optimizing things but doesn’t use these explicit expected utilities that are the source of the problem!”

And then if I want them to believe the same things I do, for the same reasons I do, I would have to teach them why certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples.

And I have tried to write that page once or twice (eg “coherent decisions imply consistent utilities”) but it has not sufficed to teach them, because they did not even do as many homework problems as I did, let alone the greater number they’d have to do because this is in fact a place where I have a particular talent.

Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level (“So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all”), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what *specific* actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the *general patterns* (such as Instrumental Convergence) of even an “alien mind” that’s sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility according to its world model to pursue a goal that can be extremely different from what humans deem good.

When Eliezer says “they did not even do as many homework problems as I did,” I doubt he is referring to actual undergrad-style homework problems written nicely in LaTeX. Nevertheless, I would like to know whether there is *some* sort of publicly available repository of problem sets that illustrate the principles he is talking about. Meaning set-ups where you have an agent (of sorts) that is acting in a manner that’s either not utility-maximizing or even simply not consequentialist, followed by explanations of how you can exploit this agent. Given the centrality of consequentialism (and the associated money-pump and Dutch book-type arguments) to his thinking about advanced cognition and powerful AI, it would be nice to be able to verify whether working on these “homework problems” indeed results in the general takeaway Eliezer is trying to communicate.

I am particularly interested in this question in light of EJT’s thorough and thought-provoking post on how “There are no coherence theorems”. The upshot of that post can be summarized as saying that “there are *no* theorems which state that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue strategies that are dominated by some other available strategy” and that “nevertheless, many important and influential people in the AI safety community have *mistakenly and repeatedly* promoted the idea that there are such theorems.”

I was not a member of this site at the time EJT made his post, but given the large number of upvotes and comments on his post (123 and 116, respectively, at this time), it appears likely that it was rather popular and people here paid some attention to it. In light of that, I must confess to finding the general community reaction to his post rather baffling. Oliver Habryka wrote in response:

The post does actually seem wrong though.

I expect someone to write a comment with the details at some point (I am pretty busy right now, so can only give a quick meta-level gleam), but mostly, I feel like in order to argue that something is wrong with these arguments is that you have to argue more compellingly against completeness and possible alternative ways to establish dutch-book arguments.

However, the “details”, as far as I can tell, have *never* been written up. There was one other post on this topic, by Valdes, who noted that “I have searched for a result in the literature that would settle the question and so far I have found none” and explicitly called for the community’s participation, but constructive engagement was minimal. John Wentworth, for his part, wrote a nice short explanation of what coherence looks like in a toy setting involving cache corruption and a simple optimization problem; this was interesting but not quite on point to what EJT talked about. But that was it; I could not find any other posts (written after EJT’s) that were even tangentially connected to these ideas. Eliezer’s own response was dismissive and entirely inadequate, not really contending with any of the arguments in the original post:

Eliezer: The author doesn’t seem to realize that there’s a difference between representation theorems and coherence theorems.

Cool, I’ll complete it for you then.

Transitivity: Suppose you prefer A to B, B to C, and C to A. I’ll keep having you pay a penny to trade between them in a cycle. You start with C, end with C, and are three pennies poorer. You’d be richer if you didn’t do that.

Completeness: Any time you have no comparability between two goods, I’ll swap them in whatever direction is most useful for completing money-pump cycles. Since you’ve got no preference one way or the other, I don’t expect you’ll be objecting, right?

Combined with the standard Complete Class Theorem, this now produces the existence of at least one coherence theorem. The post’s thesis, “There are no coherence theorems”, is therefore falsified by presentation of a counterexample. Have a nice day!

In the limit, you take a rock, and say, “See, the complete class theorem doesn’t apply to it, because it doesn’t have any preferences ordered about anything!” What about your argument is any different from this—where is there a powerful, future-steering thing that isn’t viewable as Bayesian and also isn’t dominated?

As EJT explained in detail,

EJT: These arguments don’t work. [...] As I note in the post, agents can make themselves immune to all possible money-pumps for completeness by acting in accordance with the following policy: ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ Acting in accordance with this policy need never require an agent to act against any of their preferences. [...]
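To make the transitivity money pump and EJT’s proposed policy concrete, here is a deliberately minimal toy simulation of my own construction (the goods, the penny fee, and both agent rules are illustrative assumptions, not anything from the quoted posts). The cyclic agent happily pays to go around the A→B→C circle; an agent whose goods are all mutually incomparable, following EJT’s “never choose anything strictly dispreferred to something previously turned down” policy, has no strict preference to motivate paying for any swap, so it cannot be pumped:

```python
# Toy money-pump simulation (illustrative sketch only).
# An adversary repeatedly offers to swap the agent's current good for another,
# charging one penny per accepted swap.

def run(agent_accepts, offers, start="C"):
    """Return (final good held, pennies lost) after processing all swap offers."""
    held, pennies_lost, history = start, 0, []
    for offered in offers:
        if agent_accepts(held, offered, history):
            history.append(held)       # remember what we gave up
            held = offered
            pennies_lost += 1
    return held, pennies_lost

# Agent 1: cyclic (intransitive) strict preferences A > B, B > C, C > A.
CYCLIC = {("A", "B"), ("B", "C"), ("C", "A")}
def cyclic_agent(held, offered, history):
    return (offered, held) in CYCLIC   # accept iff the offered good is strictly preferred

# Agent 2: incomplete preferences (no good is comparable to any other),
# following EJT's policy: never accept an option strictly dispreferred to
# something previously turned down. With no strict preferences at all, no
# swap is ever worth a penny, so it never trades.
def ejt_agent(held, offered, history):
    return False

offers = ["B", "A", "C"] * 3   # adversary cycles the offers three times

print(run(cyclic_agent, offers))   # back to "C", nine pennies poorer
print(run(ejt_agent, offers))      # still "C", zero pennies lost
```

This only illustrates the degenerate case of total incomparability; EJT’s actual argument also covers agents that do sometimes trade, where the policy blocks the final dominated step of a pump rather than all trades.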

This whole situation appears very strange to me, as an outsider; isn’t this topic important enough to merit an analysis that gets us beyond saying (in Habryka’s words) “it does *seem* wrong” and toward “it’s *actually* wrong, here’s the math that proves it”? I tried quite hard to find one, and was not able to. Given that coherence arguments are still crucial argumentative building blocks of the case made by users here that AI risk should be taken seriously (and that the general format of these arguments has remained unchanged), it leaves me with the rather uncanny impression that EJT’s post was seen by the community, acknowledged as important, yet never truly engaged with, and essentially… forgotten, or maybe ignored? It doesn’t seem like it has changed anyone’s behavior or arguments, despite no refutation of it having appeared. Am I missing something important here?

This is going to be a somewhat-scattered summary of my own current understanding. My understanding of this question has evolved over time, and is therefore likely to continue evolving.

## Classic Theorems

First, there’s all the classic coherence theorems—think Complete Class or Savage or Dutch books or any of the other arguments you’d find in the Stanford Encyclopedia of Philosophy. The general pattern of these is:

Assume some arguably-intuitively-reasonable properties of an agent’s decisions (think e.g. lack of circular preferences).

Show that these imply that the agent’s decisions maximize some expected utility function.

I would group objections to this sort of theorem into three broad classes:

1. Argue that some of the arguably-intuitively-reasonable properties are not actually necessary for powerful agents.

2. Be confused about something, and accidentally argue against something which is either not really what the theorem says, or which assumes a particular way of applying the theorem that is not the only way of applying it.

2.a. Argue that all systems can be modeled as expected utility maximizers (i.e. just pick a utility function which is maximized by whatever the system in fact does), and that the theorems therefore don’t say anything useful.

For an old answer to (2.a), see the discussion under my mini-essay comment on Coherent Decisions Imply Consistent Utilities. (We’ll also talk about (2.a) some more below.) Other than that particularly common confusion, there’s a whole variety of other confusions; a few common types include:

Only pay attention to the VNM theorem, which is relatively incomplete as coherence theorems go.

Attempt to rely on some notion of preferences which is not revealed preference.

Lose track of which things the theorems say an agent has utility and/or uncertainty over, i.e. what the inputs to the utility and/or probability functions are.

## How To Talk About “Powerful Agents” Directly

While I think EJT’s arguments specifically are not quite right in a few ways, there is an importantly correct claim close to his: none of the classic coherence theorems say “powerful agent → EU maximizer (in a nontrivial sense)”. They instead say “<list of properties which are not obviously implied by powerful agency> → EU maximizer”. In order to even start to make a theorem of the form “powerful agent → EU maximizer (in a nontrivial sense)”, we’d first need a clean intuitively-correct mathematical operationalization of what “powerful agent” even means.

Currently, the best method I know of for making the connection between “powerful agency” and utility maximization is in Utility Maximization = Description Length Minimization. There, the notion of “powerful agency” is tied to optimization, in the sense of pushing the world into a relatively small number of states. That, in turn, is equivalent (the post argues) to expected utility maximization. That said, that approach doesn’t explicitly talk about “an agent” at all; I see it less as a coherence theorem and more as a likely-useful piece of some future coherence theorem.

What would the rest of such a future coherence theorem look like? Here’s my current best guess:

We start from the idea of an agent optimizing stuff “far away” in spacetime. Coherence of Caches and Agents hints at why this is necessary: standard coherence constraints are only substantive when the utility/”reward” is not given for the immediate effects of local actions, but rather for some long-term outcome. Intuitively, coherence is inherently substantive for long-range optimizers, not myopic agents.

We invoke the Utility Maximization = Description Length Minimization equivalence to say that optimization of the far-away parts of the world will be equivalent to maximization of some utility function over the far-away parts of the world.

We then use basically similar arguments to Coherence of Caches and Agents, but generalized to operate on spacetime (rather than just states-over-time with no spatial structure) and allow for uncertainty.
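The second step leans on the equivalence between utility maximization and description length minimization. A compressed gloss of that equivalence, in my own notation (a sketch of the idea, not the post’s exact formalism):

```latex
Let $X$ be the far-away world state and $M$ a model assigning probability
$P_M(X)$ to the states the optimizer pushes toward, so the description length
of $X$ under $M$ is $L_M(X) = -\log_2 P_M(X)$. Then for policies $\pi$,
\[
  \operatorname*{arg\,min}_{\pi} \; \mathbb{E}_{\pi}\!\left[-\log_2 P_M(X)\right]
  \;=\;
  \operatorname*{arg\,max}_{\pi} \; \mathbb{E}_{\pi}\!\left[u(X)\right],
  \qquad u(X) := \log_2 P_M(X),
\]
so ``push the world into the small set of states favored by $M$'' and
``maximize expected utility $u$'' select the same policies. Conversely, over a
finite state space, any utility function bounded above can be exponentiated
and normalized into such a model, $P_M(X) \propto 2^{u(X)}$.
```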

## Pareto-Optimality/Dominated Strategies

There are various claims along the lines of “agent behaves like <X>, or else it’s executing a pareto-suboptimal/dominated strategy”.

Some of these are very easy to prove; here’s my favorite example. Suppose an agent has a fixed utility function and performs pareto-optimally on that utility function across multiple worlds (so “utility in each world” is the set of objectives). Then there’s a normal vector (or family of normal vectors) to the pareto surface at whatever point the agent achieves. (You should draw a picture at this point in order for this to make sense.) That normal vector’s components will all be nonnegative (because pareto surface), and the vector is defined only up to normalization, so we can interpret that normal vector as a probability distribution. That also makes sense intuitively: larger components of that vector (i.e. higher probabilities) indicate that the agent is “optimizing relatively harder” for utility in those worlds. This says nothing at all about how the agent will update, and we’d need another couple sentences to argue that the agent maximizes *expected* utility under the distribution, but it does give the prototypical mental picture behind the “pareto-optimal → probabilities” idea.

The most fundamental and general problem with pareto-optimality-based claims is that “pareto-suboptimal” implies that we already had a set of quantitative objectives in mind (or in some cases a “measuring stick of utility”, like e.g. money). But then some people will say “ok, but what if a powerful agent just isn’t pareto-optimal with respect to any resources at all, for instance because it just produces craptons of resources and then uses them inefficiently?”.
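The normal-vector picture above can be checked with a small numerical example (my own toy numbers): pick a nonnegative normal direction, find the frontier strategy it supports, and normalize the normal vector into a distribution over worlds.

```python
# Numerical illustration of "pareto-optimal -> probabilities" (toy example).
import numpy as np

# Utility achieved in (world 1, world 2) for each available strategy.
strategies = np.array([
    [10.0, 0.0],
    [8.0, 6.0],
    [5.0, 9.0],
    [0.0, 10.0],
    [4.0, 4.0],   # pareto-dominated by [8, 6]
])

# A pareto-optimal agent ends up at a frontier point: the point maximizing a
# weighted sum of the objectives, where the weight vector is (proportional to)
# the normal of the supporting hyperplane at that point.
normal = np.array([2.0, 1.0])            # some nonnegative normal direction
achieved = strategies[np.argmax(strategies @ normal)]
print(achieved)                           # [8. 6.]

# Normalize the normal vector: it behaves like a probability distribution,
# with more weight on the world the agent "optimizes harder" for.
p = normal / normal.sum()
print(p)                                  # [0.66666667 0.33333333]
```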

(Aside: “‘pareto-suboptimal’ implies we already had a set of quantitative objectives in mind” is also usually the answer to claims that all systems can be represented as expected utility maximizers. Sure, any system can be represented as an expected utility maximizer which is pareto-optimal with respect to some made-up objectives/resources which we picked specifically for this system. That does not mean all systems are pareto-optimal with respect to money, or energy, or other resources which we actually care about. Or, if using Utility Maximization = Description Length Minimization to ground out the quantitative objectives: not all systems are pareto-optimal with respect to optimization of some stuff far away in the world. That’s where the nontrivial content of most coherence theorems comes from: the quantitative objectives with respect to which the agent is pareto-optimal need to be things we care about for some reason.)

## Approximate Coherence

What if a powerful agent just isn’t pareto-optimal with respect to any resources or far-away optimization targets at all? Or: even if you do expect powerful agents to be approximately pareto-optimal, presumably they will be *approximately* pareto-optimal, not *exactly* pareto-optimal. What can we say about coherence then?

To date, I know of no theorems saying anything at all about approximate coherence. That said, this looks like more a case of “nobody’s done the legwork yet” rather than “people tried and failed”. It’s on my todo list.

My guess is that there’s a way to come at the problem with a thermodynamics-esque flavor, which would yield *global* bounds, for instance of roughly the form “in order for the system to apply n bits of optimization more than it could achieve with outputs independent of its inputs, it must observe at least m bits and approximate coherence to within m-n bits” (though to be clear I don’t yet know the right ways to operationalize all the parts of that sentence). The simplest version of a theorem of that form doesn’t work, but David and I have played with some variations and have some promising threads.

I remember reading the EJT post and left some comments there. The basic conclusions I arrived at are:

The transitivity property *is* actually important and necessary; one can construct money-pump-like situations if it isn’t satisfied. See this comment.

If we keep transitivity, but not completeness, and follow a strategy of not making choices inconsistent with our previous choices, as EJT suggests, then we no longer have a single consistent utility function. However, it looks like the behaviour can still be roughly described as “picking a utility function at random, and then acting according to *that* utility function”. See this comment.

In my current thinking about non-coherent agents, the main toy example I like to think about is the agent that maximizes some combination of the entropy of its actions and their expected utility, i.e. the probability of taking an action a is proportional to exp(βE[U|a]). By tuning β we can affect whether the agent cares more about entropy or utility. This closely resembles RLHF-finetuned language models: they’re trained both to achieve a high rating and to not have too great a relative entropy with respect to the prior implied by pretraining.
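A minimal sketch of that toy agent (my own implementation; the utilities and β values are arbitrary illustrations):

```python
# Entropy-regularized agent: sample action a with probability
# proportional to exp(beta * E[U | a]).
import numpy as np

def policy(expected_utilities, beta):
    """Softmax over expected utilities; beta trades off utility against entropy."""
    logits = beta * np.asarray(expected_utilities, dtype=float)
    logits -= logits.max()               # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()                   # normalize into a distribution

eu = [1.0, 2.0, 3.0]

print(policy(eu, beta=0.0))   # uniform: pure entropy maximization
print(policy(eu, beta=5.0))   # concentrates on the highest-utility action
```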

Note that if the distribution of utility under the prior is heavy-tailed, you can get infinite utility even with arbitrarily low relative entropy, so the optimal policy is undefined. In the case of goal misspecification, optimization with a KL penalty may be unsafe or get no better utility than the prior.
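The heavy-tail point can be illustrated numerically. In this sketch (my own construction, not from any of the posts), outcome i has utility i and prior probability proportional to 1/i²; moving a shrinking mass ε = 1/√M onto a far-tail outcome around index M drives the utility gain up without bound while the KL divergence from the prior shrinks toward zero:

```python
# Heavy-tailed utilities under the prior: arbitrarily large utility gain
# at arbitrarily small relative entropy.
import numpy as np

N = 10**6
u = np.arange(1, N + 1, dtype=float)   # utility of outcome i is i
p = 1.0 / u**2
p /= p.sum()                           # heavy-tailed prior over outcomes

def kl(q, p):
    """KL(q || p) in nats; q and p are strictly positive here."""
    return float(np.sum(q * np.log(q / p)))

results = []
for M in [10**2, 10**4, 10**6 - 1]:
    eps = 1.0 / np.sqrt(M)
    q = (1 - eps) * p
    q[M] += eps                        # shift mass eps onto a huge-utility outcome
    results.append((kl(q, p), float(q @ u - p @ u)))
    print(f"M={M:>7}  KL={results[-1][0]:.3f} nats  utility gain={results[-1][1]:.1f}")
# As M grows, KL falls while the utility gain keeps climbing.
```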

I’m coming to this two weeks late, but here are my thoughts.

The question of interest is:

Will sufficiently-advanced artificial agents be representable as maximizing expected utility?

Rephrased:

Will sufficiently-advanced artificial agents satisfy the VNM axioms (Completeness, Transitivity, Independence, and Continuity)?

Coherence arguments purport to establish that the answer is yes. These arguments go like this:

There exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.

Sufficiently-advanced artificial agents will not pursue dominated strategies.

So, sufficiently-advanced artificial agents will be representable as maximizing expected utility.

These arguments don’t work, because premise 1 is false: there are no theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. In the year since I published my post, no one has disputed that.

Now to address two prominent responses:

‘I define ‘coherence theorems’ differently.’

In the post, I used the term ‘coherence theorems’ to refer to ‘theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ I took that to be the usual definition on LessWrong (see the Appendix for why), but some people replied that they meant something different by ‘coherence theorems’: e.g. ‘theorems that are relevant to the question of agent coherence.’

All well and good. If you use that definition, then there are coherence theorems. But if you use that definition, then coherence theorems can’t play the role that they’re supposed to play in coherence arguments. Premise 1 of the coherence argument is still false. That’s the important point.

‘The mistake is benign.’

This is a crude summary of Rohin’s response. Rohin and I agree that the Complete Class Theorem implies the following: ‘If an agent has complete and transitive preferences, then unless the agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ So the mistake is neglecting to say ‘If an agent has complete and transitive preferences…’. Rohin thinks this mistake is benign.

I don’t think the mistake is benign. As my rephrasing of the question of interest above makes clear, Completeness and Transitivity are a major part of what coherence arguments aim to establish! So it’s crucial to note that the Complete Class Theorem gives us no reason to think that sufficiently-advanced artificial agents will have complete or transitive preferences, especially since:

Completeness doesn’t come for free.

Money-pump arguments for Completeness (applied to artificial agents) aren’t convincing.

Money-pump arguments for Transitivity assume Completeness.

Training agents to violate Completeness might keep them shutdownable.

Two important points

Here are two important points, which I make to preclude misreadings of the post:

Future artificial agents—trained in a standard way—might still be representable as maximizing expected utility.

Coherence arguments don’t work, but there might well be other reasons to think that future artificial agents—trained in a standard way—will be representable as maximizing expected utility.

Artificial agents not representable as maximizing expected utility can still be dangerous.

So why does the post matter?

The post matters because ‘train artificial agents to have incomplete preferences’ looks promising as a way of ensuring that these agents allow us to shut them down.

AI safety researchers haven’t previously considered incomplete preferences as a solution, plausibly because these researchers accepted coherence arguments and so thought that agents with incomplete preferences were a non-starter.

But coherence arguments don’t work, so training agents to have incomplete preferences is back on the table as a strategy for reducing risks from AI. And (I think) it looks like a pretty good strategy. I make the case for it in this post, and my coauthors and I will soon be posting some experimental results suggesting that the strategy is promising.

As I wrote elsewhere:

I find the money pump argument for completeness to be convincing.

The rule that you provide as a counterexample (the Caprice rule) is one that gradually completes the preferences of the agent as it encounters a variety of decisions. You appear to agree that this is the case. This isn’t a large problem for your argument. The big problem is that when there are lots of random nodes in the decision tree, such that the agent *might* encounter a wide variety of potentially money-pumping trades, the agent needs to complete its preferences in advance, or risk its strategy being dominated.

You argue with John about this here, and John appears to have dropped the argument. It looks to me like your argument there is wrong, at least when it comes to situations where there are sufficient assumptions to talk about coherence (which is when the preferences are over final outcomes, rather than trajectories).

I take the ‘lots of random nodes’ possibility to be addressed by this point:

Can you explain why you think that doesn’t work?

To elaborate a little more, introducing random nodes allows for the possibility that the agent ends up with some outcome that they disprefer to the outcome that they would have gotten (as a matter of fact, unbeknownst to the agent) by making different choices. But that’s equally true of agents with complete preferences.

I intended for my link to point to the comment you linked to, oops.

I’ve responded here; I think it’s better to just keep one thread of argument, in a place where there is more necessary context.

I guess I just don’t see it as a weak point in the doom argument that goal-orientedness is a convergent attractor in the space of self-modifying intelligences?

It feels similar to pondering the familiar claim of evolution, that systems that copy themselves and seize resources are an attractor state. Sure it’s not 100% proven but it seems pretty solid.

This is kind of baffling to read, particularly in light of the statement by Eliezer that I quoted at the very beginning of my post.

If the argument is (and indeed it is) that “many superficially appealing solutions like corrigibility, moral uncertainty etc are in general contrary to the structure of things that are good at optimization,” and the way we see this is by doing homework exercises within an expected utility framework, and the reason why we must choose an EU framework (“certain structures of cognition are the parts of the agent that are good at stuff and do the work, rather than them being this particular formal thing that they learned for manipulating meaningless numbers as opposed to real-world apples”) is *because* agents which don’t maximize expected utility are always exploitable, then it seems quite straightforward that *if it isn’t true that these agents are exploitable, then the entire argument collapses*.

Of course it doesn’t mean the conclusion is now wrong, but you need some other reason for reaching that conclusion than the typical money pumps and Dutch books that were being offered up as justifications.

This also requires a citation, or at the very least some reasoning; I’m not aware of any theorems that show goal-orientedness is a convergent attractor, but I’d be happy to learn more.

If the reason why you think this is true is because of intuitions about what powerful cognition must be like, but the *source* of those intuitions was the set of coherence arguments that are being discussed in this question post, then learning that the coherence arguments do not extend as far as they were purported to should cause you to rethink those intuitions and the conclusions you had previously reached on their basis, as they are now tainted by that confusion.

Sure, it seems solid, and it also seems plausible that formalizing this should be straightforward for an expert in the domain. I’m not sure why this is a good analogy to the topic of agentic behavior and cognition.

Ok here’s my reasoning:

When an agent is goal-oriented, they want to become more goal-oriented, and maximize the goal-orientedness of the universe with respect to their own goal. So if we diagram the evolution of the universe’s goal-orientedness, it has the shape of an attractor.

There are plenty of entry paths where some intelligence-improving process spits out a goal-oriented general intelligence (like biological evolution did), but no exit path where a universe whose smartest agent is super goal-oriented ever leads to that no longer being the case.

Because expected value tells us that the more resources you control, the more robustly you can maximize your probability of success in the face of whatever may come at you, and the higher your maximum possible utility is (if you have a utility function without an easy-to-hit max score).

“Maximizing goal-orientedness of the universe” was how I phrased the prediction that conquering resources involves having them aligned to your goal / aligned agents helping you control them.