# johnswentworth comments on Coherent decisions imply consistent utilities

• ## Things To Take Away From The Essay

First and foremost: Yudkowsky makes absolutely no mention whatsoever of the VNM utility theorem. This is neither an oversight nor a simplification. The VNM utility theorem is not the primary coherence theorem. It’s debatable whether it should be considered a coherence theorem at all.

Far and away the most common mistake when arguing about coherence (at least among a technically-educated audience) is for people who’ve only heard of VNM to think they know what the debate is about. Looking at the top-voted comments on this essay:

• the first links to a post which argues against VNM on the basis that it assumes probabilities and preferences are already in the model

• the second argues that two of the VNM axioms are unrealistic

I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.

So what are the primary coherence theorems, and how do they differ from VNM? Yudkowsky mentions the complete class theorem in the post, Savage’s theorem comes up in the comments, and there are variations on these two and probably others as well. Roughly, the general claim these theorems make is that any system either (a) acts like an expected utility maximizer under some probabilistic model, or (b) throws away resources in a pareto-suboptimal manner. One thing to emphasize: these theorems generally do not assume any pre-existing probabilities (as VNM does); an agent’s implied probabilities are instead derived. Yudkowsky’s essay does a good job communicating these concepts, but doesn’t emphasize that this is different from VNM.
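Branch (b) can be made concrete with the standard money-pump illustration. This is a minimal sketch, assuming a hypothetical agent with cyclic preferences over three items; the names `PREFERS_OVER` and `run_money_pump`, the fee, and the starting balance are all made up for illustration and do not come from the essay.

```python
from fractions import Fraction

# Hypothetical cyclic preferences: the agent strictly prefers the value to the key,
# so it will pay to swap "A" for "B", "B" for "C", and "C" for "A".
PREFERS_OVER = {"A": "B", "B": "C", "C": "A"}

def run_money_pump(start_item, money, fee, rounds):
    """Offer the agent its preferred swap `rounds` times, charging `fee` each time."""
    item = start_item
    for _ in range(rounds):
        item = PREFERS_OVER[item]  # the agent accepts, since it prefers the new item
        money -= fee               # and pays a small fee for each swap
    return item, money

item, money = run_money_pump("A", Fraction(10), Fraction(1, 100), rounds=3)
print(item, money)
```

After one full cycle the agent holds the item it started with and has strictly less money, which is the sense in which it "throws away resources". No probabilities are assumed anywhere in the setup.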

One more common misconception which this essay quietly addresses: the idea that every system can be interpreted as an expected utility maximizer. This is technically true, in the sense that we can always pick a utility function which is maximized under whatever outcome actually occurred. And yet… Yudkowsky gives multiple examples in which the system is not a utility maximizer. What’s going on here?

The coherence theorems implicitly put some stronger constraints on how we’re allowed to “interpret” systems as utility maximizers. They assume the existence of some resources, and talk about systems which are pareto-optimal with respect to those resources, e.g. systems which “don’t throw away money”. Implicitly, we’re assuming that the system generally “wants” more resources, and we derive the system’s “preferences” over everything else (including things which are not resources) from that. The agent “prefers” X over Y if it expends resources to get from Y to X. If the agent reaches a world-state which it could have reached with strictly less resource expenditure in all possible worlds, then it’s not an expected utility maximizer: it “threw away money” unnecessarily.

(Side note: as in Yudkowsky’s hospital-administrator example, we need not assume that the agent “wants” more resources as a terminal goal; the agent may only want more resources in order to exchange them for something else. The theorems still basically work, so long as resources can be spent for something the agent “wants”.)
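The contrast between the trivial interpretation and the resource-based one can be sketched as follows. This is an illustrative toy, not a statement of any particular theorem: the world names, states, and costs are hypothetical, and `is_dominated` encodes only the "strictly less resource expenditure in all possible worlds" condition described above.

```python
# Part 1: the "trivial" interpretation. Given whatever outcome actually occurred,
# define a utility function that is maximized by exactly that outcome.
def trivial_utility(observed_outcome):
    return lambda outcome: 1 if outcome == observed_outcome else 0

# Part 2: the resource-based check. A policy maps each possible world to
# (final_state, resources_spent). Both policies must cover the same worlds.
def is_dominated(policy, alternative):
    """True if `alternative` reaches the same final state in every world
    while spending strictly less in every world."""
    return all(
        alternative[world][0] == final_state and alternative[world][1] < spent
        for world, (final_state, spent) in policy.items()
    )

# Hypothetical policies: same destinations, different costs.
wasteful = {"rain": ("home", 5), "sun": ("home", 7)}
frugal = {"rain": ("home", 3), "sun": ("home", 4)}

u = trivial_utility(wasteful["rain"])
print(u(wasteful["rain"]), u(frugal["rain"]))  # the trivial utility "rationalizes" the wasteful outcome
print(is_dominated(wasteful, frugal))          # but the wasteful policy fails the resource check
print(is_dominated(frugal, wasteful))
```

The trivial utility function can always be constructed after the fact, so it rules nothing out. The dominance check can fail, which is what gives the resource-based notion its content.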

Of course, we can very often find things which work like “resources” for purposes of the theorems even when they’re not baked into the problem. For instance, in thermodynamics, energy and momentum work like resources, and we could use the coherence theorems to talk about systems which don’t throw away energy and/or momentum in a pareto-suboptimal manner. Biological cells are a good example: presumably they make efficient use of energy, as well as other metabolic resources, so we should expect the coherence theorems to apply.

## Some Problems With (Known) Coherence Theorems

Financial markets are the ur-example of inexploitability and pareto efficiency (in the same sense as the coherence theorems). They generally do not throw away resources in a pareto-suboptimal manner, and this can be proven for idealized mathematical markets. And yet, it turns out that even an idealized market is not equivalent to an expected utility maximizer, in general. (Economists call this “nonexistence of a representative agent”.) That’s a pretty big red flag.

The problem, in this case, is that the coherence theorems implicitly assume that the system has no internal state (or at least no relevant internal state). Once we allow internal state, subagents matter—see the essay “Why Subagents?” for more on that.
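One way to picture the role of internal state is a toy "committee" of subagents. This sketch is an assumption made for illustration, not a reconstruction of the linked essay or of the representative-agent result: the utilities, the unanimity rule in `accepts`, and the item names are all hypothetical.

```python
from itertools import permutations

SUBAGENT_UTILITIES = [
    {"A": 0, "B": 1, "D": 2},  # hypothetical subagent 1
    {"A": 1, "B": 0, "D": 2},  # hypothetical subagent 2
]

def accepts(current, offered):
    """Committee rule (an assumption): trade only if no subagent is worse off
    and at least one is strictly better off."""
    gains = [u[offered] - u[current] for u in SUBAGENT_UTILITIES]
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)

def final_holding(start, offers):
    """Walk through offered items in order, swapping whenever the committee accepts."""
    holding = start
    for offered in offers:
        if accepts(holding, offered):
            holding = offered
    return holding

def has_exploitable_cycle(items):
    """Brute-force search for a cycle of trades the committee accepts at every step."""
    for n in range(2, len(items) + 1):
        for cycle in permutations(items, n):
            steps = zip(cycle, cycle[1:] + cycle[:1])
            if all(accepts(a, b) for a, b in steps):
                return True
    return False

print(final_holding("A", ["A", "B"]))            # same offers, different starting state
print(final_holding("B", ["A", "B"]))
print(has_exploitable_cycle(["A", "B", "D"]))
```

In this toy, no cycle of accepted trades exists, so the committee cannot be money-pumped, yet where it ends up depends on what it was already holding. The current holding plays the role of internal state here. The sketch does not prove anything about markets in general; it only shows that inexploitability and path-dependence can coexist.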

Another pretty big red flag: real systems can sometimes “change their mind” for no outwardly-apparent reason, yet still be pareto efficient. A good example here is a bookie with a side channel: when the bookie gets new information, the odds update, even though “from the outside” there’s no apparent reason why the odds are changing—the outside environment doesn’t have access to the side channel. The coherence theorems discussed here don’t handle such side channels. Abram has talked about more general versions of this issue (including logical uncertainty connections) in his essays on Radical Probabilism.

An even more general issue, which Abram also discusses in his Radical Probabilism essays: while the coherence theorems make a decent argument for probabilistic beliefs and expected utility maximization at any one point in time, the coherence arguments for how to update are much weaker than the other arguments. Yudkowsky talks about conditional probability in terms of conditional bets—i.e. bets which only pay out when a condition triggers. That’s fine, and the coherence arguments work for that use-case. The problem is, it’s not clear that an agent’s belief-update when new information comes in must be equivalent to these conditional bets.
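For reference, the conditional-bet argument can be written out in a few lines. This is the standard Dutch-book derivation rather than a quotation from the essay, and the symbols $q$, $A$ and $B$ are notation introduced here. Suppose a conditional bet on $A$ given $B$ costs $q$, pays $1$ if $A \wedge B$ holds, pays $0$ if $\neg A \wedge B$ holds, and refunds the price $q$ if $\neg B$ holds. In every world it pays the same as a portfolio of two unconditional bets: one paying $1$ on $A \wedge B$ and one paying $q$ on $\neg B$. To avoid a sure loss, the two must have the same price:

$$
q = P(A \wedge B) + q\,\bigl(1 - P(B)\bigr)
\quad\Longrightarrow\quad
q = \frac{P(A \wedge B)}{P(B)}, \qquad P(B) > 0 .
$$

This pins down the price of a bet agreed *before* learning whether $B$ holds. It does not, by itself, say what the agent's unconditional price for $A$ must be *after* observing $B$, which is the gap described above.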

Finally, there’s the assumption that “resources” exist, and that we can use trade-offs with those resources in order to work out implied preferences over everything else. I think instrumental convergence provides a strong argument that this will be the case, at least for the sorts of “agents” we actually care about (i.e. agents which have significant impact on the world). However, that’s not an argument which is baked into the coherence theorems themselves, and there are some highly nontrivial steps to make the argument.

## Side-Note: Probability Without Utility

At this point, it’s worth noting that there are foundations for probability which do not involve utility or decision theory at all, and I consider these foundations much stronger than the coherence theorems. Frequentism is the obvious example. Another prominent example is information theory and the minimum description length foundation of probability theory.

The most fundamental foundation I know of is Cox’s theorem, which is more of a meta-foundation explaining why the same laws of probability drop out of so many different assumptions (e.g. frequencies, bets, minimum description length, etc.).

However, these foundations do not say anything at all about agents or utilities or expected utility maximization. They only talk about probabilities.

## Towards A Better Coherence Theorem

As I see it, the real justification for expected utility maximization is not any particular coherence theorem, but rather the fact that there’s a wide variety of coherence theorems (and some other kinds of theorems, and empirical results) which all seem to point in a similar direction. When that sort of thing happens, it’s a pretty strong clue that there’s something fundamental going on. I think the “real” coherence theorem has yet to be discovered.

What features would such a theorem have?

Following the “Why Subagents?” argument, it would probably prove that a system is equivalent to a market of expected utility maximizers rather than a single expected utility maximizer. It would handle side-channels. It would derive the notion of an “update” on incoming information.

As a starting point in searching for such a theorem, probably the most important hint is that “resources” should be a derived notion rather than a fundamental one. My current best guess at a sketch: the agent should make decisions within multiple loosely-coupled contexts, with all the coupling via some low-dimensional summary information—and that summary information would be the “resources”. (This is exactly the kind of setup which leads to instrumental convergence.) By making pareto-resource-efficient decisions in one context, the agent would leave itself maximum freedom in the other contexts. In some sense, the ultimate “resource” is the agent’s action space. Then, resource trade-offs implicitly tell us how the agent is trading off its degree of control within each context, which we can interpret as something-like-utility.

• the first links to a post which argues against VNM on the basis that it assumes probabilities and preferences are already in the model

I assume this is my comment + post; I’m not entirely sure what you mean here. Perhaps you mean that I’m not modeling the world as having “external” probabilities that the agent has to handle; I agree that is true, but that is because in the use case I’m imagining (looking at the behavior of an AI system and determining what it is optimizing) you don’t get these “external” probabilities.

I expect that if these two commenters read the full essay, and think carefully about how the theorems Yudkowsky is discussing differ from VNM, then their objections will look very different.

I assure you I read this full post (well, the Arbital version of it) and thought carefully about it before making my post; my objections remain. I discussed VNM specifically because that’s the best-understood coherence theorem and the one that I see misused in AI alignment most often. (That being said, I don’t know the formal statements of other coherence theorems, though I predict with ~98% confidence that any specific theorem you point me to would not change my objection.)

Yes, if you add in some additional detail about resources, assume that you do not have preferences over how those resources are used, and assume that there are preferences over other things that can be affected using resources, then coherence theorems tell you something about how such agents act. This doesn’t seem all that relevant to the specific, narrow setting which I was considering.

I agree that coherence arguments (including VNM) can be useful, for example by:

• Helping people make better decisions (e.g. becoming more comfortable with taking risk)

• Reasoning about what AI systems would do, given stronger assumptions than the ones I used (e.g. if you assume there are “resources” that the AI system has no preferences over).

Nonetheless, within AI alignment, prior to my post I heard the VNM argument being misused all the time (by rationalists / LWers, less so by others); this has gone down since then but still happens.

I think e.g. this talk is sneaking in the “resources” assumption, without arguing for it or acknowledging its existence, and this often misleads people (including me) into thinking that AI risk is implied by math based on very simple axioms that are hard to disagree with.

----

On the review: I don’t think this post should be in the Alignment section of the review, without a significant rewrite / addition clarifying why exactly coherence arguments are useful or important for AI alignment. As such I will vote against it.

I would however support it being part of the non-Alignment section of the review; as I’ve said before, I generally really like coherence arguments and they influence my own decision-making a lot (in fact, a big part of the reason I started working in AI alignment was thinking about the Arrhenius paradox, which has a very similar coherence flavor).

• I assume this is my comment + post

I was referring mainly to Richard’s post here. You do seem to understand the issue of assuming (rather than deriving) probabilities.

I discussed VNM specifically because that’s the best-understood coherence theorem and the one that I see misused in AI alignment most often.

This I certainly agree with.

I don’t know the formal statements of other coherence theorems, though I predict with ~98% confidence that any specific theorem you point me to would not change my objection.

Exactly which objection are you talking about here?

If it’s something like “coherence theorems do not say that tool AI is not a thing”, that seems true. Even today humans have plenty of useful tools with some amount of information processing in them which are probably not usefully model-able as expected utility maximizers.

But then you also make claims like “all behavior can be rationalized as EU maximization”, which is wildly misleading. Given a system, the coherence theorems map a notion of resources/​efficiency/​outcomes to a notion of EU maximization. Sure, we can model any system as an EU maximizer this way, but only if we use a trivial/​uninteresting notion of resources/​efficiency/​outcomes. For instance, as you noted, it’s not very interesting when “outcomes” refers to “universe-histories”. (Also, the “preferences over universe-histories” argument doesn’t work as well when we specify the full counterfactual behavior of a system, which is something we can do quite well in practice.)

Combining these points: your argument largely seems to be “coherence arguments apply to any arbitrary system, therefore they don’t tell us interesting things about which systems are/​aren’t <agenty/​dangerous/​etc>”. (That summary isn’t exactly meant to pass an ITT, but please complain if it’s way off the mark.) My argument is that coherence theorems do not apply nontrivially to any arbitrary system, so they could still potentially tell us interesting things about which systems are/​aren’t <agenty/​dangerous/​etc>. There may be good arguments for why coherence theorems are the wrong way to think about goal-directedness, but “everything can be viewed as EU maximization” is not one of them.

Yes, if you add in some additional detail about resources, assume that you do not have preferences over how those resources are used, and assume that there are preferences over other things that can be affected using resources, then coherence theorems tell you something about how such agents act. This doesn’t seem all that relevant to the specific, narrow setting which I was considering.

Just how narrow a setting are you considering here? Limited resources are everywhere. Even an e-coli needs to efficiently use limited resources. Indeed, I expect coherence theorems to say nontrivial things about an e-coli swimming around in search of food (and this includes the possibility that the nontrivial things the theorem says could turn out to be empirically wrong, which in turn would tell us nontrivial things about e-coli and/​or selection pressures, and possibly point to better coherence theorems).

• Exactly which objection are you talking about here?

If it’s something like “coherence theorems do not say that tool AI is not a thing”, that seems true.

Yes, I think that is basically the main thing I’m claiming.

But then you also make claims like “all behavior can be rationalized as EU maximization”, which is wildly misleading.

I tried to be clear that my argument was “you need more assumptions beyond just coherence arguments on universe-histories; if you have literally no other assumptions then all behavior can be rationalized as EU maximization”. I think the phrase “all behavior can be rationalized as EU maximization” or something like it was basically necessary to get across the argument that I was making. I agree that taken in isolation it is misleading; I don’t really see what I could have done differently to prevent there from being something that in isolation was misleading, while still being able to point out the-thing-that-I-believe-is-fallacious. Nuance is hard.

(Also, it should be noted that you are not in the intended audience for that post; I expect that to you the point feels obvious enough so as not to be worth stating, and so overall it feels like I’m just being misleading. If everyone were similar to you I would not have bothered to write that post.)

Also, the “preferences over universe-histories” argument doesn’t work as well when we specify the full counterfactual behavior of a system, which is something we can do quite well in practice.

I agree that if you have counterfactual behavior, EU maximization is not vacuous. I don’t think that this meaningfully changes the upshot (which is “coherence arguments, by themselves without any other assumptions on the structure of the world or the space of utility functions, do not imply AI risk”). It might meaningfully change the title of the post (perhaps they do imply goal-directed behavior in some sense), though in that case I’d change the title to “Coherence arguments do not imply AI risk” and I think it’s effectively the same post.

Mostly though, I’m wondering how exactly you use counterfactual behavior in an argument for AI risk. Like, the argument I was arguing against is extremely abstract, and just claims that the AI is “intelligent” /​ “coherent”. How do you use that to get counterfactual behavior for the AI system?

I agree that for any given AI system, we could probably gain a bunch of knowledge about its counterfactual behavior, and then reason about how coherent it is and how goal-directed it is. But this is a fundamentally different thing than the thing I was talking about (which is just: can we abstractly argue for AI risk without talking about details of the system beyond “it is intelligent”?)

My argument is that coherence theorems do not apply nontrivially to any arbitrary system, so they could still potentially tell us interesting things about which systems are/​aren’t <agenty/​dangerous/​etc>.

I agree with this.

There may be good arguments for why coherence theorems are the wrong way to think about goal-directedness, but “everything can be viewed as EU maximization” is not one of them.

I actually also agree with this, and was not trying to argue that coherence arguments are irrelevant to “goal-directedness” or “being a good agent”—I’ve already mentioned that I personally do things differently thanks to my knowledge of coherence arguments.

Just how narrow a setting are you considering here? Limited resources are everywhere. Even an e-coli needs to efficiently use limited resources. Indeed, I expect coherence theorems to say nontrivial things about an e-coli swimming around in search of food (and this includes the possibility that the nontrivial things the theorem says could turn out to be empirically wrong, which in turn would tell us nontrivial things about e-coli and/​or selection pressures, and possibly point to better coherence theorems).

I agree that if you take any particular system and try to make predictions, the necessary assumptions (such as “what counts as a limited resource”) will often be easy and obvious and the coherence theorems do have content in such situations. It’s the abstract argument that feels flawed to me.

I somewhat expect your response will be “why would anyone be applying coherence arguments in such a ridiculously abstract way rather than studying a concrete system”, to which I would say that you are not in the intended audience.

----

Fwiw thinking this through has made me feel better about including it in the Alignment book than I did before, though I’m still overall opposed. (I do still think it is a good fit for other books.)

• I somewhat expect your response will be “why would anyone be applying coherence arguments in such a ridiculously abstract way rather than studying a concrete system”, to which I would say that you are not in the intended audience.

Ok, this is a fair answer. I think you and I, at least, are basically aligned here.

I do think a lot of people took away from your post something like “all behavior can be rationalized as EU maximization”, and in particular I think a lot of people walked away with the impression that usefully applying coherence arguments to systems in our particular universe is much more rare/​difficult than it actually is. But I can’t fault you much for some of your readers not paying sufficiently close attention, especially when my review at the top of this thread is largely me complaining about how people missed nuances in this post.

• (Once again, great use of that link)

• On the review: I don’t think this post should be in the Alignment section of the review, without a significant rewrite /​ addition clarifying why exactly coherence arguments are useful or important for AI alignment.

Assuming that one accepts the arguments against coherence arguments being important for alignment (as I tentatively do), I don’t see why that means this shouldn’t be included in the Alignment section.

The motivation for this post was its relevance to alignment. People think about it in the context of alignment. If subsequent arguments indicate that it’s misguided, I don’t see why that means it shouldn’t be considered (from a historical perspective) to have been in the alignment stream of work (along with the arguments against it).

(Though, I suppose if there’s another category that seems like a more exact match, that seems like a fine reason to put it in that section rather than the Alignment section.)

Does that make sense? Is your concern that people will see this in the Alignment section, and not see the arguments against the connection, and continue to be misled?

• I actually think it shouldn’t be in the alignment section, though for different reasons than Rohin. There are lots of things which can be applied to AI but are a lot more general, and I think it’s usually better to separate the “here’s the general idea” presentation from the “here’s how it applies to AI” presentation. That way, people working on other interesting things can come along and notice the idea and try to apply it in their own area rather than getting scared off by the label.

For instance, I think there are probably gains to be had from applying coherence theorems to biological systems. I would love it if some rationalist biologist came along, read Yudkowsky’s post, and said “wait a minute, cells need to make efficient use of energy/limited molecules/etc, can I apply that?”. That sort of thing becomes less likely if this sort of post is hiding in “the alignment section”.

Zooming out further… today, alignment is the only technical research area with a lot of discussion on LW, and I think it would be a near-pareto improvement if more such fields were drawn in. Taking things which are alignment-relevant-but-not-just-alignment and lumping them all under the alignment heading makes that less likely.

• That makes a lot of sense to me. Good points!

• It seems weird to include a post in the book if we believe that it is misguided, just because people historically believed it. If I were making this book, I would not include such posts; I’d want an “LW Review” to focus on things that are true and useful, rather than historically interesting.

That being said, I haven’t thought much about the goals of the book, and if we want to include posts for the sake of history, then sure, include the post. That was just not my impression about the goal.

Is your concern that people will see this in the Alignment section, and not see the arguments against the connection, and continue to be misled?

I would have this concern, yes, but I’m happy to defer (in the sense of “not pushing”, rather than the sense of “adopting their beliefs as my own”) to the opinions of the people who have thought way more than me about the purpose of this review and the book, and have caused it to happen. If they are interested in including historically important essays that we now think are misguided, I wouldn’t object. I predict that they would prefer not to include such essays but of course I could be wrong about that.

• I like this comment, but I feel sort of confused about it as a review instead of an elaboration. Yes, coherence theorems are very important, but did people get it from this post? To the extent that comments are evidence, they look like no, the post didn’t quite make it clear to them what exactly is going on here.

• No need to think about editing at this point, we’ll sort out all editing issues after the review. (And for this specific issue, all hyperlinks in the books have been turned into readable footnotes, which works out just fine in the vast majority of cases.)