A summary of Savage’s foundations for probability and utility.

Sniffnoy22 May 2011 19:56 UTC

85 points

Edit: I think the P2c I wrote originally may have been a bit too weak; fixed that. Nevermind, rechecking, that wasn’t needed.

More edits (now consolidated): Edited nontriviality note. Edited totality note. Added in the definition of numerical probability in terms of qualitative probability (though not the proof that it works). Also slight clarifications on implications of P6′ and P6‴ on partitions into equivalent and almost-equivalent parts, respectively.

One very late edit, June 2: Even though we don’t get countable additivity, we still want a σ-algebra rather than just an algebra (this is needed for some of the proofs in the “partition conditions” section that I don’t go into here). Also noted nonemptiness of gambles.

The idea that rational agents act in a manner isomorphic to expected-utility maximizers is often used here, typically justified with the Von Neumann-Morgenstern theorem. (The last of Von Neumann and Morgenstern’s axioms, the independence axiom, can be grounded in a Dutch book argument.) But the Von Neumann-Morgenstern theorem assumes that the agent already measures its beliefs with (finitely additive) probabilities. This in turn is often justified with Cox’s theorem (valid so long as we assume a “large world”, which is implied by e.g. the existence of a fair coin). But Cox’s theorem assumes as an axiom that the plausibility of a statement is taken to be a real number, a very large assumption! I have also seen this justified here with Dutch book arguments, but these all seem to assume that we are already using some notion of expected utility maximization (which is not only somewhat circular, but also a considerably stronger assumption than that plausibilities are measured with real numbers).

There is a way of grounding both (finitely additive) probability and utility simultaneously, however, as detailed by Leonard Savage in his Foundations of Statistics (1954). In this article I will state the axioms and definitions he gives, give a summary of their logical structure, and suggest a slight modification (which is equivalent mathematically but slightly more philosophically satisfying). I would also like to ask the question: To what extent can these axioms be grounded in Dutch book arguments or other more basic principles? I warn the reader that I have not worked through all the proofs myself and I suggest simply finding a copy of the book if you want more detail.

Peter Fishburn later showed in Utility Theory for Decision Making (1970) that the axioms set forth here actually imply that utility is bounded.

(Note: The versions of the axioms and definitions in the end papers are formulated slightly differently from the ones in the text of the book, and in the 1954 version have an error. I’ll be using the ones from the text, though in some cases I’ll reformulate them slightly.)

Primitive notions; preference given a set of states

We will use the following primitive notions. Firstly, there is a set S of “states of the world”; the exact current state of the world is unknown to the agent. Secondly, there is a set F of “consequences”—things that can happen as a result of the agent’s actions. Actions or acts will be interpreted as functions f:S→F, as two actions which have the same consequences regardless of the state of the world are indistinguishable and hence considered equal. While the agent may be uncertain as to the exact results of its actions, this can be folded into his uncertainty about the state of the world. Finally, we introduces as primitive a relation ≤ on the set of actions, interpreted as “is not preferred to”. I.e., f≤g means that given a choice between actions f and g, the agent will either prefer g or be indifferent. As usual, sets of states will be referred to as “events”, and for the usual reasons we may want to restrict the set of admissible events to a boolean σ-subalgebra of ℘(S), though I don’t know if that’s really necessary here (Savage doesn’t seem to do so, though he does discuss it some). [Edit nine years later: This actually introduces a slight issue I didn’t realize before, but fortunately it’s easily fixed.]

In any case, we then have the following axiom:

P1. The relation ≤ is a total preorder.

The intuition here for transitivity is pretty clear. For totality, if the agent is presented with a choice of two acts, it must choose one of them! Or be indifferent. Perhaps we could instead use a partial preorder (or order?), though this would give us two different indistinguishable flavors of indifference, which seems problematic. But this could be useful if we wanted intransitive indifference. So long as indifference is transitive, though, we can collapse this into a total preorder.

As usual we can then define f≥g, f<g (meaning “it is false that g≤f”), and g>f. I will use f≡g to mean “f≤g and g≤f”, i.e., the agent is indifferent between f and g. (Savage uses an equals sign with a dot over it.)

Note that though ≤ is defined in terms of how the agent chooses when presented with two options, Savage later notes that there is a construction of W. Allen Wallis that allows one to adduce the agent’s preference ordering among a finite set of more than two options (modulo indifference): Simply tell the agent to rank the options given, and that afterward, two of them will be chosen uniformly at random, and it will get whichever one it ranked higher.

The second axiom states that if two actions have the same consequences in some situation, just what that equal consequence is does not affect their relative ordering:

P2. Suppose f≤g, and B is a set of states such f and g agree on B. If f’ and g’ are another pair of acts which, outside of B, agree with f and g respectively, and on B, agree with each other, then f’≤g’.

In other words, to decide between two actions, only the cases where they actually have different consequences matter.

With this axiom, we can now define:

D1. We say “f≤g given B” to mean that if f’ and g’ are actions such that f’ agrees with f on B, g’ agrees with g on B, and f’ and g’ agree with each other outside of B, then f’≤g’.

Due to axiom P2, this is well-defined.

Here is where I would like to suggest a small modification to this setup. The notion of “f≤g given B” is implicitly taken to be how the agent makes decisions if it knows that B obtains. However it seems to me that we should actually take “f≤g given B”, rather than f≤g, to be the primitive notion, explicitly interpeted as “the agent does not prefer f to g if it knows that B obtains”. The agent always has some state of prior knowledge and this way we have explicitly specified decisions under a given state of knowledge—the acts we are concerned with—as the basis of our theory. Rather than defining f≤g given B in terms of ≤, we can define f≤g to mean “f≤g given S” and then add additional axioms governing the relation between “≤ given B” for varying B, which in Savage’s setup are theorems or part of the definition D1.

(Specifically, I would modify P1 and P2 to talk about “≤ given B” rather than ≤, and add the following theorems as axioms:

P2a. If f and g agree on B, then f≡g given B.

P2b. If B⊆C, f≤g given C, and f and g agree outside B, then f≤g given B.

P2c. If B and C are disjoint, and f≤g given B and given C, then f≤g given B∪C.

This is a little unwieldy and perhaps there is an easier way—these might not be minimal. But they do seem to be sufficient.)

In any case, regardless which way we do it, we’ve now established the notion of preference given that a set of states obtains, as well as preference without additional knowledge, so henceforth I’ll freely use both as Savage does without worrying about which makes a better foundation, since they are equivalent.

Ordering on preferences

The next definition is simply to note that we can sensibly talk about f≤b, b≤f, b≤c where here b and c are consequences rather than actions, simply by interpreting consequences as constant functions. (So the agent does have a preference ordering on consequences, it’s just induced from its ordering on actions. We do it this way since it’s its choices between actions we can actually see.)

However, the third axiom reifies this induced ordering somewhat, by demanding that it be invariant under gaining new information.

P3′. If b and c are consequences and b≤c, then b≤c given any B.

Thus the fact that the agent may change preferences given new information, just reflects its uncertainty about the results of their actions, rather than actually preferring different consequences in different states (any such preferences can be done away with by simply expanding the set of consequences).

Really this is not strong enough, but to state the actual P3 we will first need a definition:

D3. An event B is said to be null if f≤g given B for any actions f and g.

Null sets will correspond to sets of probability 0, once numerical probability is introduced. Probability here is to be adduced from the agent’s preferences, so we cannot distinguish between “the agent is certain that B will not happen” and “if B obtains, the agent doesn’t care what happens”.

Now we can state the actual P3:

P3. If b and c are consequences and B is not null, then b≤c given B if and only if b≤c.

P3′, by contrast, allowed some collapsing of preference on gaining new information; here we have disallowed that except in the case where the new information is enough to collapse all preferences entirely (a sort of “end of the world” or “fatal error” scenario).

Qualitative probability

We’ve introduced above the idea of “probability 0” (and hence implicitly probability 1; observe that “¬B is null” is equivalent to “for any f and g, f≤g given B if and only if f≤g”). Now we want to expand this to probability more generally. But we will not initially get numbers out of it; rather we will first just get another total preordering, A≤B, “A is at most as probable as B”.

How can we determine which of two events the agent thinks is more probable? Have it bet on them, of course! First, we need a nontriviality axiom so it has some things to bet on.

P5. There exist consequences b and c such that b>c.

(I don’t know what the results would be if instead we used the weaker nontriviality axiom “there exist actions f and g such that f<g”, i.e., “S is not null”. That we eventually get that expected utility for comparing all acts suggests that this should work, but I haven’t checked.)

So let us now consider a class of actions which I will call “wagers”. (Savage doesn’t have any special term for these.) Define “the wager on A for b over c” to mean the action that, on A, returns b, and otherwise, returns c. Denote this by $w_{A, b, c}$ . Then we postulate:

P4. Let b>b’ be a pair of consequences, and c>c’ another such pair. Then for any events A and B, $w_{A, b, b^{'}} \leq w_{B, b, b^{'}}$ if and only if $w_{A, c, c^{'}} \leq w_{B, c, c^{'}}$ .

That is to say, if the agent is given the choice between betting on event A and betting on event B, and the prize and booby prize are the same regardless of which it bets on, then it shouldn’t just matter just what the prize and booby prize are—it should just bet on whichever it thinks is more probable. Hence we can define:

D4. For events A and B, we say “A is at most as probable as B”, denoted A≤B, if $w_{A, b, b^{'}} \leq w_{B, b, b^{'}}$ , where b>b’ is a pair of consequences.

By P4, this is well-defined. We can then show that the relation on events ≤ is a total preorder, so we can use the usual notation when talking about it (again, ≡ will denote equivalence).

In fact, ≤ is not only a total preorder, but a qualitative probability:

≤ is a total preorder
∅≤A for any event A
∅<S
Given events B, C, and D with D disjoint from B and C, then B≤C if and only if B∪D≤C∪D.

(There is no condition corresponding to countable additivity; as mentioned above, we simply won’t get countable additivity out of this.) Note also that under this, A≡∅ if and only if A is null in the earlier sense. Also, we can define “A≤B given C” by comparing the wagers given C; this is equivalent to the condition that A∩C≤B∩C. This relation is too a qualitative probability.

Partition conditions and numerical probability

In order to get real numbers to appear, we are of course going to have to make some sort of Archimedean assumption. In this section I discuss what some of these look like and then ultimately state P6, the one Savage goes with.

First, definitions. We will be considering finitely-additive probability measures on the set of states, i.e. a function P from the set of events to the interval [0,1] such that P(S)=1, and for disjoint B and C, P(B∪C)=P(B)+P(C). We will say “P agrees with ≤” if for every A and B, A≤B if and only if P(A)≤P(B); and we will say “P almost agrees with ≤” if for every A and B, A≤B implies P(A)≤P(B). (I.e., in the latter case, numerical probability is allowed to collapse some distinctions between events that the agent might not actually be indifferent between.)

We’ll be considering here partitions of the set of states S. We’ll say a partition of S is “uniform” if the parts are all equivalent. More generally we’ll say it is “almost uniform” if, for any r, the union of any r parts is at most as probable as the union of any r+1 parts. (This is using ≤, remember; we don’t have numerical probabilities yet!) (Note that any uniform partition is almost uniform.) Then it turns out that the following are equivalent:

There exist almost-uniform partitions of S into arbitrarily large numbers of parts.
For any B>∅, there exists a partition of S with each part less probable than B.
There exists a (necessarily unique) finitely additive probability measure P that almost agrees with ≤, which has the property that for any B and any 0≤λ≤1, there is a C⊆B such that P(C)=λP(B).

(Definitely not going into the proof of this here. However, the actual definition of the numerical probability P(A) is not so complicated: Let k(A,n) denote the largest r such that there exists an almost-uniform partition of S into n parts, for which there is some union of r parts, C, such that C≤A. Then the sequence k(A,n)/n always converges, and we can define P(A) to be its limit.)

So we could use this as our 6th axiom:

P6‴. For any B>∅, there exists a partition of S with each part less probable than B.

Savage notes that other authors have assumed the stronger

P6″. There exist uniform partitions of S into arbitrarily large numbers of parts.

since there’s an obvious justification for this: the existence of a fair coin! If a fair coin exists, then we can generate a uniform partition of S into 2n parts simply by flipping it n times and considering the result. We’ll actually end up assuming something even stronger than this.

So P6‴ does get us numerical probabilities, but they don’t necessarily reflect all of the qualitative probability; P6‴ is only strong enough to force almost agreement. Though it is stronger than that when ∅ is involved—it does turn out that P(B)=0 if and only if B≡∅. (And hence also P(B)=1 if and only if B≡S.) But more generally it turns out that P(B)=P(C) if and only if B and C are “almost equivalent”, which I will denote B≈C (Savage uses a symbol I haven’t seen elsewhere), which is defined to mean that for any E>∅ disjoint from B, B∪E≥C, and for any E>∅ disjoint from C, C∪E≥B.

(It’s not obvious to me that ≈ is in general an equivalence relation, but it certainly is in the presence of P6‴; Savage seems to use this implicitly. Note also that another consequence of P6‴ is that for any n there exists a partition of S into n almost-equivalent parts; such a partition is necessarily almost-uniform.)

However the following stronger version of P6‴ gets rid of this distinction:

P6′. For any B>C, there exists a partition of S, each part D of which satisfies C∪D<B.

(Observe that P6‴ is just P6′ for C=∅.) Under P6′, almost equivalence is equivalence, and so numerical probability agrees with qualitative probability, and we finally have what we wanted. (So by earlier, P6′ implies P6″, not just P6‴. Indeed by above it implies the existence of uniform partitions into n parts for any n, not just arbitrarily large n.)

In actuality, Savage assumes an even stronger axiom, which is needed to get utility and not just probability:

P6. For any acts g<h, and any consequence b, there is a partition of S such that if g is modified on any one part to be constantly b there, we would still have g<h; and if h is modified on any one part to be constantly b there, we would also still have g<h.

Applying P6 to wagers yields the weaker P6′.

We can now also get conditional probability—if P6′ holds, it also holds for the preorderings “≤ given C” for non-null C, and hence we can define P(B|C) to be the probability of B under the quantitative probability we get corresponding to the qualitative probabilty “≤ given C”. Using the uniqueness of agreeing probability measures, it’s easy to check that indeed, P(B|C)=P(B∩C)/P(C).

Utility for finite gambles

Now that we have numerical probability, we can talk about finite gambles. If we have consequences $b_{1}$ , …, $b_{n}$ , and probabilities $λ_{1}$ , …, $λ_{n}$ summing to 1, we can consider the gamble $\sum λ_{i} b_{i}$ , represented by any action which yields $b_{1}$ with probability $λ_{1}$ , $b_{2}$ with probability $λ_{2}$ , etc. (And with probability 0 does anything; we don’t care about events with probability 0.) Note that by above such an action necessarily exists. It can be proven that any two actions representing the same gamble are equivalent, and hence we can talk about comparing gambles. We can also sensibly talk about mixing gambles—taking $\sum λ_{i} f_{i}$ where the $f_{i}$ are finite gambles, and the $λ_{i}$ are probabilities summing to 1 - in the obvious fashion.

With these definitions, it turns out that Von Neumann and Morgenstern’s independence condition holds, and, using axiom P6, Savage shows that the continuity (i.e. Archimedean) condition also holds, and hence there is indeed a utility function, a function U:F→R such that for any two finite gambles represented by f and g respectively, f≤g if and only if the expected utility of the first gamble is less than or equal to that of the second. Furthermore, any two such utility functions are related via an increasing affine transformation.

We can also take expected value knowing that a given event C obtains, since we have numerical probability; and indeed this agrees with the preference ordering on gambles given C.

Expected utility in general and boundedness of utility

Finally, Savage shows that if we assume one more axiom, P7, then we have that for any essentially bounded actions f and g, we have f≤g if and only if the expected utility of f is at most that of g. (It is possible to define integration with respect to a finitely additive measure similarly to how one does with respect to a countably additive measure; the result is linear and monotonic but doesn’t satisfy convergence properties.) Similarly with respect to a given event C.

The axiom P7 is:

P7. If f and g are acts and B is an event such that f≤g(s) given B for every s∈B, then f≤g given B. Similarly, if f(s)≤g given B for every s in B, then f≤g given B.

So this is just another variant on the “sure-thing principle” that I earlier labeled P2c.

Now in fact it turns out as mentioned above that P7, when taken together with the rest, implies that utility is bounded, and hence that we do indeed have that for any f and g, f≤g if and only if the expected utility of f is at most that of g! This is due to Peter Fishburn and postdates the first edition of Foundations of Statistics, so in there Savage simply notes that it would be nice if this worked for f and g not necessarily essentially bounded (so long as their expected values exist, and allowing them to be ±∞), but that he can’t prove this, and then adds a footnote giving a reference for bounded utility. (Though he does prove using P7 that if you have two acts f and g such that f,g≤b for all consequences b, then f≡g; similarly if f,g≥b for all b. Actually, this is a key lemma in proving that utility is bounded; Fishburn’s proof works by showing that if utility were unbounded, you could construct two actions that contradict this.)

Of course, if you really don’t like the conclusion that utility is bounded, you could throw out axiom 7! It’s pretty intuitive, but it’s not clear that ignoring it could actually get you Dutch-booked. After all, the first 6 axioms are enough to handle finite gambles, 7 is only needed for more general situations. So long as your Dutch bookie is limited to finite gambles, you don’t need this.

Questions on further justification

So now that I’ve laid all this out, here’s the question I originally meant to ask: To what extent can these axioms be grounded in more basic principles, e.g. Dutch book arguments? It seems to me that most of these are too basic for that to apply—Dutch book arguments need more working in the background. Still, it seems to me axioms P2, P3, and P4 might plausibly be grounded this way, though I have not yet attempted to figure out how. P7 presumably can’t, for the reasons noted in the previous section. P1 I assume is too basic. P5 obviously can’t (if the agent doesn’t care about anything, that’s its own problem).

P6 is an Archimedean condition. Typically I’ve seen those (specifically Von Neumann and Morgenstern’s continuity condition) justified on this site with the idea that infinitesimals will never be relevant in any practical situation—if c has only infinitesimally more utility than b, the only case when the distinction would be relevant is if the probabilities of accomplishing them were exactly equal, which is not realistic. I’m guessing infinitesimal probabilities can probably be done away with in a similar manner?

Or are these not good axioms in the first place? You all are more familiar with these sorts of things than me. Ideas?

What links here?

Sniffnoy22 May 2011 19:56 UTC

85 points

92 comments13 min readLW link Archive

Utility Functions Logic & Mathematics