Probability theory is the study of idealized inference. In particular, it’s the study of a precise formal system that, effectively, generalizes propositional logic to the inductive setting. This formal system is adequate to capture huge swaths of common sense reasoning. Conform to the rules of this formal system, and your inferential power will always match or exceed that offered by received statistical techniques—both in practice and in the theoretical idealization. Or so Jaynes argues, anyways.

There’s so much here. I’ll focus on Chapters 1-4, but will excerpt key bits from throughout.

Contents and Notes

1. Plausible Reasoning

This is a book about mathematically constructing an inference engine.

Our robot is going to reason about propositions. As already indicated above, we shall denote various propositions by italicized capital letters, {A,B,C,etc.}, and for the time being we must require that any proposition used must have, to the robot, an unambiguous meaning and must be of the simple, definite logical type that must be either true or false… We do not require that the truth or falsity of such an ‘Aristotelian proposition’ be ascertainable by any feasible investigation; indeed, our inability to do this is usually just the reason why we need the robot’s help.

…

To state these ideas more formally, we introduce some notation of the usual symbolic logic, or Boolean algebra, so called because George Boole (1854) introduced a notation similar to the following (p. 9).

The logical product

AB

means the conjunction A∧B.

The logical sum

A+B

means the disjunction A∨B.

Both operations are commutative in Boolean algebra, reflecting the fact that the order of the conjuncts or disjuncts in a conjunction or disjunction doesn’t change that sentence’s truth-value:

$$AB = BA$$

$$A + B = B + A$$

A proposition’s negation is denoted by a barred letter:

$$\bar{A} := \neg A$$

The Basic Desiderata

Jaynes has four intuitive desiderata he wants out of his inference engine:

To each proposition about which it reasons, our robot must assign some degree of plausibility, based on the evidence we have given it; and whenever it receives new evidence it must revise these assessments to take that new evidence into account. In order that these plausibility assignments can be stored and modified in the circuits of its brain, they must be associated with some definite physical quantity, such as voltage or pulse duration or a binary coded number, etc. -- however our engineers want to design the details. For present purposes, this means that there will have to be some kind of association between degrees of plausibility and real numbers:

(I) Degrees of plausibility are represented by real numbers.

Desideratum (I) is practically forced on us by the requirement that the robot’s brain must operate by the carrying out of some definite physical process. However, it will appear… that it is also required theoretically; we do not see the possibility of any consistent theory without a property that is equivalent functionally to desideratum (I) (p. 17).

The conditional

A|B

stands for some real number, the “conditional plausibility that A is true, given that B is true” (p. 17).

Also, the engine ought to reason in:

(II) Qualitative correspondence with common sense.

Finally we want to give our robot another desirable property for which honest people strive without always attaining: that it always reasons consistently. By this we mean just the three common colloquial meanings of the word ‘consistent’:

(IIIa) If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.

(IIIb) The robot always takes into account all of the evidence it has relevant to a question. It does not arbitrarily ignore some of the information, basing its conclusions only on what remains. In other words, the robot is nonideological.

(IIIc) The robot always represents equivalent states of knowledge by equivalent plausibility assignments. That is, if in two problems the robot’s state of knowledge is the same (except perhaps for the labeling of the propositions), then it must assign the same plausibilities to both.

Desiderata (I), (II), and (IIIa) are the basic ‘structural’ requirements on the inner workings of our robot’s brain, while (IIIb) and (IIIc) are ‘interface’ conditions which show how the robot’s behavior should relate to the outer world.

At this point, most students are surprised to learn that our search for desiderata is at an end. The above conditions, it turns out, uniquely determine the rules by which our robot must reason; i.e. there is only one set of mathematical operations for manipulating plausibilities which has all these properties (pp. 18-9).

2. The Quantitative Rules

The Product Rule

Going off of just the above desiderata, we’re going to build up the machinery of logical products. It follows from (I) that the logical product of propositions A and B conditional on some known background proposition C, denoted AB|C, is a real number.

In order for AB to be a true proposition, it is necessary that B is true. Thus the plausibility B|C should be involved. In addition, if B is true, it is further necessary that A should be true; so the plausibility A|BC is also needed. But if B is false, then of course AB is false independently of whatever one knows about A, as expressed by $A|\bar{B}C$; if the robot reasons first about B, then the plausibility of A will be relevant only if B is true. Thus, if the robot has B|C and A|BC it will not need A|C. That would tell it nothing about AB that it did not have already.

Similarly, A|B and B|A are not needed; whatever plausibility A or B might have in the absence of information C could not be relevant to judgements of a case in which the robot knows that C is true...

Of course, since the logical product is commutative, AB=BA, we could interchange A and B in the above statements; i.e. knowledge of A|C and B|AC would serve equally well to determine AB|C=BA|C. That the robot must obtain the same value for AB|C from either procedure is one of our conditions of consistency, desideratum (IIIa).

We can state this in a more definite form. AB|C will be some function of B|C and A|BC:

AB|C=F(B|C,A|BC)

(pp. 24-5, notation converted).

If our background knowledge is changed from C to C′, such that one of the conjuncts becomes a little more likely, the conjunction AB|C′ must also become a little more likely. So F(x,y), with x,y∈R, must be a continuous monotonic increasing function of both x and y (p. 26).

Expanding out the conditional ABC|D with the help of this function F in two ways, we get

F(F(C|D,B|CD),A|BCD)=ABC|D=F(C|D,AB|CD)

Let x=C|D, y=B|CD, and z=A|BCD here, with x,y,z∈R. This establishes another constraint on our function F:

F(F(x,y),z)=F(x,F(y,z))

This equation flows from the associativity of Boolean algebra together with desideratum (IIIa) (p. 26-7).

After further manipulations, Jaynes fixes another equation

w(F(x,y))=w(x)w(y)

where the function w is defined by

$$w(x) = \exp\left(\int_k^x \frac{1}{H(x)}\,dx\right)$$

with the function H being arbitrary and k being some constant (p. 28).^{[1]} “By its construction… w(x) must be a positive continuous monotonic function, increasing or decreasing according to the sign of H(x); at this stage it is otherwise arbitrary” (p. 29).
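To make the role of w concrete, here is a minimal sketch under an assumed composition rule. It is an illustration, not an example from the book: suppose plausibilities happened to be stored on a logarithmic scale, so that the hypothetical rule F(x,y)=x+y is continuous, increasing in both arguments, and associative. Then the regrading w(x)=exp(x) turns that rule into an ordinary product.

```python
import math

def F(x, y):
    # Hypothetical composition rule on a log-plausibility scale
    # (an assumption made for illustration only).
    return x + y

def w(x):
    # Regrading function that maps the assumed log scale to a multiplicative one.
    return math.exp(x)

x, y, z = -0.7, -1.2, -0.3

# Associativity constraint: F(F(x, y), z) = F(x, F(y, z))
assert math.isclose(F(F(x, y), z), F(x, F(y, z)))

# Regraded composition: w(F(x, y)) = w(x) * w(y)
assert math.isclose(w(F(x, y)), w(x) * w(y))
```

The point of the sketch is only that different-looking rules F can collapse into the same multiplicative form once regraded by a suitable monotonic w.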

Recall that

AB|C=F(A|BC,B|C)=F(B|AC,A|C)

Applying w to all expressions in the equation, we get

w(AB|C)=w(F(A|BC,B|C))=w(F(B|AC,A|C))

And using what we’ve just proven about w and F(x,y), we derive the product rule:

w(AB|C)=w(A|BC)w(B|C)=w(B|AC)w(A|C)

The requirements of qualitative correspondence with common sense impose further conditions on the function w(x). For example, [in the product rule,] suppose that A is certain, given C. Then in the ‘logical environment’ produced by knowledge of C, the propositions AB and B are the same, in the sense that one is true if and only if the other is true. By our most primitive axiom of all… propositions with the same truth value must have equal plausibility:

AB|C=B|C

and also we will have

A|BC=A|C

because if A is already certain given C (i.e. C implies A), then, given any other information B which does not contradict C, it is still certain. In this case...

w(B|C)=w(A|C)w(B|C)

and this must hold no matter how plausible or implausible B is to the robot. So our function w(x) must have the property that

certainty is represented by w(A|C)=1

Now suppose that A is impossible, given C. Then the proposition AB is also impossible given C:

AB|C=A|C

and if A is already impossible given C (i.e. C implies $\bar{A}$), then, given any further information B which does not contradict C, A would still be impossible:

A|BC=A|C

In this case...

w(A|C)=w(A|C)w(B|C)

and again this equation must hold no matter what plausibility B might have. There are only two possible values of w(A|C) that could satisfy this condition: it could be zero or +∞…

In summary, qualitative correspondence with common sense requires that w(x) be a positive continuous monotonic function. It may be either increasing or decreasing. If it is increasing, it must range from zero for impossibility up to one for certainty. If it is decreasing, it must range from ∞ for impossibility down to one for certainty. Thus far, our conditions say nothing about how it varies between these limits (p. 29-30).

It’s easy to translate back and forth between the scales [0,1] and [+∞,1] by just defining a new function $w_2$

$$w_2(x) := \frac{1}{w_1(x)}$$

Thus, without loss of generality, Jaynes henceforth adopts the [0,1] convention for representing the gamut from known impossibility to known certainty (p. 30).

The Sum Rule

The logical product $A\bar{A}$ is always false, the logical sum $A+\bar{A}$ always true. The plausibility that A is false must depend in some way on the plausibility that it is true. If we define $u = w(A|B)$, $v = w(\bar{A}|B)$, there must exist some functional relation

v=S(u)

(p. 30).

Maybe you can see where Jaynes is going here. He isn’t just going to assume as an axiom that all disjoint, collectively exhaustive probabilities sum to 1, even though that claim seems plenty intuitively compelling. He’s going to again derive that from the above desiderata.

We now have a functional S, as well as the product rule from earlier. The former will have to end up being consistent with the latter, and so the product rule will help to give this functional S some shape. I’ll skip over most of Jaynes’ derivation,^{[2]} up to:

Our results up to this point can be summarized as follows. Associativity of the logical product requires that some monotonic function w(x) of the plausibility x=A|B must obey the product rule… Our result… states that this same function must also obey a sum rule:

$$w^m(A|B) + w^m(\bar{A}|B) = 1$$

for some positive m.^{[3]} Of course, the product rule itself could be written equally well as

$$w^m(AB|C) = w^m(A|C)\,w^m(B|AC) = w^m(B|C)\,w^m(A|BC)$$

but then we see that the value of m is actually irrelevant; for whatever value is chosen, we can define a new function

$$p(x) := w^m(x)$$

[in terms of which the product and sum rules take the same form, with no exponent].

Are further relations needed to yield a complete set of rules for plausible inference, adequate to determine the plausibility of any logic function $f(A_1, \ldots, A_n)$ from those of $\{A_1, \ldots, A_n\}$? We have, in the product rule [and the sum rule], formulas for the plausibility of the conjunction AB and the negation $\bar{A}$… Conjunction and negation are an adequate set of operations, from which all logic functions can be constructed.

Therefore, one would conjecture that our search for basic rules should be finished; it ought to be possible, by repeated applications of the product rule and sum rule, to arrive at the plausibility of any proposition in the Boolean algebra generated by $\{A_1, \ldots, A_n\}$.

To verify this, we seek first a formula for the logical sum A+B. Applying the product rule and sum rule repeatedly, we have

$$p(A+B|C) = p(A|C) + p(B|C) - p(AB|C)$$

[This is called the generalized sum rule] (pp. 33-4, notation converted).
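A small numerical check may help here. The sketch below uses a made-up joint assignment over the four truth combinations of A and B (the numbers are assumptions, not from the book). It confirms that disjunction can be built from conjunction and negation alone, and that the plausibility of A+B computed directly agrees with the generalized sum rule.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint probabilities for the truth assignments of (A, B), given C.
joint = {
    (True, True): Fraction(1, 10),
    (True, False): Fraction(3, 10),
    (False, True): Fraction(2, 10),
    (False, False): Fraction(4, 10),
}

def prob(event):
    """Sum the joint probabilities over assignments where event(a, b) holds."""
    return sum(p for (a, b), p in joint.items() if event(a, b))

# Disjunction expressed with conjunction and negation only (De Morgan).
for a, b in product([True, False], repeat=2):
    assert (a or b) == (not ((not a) and (not b)))

p_a = prob(lambda a, b: a)
p_b = prob(lambda a, b: b)
p_ab = prob(lambda a, b: a and b)
p_a_or_b = prob(lambda a, b: a or b)

# Generalized sum rule: p(A+B|C) = p(A|C) + p(B|C) - p(AB|C)
assert p_a_or_b == p_a + p_b - p_ab
print(p_a_or_b)  # 3/5
```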

Jaynes has hitherto carefully avoided using the term ‘probability,’ the conventional P(A|B) notation for it, or the intuitions behind it. But the machinery developed so far has been demonstrated to have all the requisite properties of probability! So Jaynes will now formally christen these values with the conventional symbolism

P(A|B):=p(A|B)

as well as term them probabilities. That is, the machinery of probability theory is hereby considered set up; all that remains is to demonstrate its inferential power.

3. Elementary Sampling Theory

If on background information B the hypotheses $\{H_1, H_2, \ldots, H_N\}$ are mutually exclusive and exhaustive, and B does not favor any one of them over any other, then

$$P(H_i|B) = \frac{1}{N}, \quad 1 \le i \le N$$

[From this equation and the generalized sum rule, we can derive] the Bernoulli urn rule: if B specifies that A is true on some subset of M of the $H_i$, and false on the remaining $N - M$, then

$$P(A|B) = \frac{M}{N}$$

…

Essentially all of conventional probability theory as currently taught, plus many important results that are often thought to lie beyond the domain of probability theory, can be derived from [just this] foundation (p. 51, notation converted).

Sampling without Replacement

A central refrain of Jaynes’ is that thou shalt not commit the mind-projection fallacy. In this section he’ll talk about the probabilities of drawing certain balls from urns. The probabilistic properties of these balls and urns… aren’t physical characteristics of the balls, urns, an urn’s being stirred, and/or your hand reaching in at all. All the probabilities discussed are features of what an inference engine has observed about the world beforehand. If the inference engine observed that the top of the urn is all red balls, its probability for a red ball on the next draw is 1. If another inference engine did not get to make that observation, its probability of a red ball on the next draw is different!

Relatedly, in our formalism, observe that while A|B∈R, A∉R. A is a proposition, not a number, and so P(A) isn’t even a well-formed probability! Probabilities are defined by a hypothesis and a set of past observations, and are undefined without a memory of past observations.

Let the symbol $R_i$ stand for the proposition, “On the $i$th draw, a red ball comes up,” and $W_i$, the proposition, “On the $i$th draw, a white ball comes up.” Let the symbol B stand for the proposition, “An urn contains N balls, all identical except for color, with M of them red and N−M of them white. We will draw from the urn without replacement, and repeat until a target number of total draws n is reached.” Our inference engine can now generate

$$P(R_1|B) = \frac{M}{N}$$

for the probability that the first ball drawn is red, conditional on observing only that background setup proposition B.

What is the probability of two reds in a row coming up, conditional only on B? By the product rule:

$$P(R_1R_2|B) = P(R_1|B)\,P(R_2|R_1B)$$

Combining this with our previous equation:

$$P(R_1R_2|B) = \frac{M}{N}\,P(R_2|R_1B)$$

As for the second factor, the new background observation $R_1B$ asserts that one ball, and one red ball specifically, has been removed. Thus:

$$P(R_2|R_1B) = \frac{M-1}{N-1}$$

$$P(R_1R_2|B) = \frac{M}{N}\,\frac{M-1}{N-1}$$

Continuing in this way, the probability for red on the first r consecutive draws is^{[4]}

$$P(R_1R_2\cdots R_r|B) = \frac{M!\,(N-r)!}{(M-r)!\,N!}$$

The probability for white on the first w draws is similar but for the interchange of M and (N−M):

$$P(W_1W_2\cdots W_w|B) = \frac{(N-M)!\,(N-w)!}{(N-M-w)!\,N!}$$

Then, the probability for white on draws r+1,r+2,⋯,r+w given that we got red on the first r draws [is]:

$$P(W_{r+1}\cdots W_{r+w}|R_1\cdots R_rB) = \frac{(N-M)!\,(N-r-w)!}{(N-M-w)!\,(N-r)!}$$

and so, by the product rule, the probability for obtaining r red followed by w=n−r white in n draws [is],

$$P(R_1\cdots R_rW_{r+1}\cdots W_n|B) = \frac{M!\,(N-M)!\,(N-n)!}{(M-r)!\,(N-M-w)!\,N!}$$

(pp. 53-4, notation converted).

These fractions can quickly become visually cluttered, depending on how general the proposition you’re asking about (left of the conditional bar) is. Remember that these busy fractions are just reduced products of many simpler fractions, and any strange extra terms appearing in the numerator and denominator are there in order to cancel out terms over on the denominator or numerator, respectively.
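The cancellation can be checked directly. The sketch below (function names and the urn sizes are illustrative assumptions) builds the probability of r reds followed by w whites one draw at a time with the product rule, then compares it against the closed-form factorial expression for the same sequence.

```python
from fractions import Fraction
from math import factorial

def prob_sequence(N, M, r, w):
    """P(r reds then w whites | B), built draw by draw with the product rule.

    Assumes r <= M and w <= N - M, so the urn never runs out of a color.
    """
    p = Fraction(1)
    reds, whites, total = M, N - M, N
    for _ in range(r):
        p *= Fraction(reds, total)   # one fewer red and one fewer ball afterwards
        reds -= 1
        total -= 1
    for _ in range(w):
        p *= Fraction(whites, total)
        whites -= 1
        total -= 1
    return p

def prob_closed_form(N, M, r, w):
    """The reduced factorial form of the same product."""
    n = r + w
    num = factorial(M) * factorial(N - M) * factorial(N - n)
    den = factorial(M - r) * factorial(N - M - w) * factorial(N)
    return Fraction(num, den)

N, M, r, w = 10, 4, 2, 3  # hypothetical urn: 10 balls, 4 red
assert prob_sequence(N, M, r, w) == prob_closed_form(N, M, r, w)
print(prob_sequence(N, M, r, w))  # 1/21
```

Here the step-by-step product is 4/10 · 3/9 · 6/8 · 5/7 · 4/6, which is exactly what the factorials encode.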

Expectations

If a variable quantity can take on the particular values $x_1, \cdots, x_n$ in n mutually exclusive and exhaustive situations, and the robot assigns corresponding probabilities $p_1, p_2, \cdots, p_n$ to them, then the quantity

$$E(X) := \sum_{i=1}^{n} p_i x_i$$

is called the expectation… It is a weighted average of the possible values, weighted according to their probabilities.

…

When the fraction $M/N$ of red balls is known, then the Bernoulli urn rule applies and $P(R_1|B) = M/N$. When $M/N$ is unknown, the probability for red is the expectation of $M/N$:

$$P(R_1|B) = E\left(\frac{M}{N}\right)$$

(p. 67, notation converted).
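As a minimal sketch of this last statement, suppose the robot does not know M but holds some prior over it. The prior below is invented purely for illustration. The probability of red on the first draw is then the prior-weighted average of M/N.

```python
from fractions import Fraction

N = 10
# Hypothetical prior over the unknown number of red balls M (an assumption).
prior_M = {2: Fraction(1, 4), 5: Fraction(1, 2), 8: Fraction(1, 4)}
assert sum(prior_M.values()) == 1

def expectation(pairs):
    """Expectation of a quantity given (value, probability) pairs."""
    return sum(p * x for x, p in pairs)

# P(R1|B) = E(M/N): average the known-M answer over the prior on M.
p_red_first = expectation((Fraction(M, N), p) for M, p in prior_M.items())
print(p_red_first)  # 1/2
```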

Sampling with Replacement

Suppose the inference engine draws one ball from the urn, examines it, and then returns it. Exactly how it returns the ball will logically determine much about what its next draw will look like. Setting that ball exactly on top of the heap means that the next draw will be the same ball, with probability 1. Leaving the ball somewhere in the top half of the heap and then sampling from that top half means the probability of a same-color draw must be elevated somewhat.

The procedure the inference engine adopts is to embed the ball in the urn and then vigorously shake it. The inference engine remembers that it drew a red or white on its previous draw. Now, however, that background proposition (right of the conditional bar) doesn’t logically fix much of anything about what will be drawn next. That replacement procedure logically decorrelated the inference engine’s remembered observations from its upcoming observation. So, as a best approximation, the inference engine declares its past observations irrelevant to the question of what it will next draw. In symbols, supposing that it drew white last time,

$$P(R_2|W_1B') = \frac{M}{N}$$

where B′ represents the above reasoning and known problem setup.

This is not just a repetition of what we learned [earlier]; what is new here is that the result now holds whatever information the robot may have about what happened in the other trials. This leads us to write the probability for drawing exactly r red balls in n trials, regardless of order, as

$$\left(\frac{n!}{r!\,(n-r)!}\right)\left(\frac{M}{N}\right)^{r}\left(\frac{N-M}{N}\right)^{n-r}$$

which is [the binomial distribution].^{[5]} Randomized sampling with replacement from an urn with finite N has approximately the same effect as passage to the limit N→∞ without replacement (p. 75, notation converted).
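A short sketch of the with-replacement result, using exact fractions. The urn contents and the function name are assumptions chosen for illustration. It evaluates the expression above for every r and checks that the values sum to 1.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, M, N):
    """Probability of exactly r reds in n randomized draws with replacement."""
    f = Fraction(M, N)  # probability of red on any single draw
    return comb(n, r) * f**r * (1 - f)**(n - r)

N, M, n = 10, 4, 5  # hypothetical urn and number of draws
pmf = [binomial_pmf(r, n, M, N) for r in range(n + 1)]

assert sum(pmf) == 1
print(pmf[2])  # 216/625
```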

4. Elementary Hypothesis Testing

Call the proposition containing the entirety of what our inference engine has observed and otherwise contains in its head X. Call the new data revealed to the robot in the context of the problem at hand D.

Any probability P(A|X) that is conditional on X alone is called a prior probability. But we caution that the term ‘prior’ is another of those terms from the distant past that can be inappropriate and misleading today. In the first place, it does not necessarily mean ‘earlier in time’. Indeed, the very concept of time is not in our general theory (although we may of course introduce it in a particular problem). The distinction is a purely logical one; any additional information beyond the immediate data D of the current problem is by definition ‘prior information’.

…

There is no single universal rule for assigning priors; the conversion of verbal prior information into numerical prior probabilities is an open-ended problem of logical analysis, to which we shall return many times. At present, four fairly general principles are known (group invariance, maximum entropy, marginalization, and coding theory) which have led to successful solutions to many different kinds of problems (pp. 87-8).

Let H stand for a hypothesis to be tested. Then, by the product rule,

$$P(H|DX) = P(H|X)\,\frac{P(D|HX)}{P(D|X)}$$

The term P(H|X) is then our prior for the hypothesis. Left of the equals sign, the term P(H|DX) is called our posterior for the hypothesis, because it conditions on our data-at-hand D and our inference engine’s background memory X. The term P(D|HX) is called the likelihood—in other usages, the likelihood is also termed the ‘sampling distribution’ (p. 89). The overall equation, of course, whatever propositions are fed into it, is called Bayes’ theorem.

Testing Binary Hypotheses with Binary Data

The simplest nontrivial problem of hypothesis testing is the one where we have only two hypotheses to test and only two possible data values. Surprisingly, this turns out to be a realistic and valuable model of many important inference and decision problems (p. 90).

Bayes’ theorem also happily applies in the case where we are testing for H’s falsity:

$$P(\bar{H}|DX) = P(\bar{H}|X)\,\frac{P(D|\bar{H}X)}{P(D|X)}$$

The ratio of these two posteriors, for H and for $\bar{H}$, both conditional on D and X, is

$$\frac{P(H|DX)}{P(\bar{H}|DX)} = \frac{P(H|X)}{P(\bar{H}|X)}\,\frac{P(D|HX)}{P(D|\bar{H}X)}$$

and is called the odds on H, conditional on D and X (p. 90).

Taking the logarithm of the odds on a proposition enables the clean adding up of odds in this problem.^{[6]} Which logarithm base (and unit coefficient) we choose fixes our evidential unit. Jaynes likes evidential decibels (dB), as he thinks these are easy to intuitively interpret. Decibels are the unit you get with a logarithm base (and unit coefficient) of 10, meaning you use the expression $10\log_{10}\frac{P(H|DX)}{P(\bar{H}|DX)}$ dB. With a logarithm base of 2, you get evidential bits (b): you use $\log_2\frac{P(H|DX)}{P(\bar{H}|DX)}$ b.
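The unit conversions can be sketched as a few small helpers. The function names are my own, and the probabilities fed in are arbitrary examples. They map a probability to evidence in decibels or bits, and back again.

```python
import math

def evidence_db(p):
    """Evidence for a proposition of probability p, in decibels: 10*log10(odds)."""
    return 10 * math.log10(p / (1 - p))

def evidence_bits(p):
    """The same evidence measured in bits: log2(odds)."""
    return math.log2(p / (1 - p))

def prob_from_db(e):
    """Invert evidence_db: recover the probability from evidence in decibels."""
    odds = 10 ** (e / 10)
    return odds / (1 + odds)

for p in (0.5, 0.9, 0.99):
    print(p, round(evidence_db(p), 2), round(evidence_bits(p), 2))
```

Even odds sit at 0 dB, and each additional 10 dB multiplies the odds by ten.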

We shall describe a [concrete] problem of industrial quality control (although it could be phrased equally well as a problem of cryptography, chemical analysis, interpretation of a physics experiment, judging two economic theories, etc.). Following the example of Good (1950), we assume numbers which are not very realistic in order to elucidate some points of principle. Let the prior information X consist of the following [proposition]:

“We have 11 automatic machines turning out widgets, which pour out of the machines into 11 boxes. This example corresponds to a very early stage in the development of the widgets, because ten of the machines produce one in six defective. The 11th machine is even worse; it makes one in three defective. The output of each machine has been collected in an unlabeled box and stored in the warehouse.”

We choose one of the boxes and test a few widgets, classifying them as ‘good’ or ‘bad’. Our job is to decide whether we chose a box from the bad machine or not; that is, whether we are going to accept this batch or reject it (p. 95).

Let A be, “We chose a bad batch, with 1/3 defective,” and B be, “We chose a good batch, with 1/6 defective.” From our prior X, we know that one or the other is true:

$$\bar{A} = B, \qquad \bar{B} = A$$

Because our prior is that there are 11 machines with no further information,

$$p(A|X) = \frac{1}{11}$$

and the evidence in decibels for A is

$$10\log_{10}\frac{1/11}{10/11}\ \text{dB} = -10\ \text{dB}$$

(and the decibels for B, conversely, are 10 dB).

Evidently, in this problem the only properties of X that will be relevant for the calculation are just these numbers, ±10 dB. Any other kind of prior information which led to the same numbers would give us just the same mathematical problem from this point on (p. 94).

If we draw a broken widget from our box on the first draw, we add $10\log_{10}\frac{P(\text{bad}|AX)}{P(\text{bad}|\bar{A}X)}$ dB to our existing −10 dB for A. The inference engine knows that

$$P(\text{bad}|AX) = \frac{1}{3}$$

and that

$$P(\text{bad}|\bar{A}X) = \frac{1}{6}$$

so this would mean adding 3 dB of evidence for A on a first bad draw.

If we’re sampling from a small batch of widgets, the probabilities for good and bad draws will now update, as we saw in the section on sampling without replacement. If batch sizes are much larger (by, say, at least two orders of magnitude) than test sizes, our inference engine’s probabilities for good and bad don’t appreciably update (pp. 70-1, 94-5). Instead, the inference engine approximates this problem as sampling from a binomial distribution, as previously discussed. Thus, every bad draw will continue to constitute 3 dB of evidence for A.

Similarly, each good draw will contribute $10\log_{10}\frac{2/3}{5/6} \approx -0.97 \approx -1$ dB of evidence for A, i.e. about 1 dB against it (pp. 94-5).
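The bookkeeping for this widget problem can be sketched as follows. It assumes the large-batch approximation described above, so each draw contributes a fixed number of decibels. The function names are mine, and the cross-check applies Bayes’ theorem directly to confirm that adding decibels gives the same posterior.

```python
import math

PRIOR_DB = 10 * math.log10((1 / 11) / (10 / 11))  # about -10 dB for A
BAD_DB = 10 * math.log10((1 / 3) / (1 / 6))       # about +3 dB per bad widget
GOOD_DB = 10 * math.log10((2 / 3) / (5 / 6))      # about -0.97 dB per good widget

def evidence_for_A(n_bad, n_good):
    """Evidence for A in dB after the given counts, assuming independent draws."""
    return PRIOR_DB + n_bad * BAD_DB + n_good * GOOD_DB

def posterior_A_direct(n_bad, n_good):
    """Cross-check: posterior for A from Bayes' theorem without logarithms."""
    like_A = (1 / 3) ** n_bad * (2 / 3) ** n_good
    like_B = (1 / 6) ** n_bad * (5 / 6) ** n_good
    return (1 / 11) * like_A / ((1 / 11) * like_A + (10 / 11) * like_B)

e = evidence_for_A(n_bad=4, n_good=1)
odds = 10 ** (e / 10)
assert math.isclose(odds / (1 + odds), posterior_A_direct(4, 1))
print(round(e, 2))
```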

Digression on Another Derivation

Because the hypothesis space of our inference engine in that problem was just {A,B}, it could not reason its way to any outside, further hypothesis C. I won’t review all of Jaynes’ work here, but the machinery above can be extended to a larger hypothesis space containing C.

As a brief preview, though, let C be the proposition, “We chose a horrifying batch, with 99/100 defective.” Give C a starting evidential base of −60 dB, which is incredibly unlikely! Now, have our inference engine progressively sample more and more widgets from the box, having all the widgets come up defective. Its evidential judgements go as follows:

The inference engine initially comes to favor the bad batch hypothesis A over the good batch hypothesis B as it samples more and more uniformly broken widgets. But after enough broken widgets, the horrifying batch hypothesis C rises above even A!
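The trajectory can be recomputed with a rough sketch. It assumes that C starts at odds of 10^-6, that A and B split the remaining prior probability 1:10, and that draws are independent. The helper names are hypothetical.

```python
import math

# Assumed priors: C at -60 dB (odds 10**-6); A and B share the rest 1:10.
p_C = 1e-6 / (1 + 1e-6)
priors = {"A": (1 - p_C) / 11, "B": (1 - p_C) * 10 / 11, "C": p_C}
p_bad = {"A": 1 / 3, "B": 1 / 6, "C": 99 / 100}  # defective fraction per hypothesis

def evidence_db(p):
    return 10 * math.log10(p / (1 - p))

def posteriors(m):
    """Posterior probabilities after m consecutive defective widgets."""
    unnorm = {h: priors[h] * p_bad[h] ** m for h in priors}
    total = sum(unnorm.values())
    return {h: u / total for h, u in unnorm.items()}

def first_draw_where_C_exceeds(h, limit=50):
    """Smallest m at which C's posterior is above that of hypothesis h."""
    return next(m for m in range(limit) if posteriors(m)["C"] > posteriors(m)[h])

for m in range(0, 16):
    print(m, {h: round(evidence_db(p), 1) for h, p in posteriors(m).items()})
```

Under these assumed numbers, C overtakes B at the 8th consecutive defective widget and A at the 11th.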

Whenever the hypotheses [in a discrete hypothesis space] are separated by 10 dB or more, then multiple hypothesis testing reduces approximately to testing each hypothesis against a single alternative.

…

In summary, the role of our new hypothesis C was only to be held in abeyance until needed, like a fire extinguisher. In a normal testing situation it is ‘dead’, playing no part in the inference because its probability is and remains far below that of the other hypotheses. But a dead hypothesis can be resurrected to life by very unexpected data (pp. 104-5).

Notice that everything we’ve seen has ultimately boiled down to product and sum rule manipulations of expressions P(A|B)! There’s no mathematical split between sampling and hypothesis testing—and this suggests that the apparent conceptual split between the two is similarly illusory.

The conceptual viewpoint instead suggested is that all of these manipulations be thought of as logical implications of an inference engine’s observations. That’s where the book’s subtitle, The Logic of Science, comes from, and where Jaynes sources his constant refrain, “probability theory as extended logic.”

5. Queer Uses for Probability Theory

6. Elementary Parameter Estimation

Probability theory as extended logic is an exact mathematical system. That is, results derived from correct application of our rules without approximation have the property of exact results in any other area of mathematics: you can subject them to arbitrary extreme conditions and they continue to make sense (pp. 153-4).

…

It is not surprising that the binomial prior is more informative about the unsampled balls than are the data of a small sample; but actually it is more informative about them than are any amount of data; even after sampling 99% of the population, we are no wiser about the remaining 1%.

So what is the invisible strange property of the binomial prior? It is in some sense so ‘loose’ that it destroys the logical link between different members of the population. But on meditation we see that this is just what was implied by our scenario of the urn being filled by monkeys tossing in balls in such a way that each ball had independently the probability g of being red. Given that filling mechanism, then knowing that any given ball is in fact red, gives one no information whatsoever about any other ball. That is, $P(R_1R_2|I) = P(R_1|I)\,P(R_2|I)$. This logical independence in the prior is preserved in the posterior distribution (p. 162).

…

Prior information can tell us whether some hypothesis provides a possible mechanism for the observed facts, consistent with the known laws of physics. If [the hypothesis] does not, then the fact that it accounts well for the data may give it a high likelihood, but it cannot give it any credence. A fantasy that invokes the labors of hordes of little invisible elves and pixies running about to generate the data would have just as high a likelihood; but it would still have no credence (p. 196).

7. The Central, Gaussian, or Normal Distribution

In probability theory, there seems to be a central, universal distribution

$$\varphi(x) := \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{x^2}{2}\right)$$

toward which all others gravitate under a very wide variety of operations—and which, once attained, remains stable under an even wider variety of operations...

This distribution is called the Gaussian, or normal, distribution, for historical reasons discussed below. Both names are inappropriate and misleading today; all the correct connotations would be conveyed if we called it, simply, the central distribution of probability theory (pp. 199-200, notation converted).

…

The most ubiquitous reason for using the Gaussian sampling distribution is not that the error frequencies are known to be—or assumed to be—Gaussian, but rather because those frequencies are unknown. One sees what a totally different outlook this is than that of Feller and Barnard; ‘normality’ was not an assumption of physical fact at all. It was a valid description of our state of knowledge. In most cases, had we done anything different, we would be making an unjustified, gratuitous assumption (violating one of our Chapter 1 desiderata of rationality) (p. 210).

…

The term ‘central limit theorem’… was introduced by George Pólya (1920), with the intention that the adjective ‘central’ was to modify the noun ‘theorem’; i.e. it is the limit theorem which is central to probability theory. Almost universally, students today think that ‘central’ modifies ‘limit’, so that it is instead a theorem about a ‘central limit’, whatever that means...

Our suggested terminology takes advantage of this; looked at in this way, the terms ‘central distribution’ and ‘central limit theorem’ both convey the right connotations to one hearing them for the first time. One can read ‘central limit’ as meaning a limit towards a central distribution, and will be invoking just the right intuitive picture (p. 242).

8. Sufficiency, Ancillarity, and All That

9. Repetitive Experiments: Probability and Frequency

10. Physics of ‘Random Experiments’

11. Discrete Prior Probabilities: The Entropy Principle

Shannon’s theorem: The only function $H(p_1, \cdots, p_n)$ satisfying the conditions we have imposed on a reasonable measure of ‘amount of uncertainty’ is

$$H(p_1, \cdots, p_n) = -\sum_{i=1}^{n} p_i \log_b(p_i)$$

[where the terms $p_1, \cdots, p_n$ are the prior probabilities of their index propositions $A_1, \cdots, A_n$, and you have your choice of log base b.] Accepting this interpretation, it follows that the distribution $p_1, \cdots, p_n$ which maximizes $H(p_1, \cdots, p_n)$, subject to constraints imposed by the available information, will represent the ‘most honest’ description of what the robot knows about the propositions $A_1, \cdots, A_n$.

…

The function H is called the entropy, or, better, the information entropy of the distribution $\{p_i\}$.

…

We have seen the mathematical expression $\sum p \log p$ appearing incidentally in several previous chapters, generally in connection with the multinomial distribution; now it has acquired new meaning as a fundamental measure of how uniform a probability distribution is (pp. 348-51, notation converted).
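A minimal sketch of the entropy function follows, with two made-up distributions to illustrate the “how uniform” reading. The names and numbers are assumptions. With no constraints beyond normalization, the uniform assignment scores highest.

```python
import math

def entropy(probs, base=2):
    """Information entropy H = -sum(p_i * log_b(p_i)); terms with p_i = 0 contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]   # hypothetical, less uniform assignment
certain = [1.0, 0.0, 0.0, 0.0]  # all plausibility on one proposition

assert entropy(uniform) > entropy(skewed) > entropy(certain)
print(entropy(uniform), entropy(skewed), entropy(certain))
```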

12. Ignorance Priors and Transformation Groups

13. Decision Theory, Historical Background

14. Simple Applications of Decision Theory

15. Paradoxes of Probability Theory

16. Orthodox Methods: Historical Background

17. Principles and Pathology of Orthodox Statistics

18. The $A_p$ Distribution and the Rule of Succession

Jaynes here wants to define the factorial of a negative integer to be infinite, as this will obviate the need for some restrictions and these equations will continue to yield meaningful results even when r>M, or similar.

Derivation of the Hypergeometric and Binomial Distributions

What is the robot’s probability for drawing exactly r red balls in n draws, regardless of order? Different orders of appearance of red and white balls are mutually exclusive possibilities, so we must sum over all of them; but since each term is equal to

$$P(R_1\cdots R_rW_{r+1}\cdots W_n|B) = \frac{M!\,(N-M)!\,(N-n)!}{(M-r)!\,(N-M-w)!\,N!}$$

we merely multiply it by the binomial coefficient

$$\binom{n}{r} := \frac{n!}{r!\,(n-r)!}$$

which represents the number of possible orders of drawing r red balls in n draws (p. 54-5, notation converted).

Let A be the proposition, “Exactly r red balls are drawn in n draws.” We now define a function h

$$h(r|N,M,n) := P(A|B) = \frac{\binom{M}{r}\binom{N-M}{n-r}}{\binom{N}{n}}$$

called the hypergeometric distribution (p. 55-6).
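The definition of h can be checked directly with exact arithmetic. This is a small sketch; the function name `hypergeom` and the urn sizes are illustrative assumptions. Python's `math.comb(n, k)` returns 0 when k > n, which plays a role similar to the factorial convention mentioned above: the formula keeps giving a sensible answer (zero) when r > M.

```python
from fractions import Fraction
from math import comb

def hypergeom(r, N, M, n):
    """h(r | N, M, n): probability of exactly r red balls in n draws without
    replacement from an urn of N balls, M of them red."""
    if r < 0 or r > n:
        return Fraction(0)
    # comb(M, r) is 0 when r > M, and comb(N - M, n - r) is 0 when n - r > N - M.
    return Fraction(comb(M, r) * comb(N - M, n - r), comb(N, n))

# Hypothetical urn: 10 balls, 4 red, 3 draws.
N, M, n = 10, 4, 3
dist = [hypergeom(r, N, M, n) for r in range(n + 1)]

for r, p in enumerate(dist):
    print(r, p)
```

Because the outcomes r = 0, ..., n are mutually exclusive and exhaustive, the values in `dist` sum to exactly 1.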

The [complexity] of the hypergeometric distribution arises because it is taking into account the changing contents of the urn; knowing the result of any draw changes the probability for red for any other draw. But if the number N of balls in the urn is very large compared with the number drawn, N≫n, then this probability changes very little, and in the limit N→∞ we should have a simpler result, free of such dependencies. To verify this, we write the hypergeometric distribution [as]

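$$h(r|N,M,n) = \frac{\frac{1}{N^r}\binom{M}{r}\;\frac{1}{N^{n-r}}\binom{N-M}{n-r}}{\frac{1}{N^n}\binom{N}{n}}$$

The first factor [expands to]

$$\frac{1}{N^r}\binom{M}{r} = \frac{1}{r!}\,\frac{M}{N}\left(\frac{M}{N}-\frac{1}{N}\right)\left(\frac{M}{N}-\frac{2}{N}\right)\cdots\left(\frac{M}{N}-\frac{r-1}{N}\right)$$

and in the limit $N\to\infty$, $M\to\infty$, $M/N\to f$, we have

$$\frac{1}{N^r}\binom{M}{r} \to \frac{f^r}{r!}$$

Likewise,

$$\frac{1}{N^{n-r}}\binom{N-M}{n-r} \to \frac{(1-f)^{n-r}}{(n-r)!} \qquad \frac{1}{N^n}\binom{N}{n} \to \frac{1}{n!}$$

In principle, we should, of course, take the limit of the product… not the product of the limits. But… we have defined the factors so that each has its own independent limit, so the result is the same; the hypergeometric distribution goes into

$$b(r|n,f) := \binom{n}{r} f^r (1-f)^{n-r}$$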

called the binomial distribution (p. 69-70, notation converted).
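The limiting argument can be illustrated numerically. The sketch below holds n, r, and f fixed and lets N grow with M = fN; the function names and the specific values are assumptions chosen for illustration.

```python
from math import comb

def hypergeom(r, N, M, n):
    """Hypergeometric probability h(r | N, M, n) as a float."""
    return comb(M, r) * comb(N - M, n - r) / comb(N, n)

def binom(r, n, f):
    """Binomial probability b(r | n, f)."""
    return comb(n, r) * f**r * (1 - f) ** (n - r)

n, r, f = 10, 3, 0.3
target = binom(r, n, f)

# Absolute gap between h and b as the urn grows with M/N held at f.
errors = []
for N in (100, 1_000, 10_000, 100_000):
    M = round(f * N)
    errors.append(abs(hypergeom(r, N, M, n) - target))

print(target, errors)
```

The gap shrinks as N grows, in line with the claim that for N≫n the changing contents of the urn stop mattering.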


## Probability Theory: The Logic of Science, Jaynes

Epistemic status: An idiosyncratic walkthrough of the beginning of a much larger textbook. (The vampire!)

Probability theory is the study of idealized inference. In particular, it’s the study of a precise formal system that, effectively, generalizes propositional logic to the inductive setting. This formal system is adequate to capture huge swaths of common sense reasoning. Conform to the rules of this formal system, and your inferential power will always match or exceed that offered by received statistical techniques, both in practice and in the theoretical idealization. Or so Jaynes argues, anyways.

There’s so much here. I’ll focus on Chapters 1-4, but will excerpt key bits from throughout.

## Contents and Notes

## 1. Plausible Reasoning

This is a book about mathematically constructing an inference engine.

The logical product

$$AB$$

means the conjunction A∧B.

The logical sum

$$A+B$$

means the disjunction A∨B.

Both operations are commutative in Boolean algebra, reflecting the fact that the order of the conjuncts or disjuncts in a conjunction or disjunction doesn’t change that sentence’s truth-value:

$$AB = BA \qquad A+B = B+A$$

A proposition’s negation is denoted by a barred letter:

$$\bar{A} := \neg A$$

## The Basic Desiderata

Jaynes has four intuitive desiderata he wants out of his inference engine:

The conditional

$$A|B$$

stands for some real number, the “conditional plausibility that A is true, given that B is true” (p. 17).

Also, the engine ought to reason in:

## 2. The Quantitative Rules

## The Product Rule

Going off of just the above desiderata, we’re going to build up the machinery of logical products. It follows from (I) that the logical product of propositions A and B conditional on some known background proposition C, denoted AB|C, is a real number.

If our background knowledge is changed from C to C′, such that one of the conjuncts becomes a little more likely, the conjunction AB|C′ must also become a little more likely. So F(x,y), with x,y∈R, must be a continuous monotonic increasing function of both x and y (p. 26).

Expanding out the conditional ABC|D with the help of this function F in two ways, we get

$$F(F(C|D,\,B|CD),\,A|BCD) = ABC|D = F(C|D,\,AB|CD)$$

Let x=C|D, y=B|CD, and z=A|BCD here, with x,y,z∈R. This establishes another constraint on our function F:

$$F(F(x,y),z) = F(x,F(y,z))$$

This equation flows from the associativity of Boolean algebra together with desideratum (IIIa) (p. 26-7).

After further manipulations, Jaynes fixes another equation

$$w(F(x,y)) = w(x)\,w(y)$$

where the function w is defined by

$$w(x) = \exp\left(\int_k^x \frac{1}{H(x)}\,dx\right)$$

with the function H being arbitrary and k being some constant (p. 28).^{[1]}

“By its construction… w(x) must be a positive continuous monotonic function, increasing or decreasing according to the sign of H(x); at this stage it is otherwise arbitrary” (p. 29).

Recall that

$$AB|C = F(A|BC,\,B|C) = F(B|AC,\,A|C)$$

Applying w to all expressions in the equation, we get

$$w(AB|C) = w(F(A|BC,\,B|C)) = w(F(B|AC,\,A|C))$$

And using what we’ve just proven about w and F(x,y), we derive the product rule:

$$w(AB|C) = w(A|BC)\,w(B|C) = w(B|AC)\,w(A|C)$$

It’s easy to translate back and forth between the scales [0,1] and [+∞,1] by just defining a new function $w_2$

$$w_2(x) := \frac{1}{w_1(x)}$$

Thus, without loss of generality, Jaynes henceforth adopts the [0,1] convention for representing the gamut from known impossibility to known certainty (p. 30).
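As a rough numerical illustration of the two functional equations above (this is not Jaynes’ derivation), the sketch below picks one hypothetical regrading function `w` on (0, 1], defines `F` so that w(F(x,y)) = w(x)w(y) holds by construction, and then checks that this `F` is associative even though it is not plain multiplication. The particular `w` is an arbitrary assumption chosen because its inverse has a closed form.

```python
import math
from itertools import product

def w(x):
    """Hypothetical regrading function on (0, 1]: positive, continuous,
    monotonic increasing, with w(1) = 1."""
    return math.exp(1.0 - 1.0 / x)

def w_inv(u):
    """Inverse of w on (0, 1]."""
    return 1.0 / (1.0 - math.log(u))

def F(x, y):
    """Combination rule built so that w(F(x, y)) == w(x) * w(y).
    Algebraically this is 1 / (1/x + 1/y - 1), not x * y."""
    return w_inv(w(x) * w(y))

grid = [0.2, 0.5, 0.9]

# Associativity: F(F(x, y), z) == F(x, F(y, z)) on every grid triple.
associative = all(
    math.isclose(F(F(x, y), z), F(x, F(y, z)), rel_tol=1e-9)
    for x, y, z in product(grid, repeat=3)
)

# After regrading by w, the rule is ordinary multiplication.
multiplicative = all(
    math.isclose(w(F(x, y)), w(x) * w(y), rel_tol=1e-9)
    for x, y in product(grid, repeat=2)
)

print(associative, multiplicative, F(0.5, 0.5))
```

The point of the check is only that an associative, monotonic combination rule can look unlike a product on the original scale while still reducing to the product rule once plausibilities are regraded through w.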

## The Sum Rule

Maybe you can see where Jaynes is going here. He isn’t just going to assume as an axiom that all disjoint, collectively exhaustive probabilities sum to 1, even though that claim seems plenty intuitively compelling. He’s going to again derive that from the above desiderata.

We now have a functional S, as well as the product rule from earlier. The former will have to end up being consistent with the latter, and so the product rule will help to give this functional S some shape. I’ll skip over most of Jaynes’ derivation,^{[2]} up to:

Jaynes has hitherto carefully avoided using the term ‘probability,’ the conventional P(A|B) notation for it, or the intuitions behind it. But the machinery developed so far has been demonstrated to have all the requisite properties of probability! So Jaynes will now formally christen these values with the conventional symbolism

$$P(A|B) := p(A|B)$$

as well as term them probabilities. That is, the machinery of probability theory is hereby considered set up; all that remains is to demonstrate its inferential power.

## 3. Elementary Sampling Theory

## Sampling without Replacement

A central refrain of Jaynes’ is that thou shalt not commit the mind-projection fallacy. In this section he’ll talk about the probabilities of drawing certain balls from urns. The probabilistic properties of these balls and urns… aren’t physical characteristics of the balls, urns, an urn’s being stirred, and/or your hand reaching in at all. All the probabilities discussed are features of what an inference engine has observed about the world beforehand. If the inference engine observed that the top of the urn is all red balls, its probability for a red ball on the next draw is 1. If another inference engine did not get to make that observation, its probability of a red ball on the next draw is different!

Relatedly, in our formalism, observe that while A|B∈R, A∉R. A is a proposition, not a number, and so P(A) isn’t even a well-formed probability! Probabilities are defined by a hypothesis and a set of past observations, and are undefined without a memory of past observations.

Let the symbol $R_i$ stand for the proposition, “On the ith draw, a red ball comes up,” and $W_i$, the proposition, “On the ith draw, a white ball comes up.” Let the symbol B stand for the proposition, “An urn contains N balls, all identical except for color, with M of them red and N−M of them white. We will draw from the urn without replacement, and repeat until a target number of total draws n is reached.” Our inference engine can now generate

$$P(R_1|B) = \frac{M}{N}$$

for the probability that the first ball drawn is red, conditional on observing only that background setup proposition B.

What is the probability of two reds in a row coming up, conditional only on B? By the product rule:

$$P(R_1R_2|B) = P(R_1|B)\,P(R_2|R_1B)$$

Combining this with our previous equation:

$$P(R_1R_2|B) = \frac{M}{N}\,P(R_2|R_1B)$$

As for the second factor, the new background observation $R_1B$ asserts that one ball, and one red ball specifically, has been removed. Thus:

$$P(R_2|R_1B) = \frac{M-1}{N-1}$$

$$P(R_1R_2|B) = \frac{M}{N}\,\frac{M-1}{N-1}$$

These fractions can quickly become visually cluttered, depending on how general the proposition you’re asking about (left of the conditional bar) is. Remember that these busy fractions are just reduced products of many simpler fractions, and any strange extra terms appearing in the numerator and denominator are there in order to cancel out terms over on the denominator or numerator, respectively.

## Expectations

## Sampling with Replacement

Suppose the inference engine draws one ball from the urn, examines it, and then returns it. Exactly how it returns the ball will logically determine much about what its next draw will look like. Setting that ball exactly on top of the heap means that the next draw will be the same ball, with probability 1. Leaving the ball somewhere in the top half of the heap and then sampling from that top half means the probability of a same-color draw must be elevated somewhat.

The procedure the inference engine adopts is to embed the ball in the urn and then vigorously shake it. The inference engine remembers that it drew a red or white on its previous draw. Now, however, that background proposition (right of the conditional bar) doesn’t logically fix much of anything about what will be drawn next. That replacement procedure logically decorrelated the inference engine’s remembered observations from its upcoming observation. So, as a best approximation, the inference engine declares its past observations irrelevant to the question of what it will next draw. In symbols, supposing that it drew white last time,

$$P(R_2|W_1B') = \frac{M}{N}$$

where B′ represents the above reasoning and known problem setup.
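The two sampling schemes can be contrasted by brute-force counting. This is a sketch under explicit assumptions: every ordered pair of draws is treated as equally plausible, and the "vigorous shake" is idealized as making any ball, including the one just returned, equally available on the second draw. The urn size and helper names are made up for illustration.

```python
from fractions import Fraction
from itertools import permutations, product

def prob(event, given, outcomes):
    """P(event | given), by counting equally plausible outcomes."""
    conditioned = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in conditioned if event(o)), len(conditioned))

def anything(o): return True
def first_red(o): return o[0] == "R"
def first_white(o): return o[0] == "W"
def second_red(o): return o[1] == "R"
def both_red(o): return o[0] == "R" and o[1] == "R"

# Hypothetical urn: N = 7 balls, M = 3 of them red.
N, M = 7, 3
balls = ["R"] * M + ["W"] * (N - M)

# Without replacement: ordered pairs of two distinct balls.
without = list(permutations(balls, 2))
# With replacement (idealized shake): the same ball may come up twice.
with_repl = list(product(balls, repeat=2))

p_r1 = prob(first_red, anything, without)
p_r2_given_r1 = prob(second_red, first_red, without)
p_r1r2 = prob(both_red, anything, without)
p_r2_given_w1_repl = prob(second_red, first_white, with_repl)

print(p_r1, p_r2_given_r1, p_r1r2, p_r2_given_w1_repl)
```

Without replacement, the counts reproduce M/N, (M−1)/(N−1), and their product; under the idealized replacement scheme, the remembered first draw leaves the second-draw probability at M/N.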

## 4. Elementary Hypothesis Testing

Call the proposition containing the entirety of what our inference engine has observed and otherwise contains in its head X. Call the new data revealed to the robot in the context of the problem at hand D.

Let H stand for a hypothesis to be tested. Then, by the product rule,

$$P(H|DX) = P(H|X)\,\frac{P(D|HX)}{P(D|X)}$$

The term P(H|X) is then our prior for the hypothesis. Left of the equals sign, the term P(H|DX) is called our posterior for the hypothesis, because it conditions on our data-at-hand D and our inference engine’s background memory X. The term P(D|HX) is called the likelihood; in other usages, the likelihood is also termed the ‘sampling distribution’ (p. 89). The overall equation, of course, whatever propositions are fed into it, is called Bayes’ theorem.

## Testing Binary Hypotheses with Binary Data
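A minimal sketch of Bayes’ theorem for a hypothesis H and its negation follows. The denominator P(D|X) is expanded with the sum and product rules as P(D|HX)P(H|X) + P(D|H̄X)P(H̄|X); the function name and the numbers are illustrative assumptions.

```python
from fractions import Fraction

def posterior(prior_h, like_h, like_not_h):
    """P(H|DX) from the prior P(H|X) and the likelihoods P(D|HX), P(D|~H X)."""
    # P(D|X), expanded over H and its negation.
    p_data = like_h * prior_h + like_not_h * (1 - prior_h)
    return prior_h * like_h / p_data

# Hypothetical inputs: prior 1/5, data nine-in-ten likely under H, three-in-ten otherwise.
prior = Fraction(1, 5)
p_h = posterior(prior, Fraction(9, 10), Fraction(3, 10))

# The same function with the roles swapped gives the posterior for the negation.
p_not_h = posterior(1 - prior, Fraction(3, 10), Fraction(9, 10))

print(p_h, p_not_h)
```

Swapping the roles of H and its negation reuses the same formula, and the two posteriors sum to 1.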

Bayes’ theorem also happily applies in the case where we are testing for H’s falsity:

$$P(\bar{H}|DX) = P(\bar{H}|X)\,\frac{P(D|\bar{H}X)}{P(D|X)}$$

The ratio of these two posteriors, for H and for $\bar{H}$, both conditional on D and X, is

$$\frac{P(H|DX)}{P(\bar{H}|DX)} = \frac{P(H|X)}{P(\bar{H}|X)}\,\frac{P(D|HX)}{P(D|\bar{H}X)}$$

and is called the odds on H, conditional on D and X (p. 90).

Taking the logarithm of the odds on a proposition enables the clean adding up of odds in this problem.^{[6]} Which logarithm base (and unit coefficient) we choose fixes our evidential unit. Jaynes likes evidential decibels (dB), as he thinks these are easy to intuitively interpret. Decibels are the unit you get with a logarithm base (and unit coefficient) of 10, meaning you use the expression $10\log_{10}\frac{P(H|DX)}{P(\bar{H}|DX)}$ dB. With a logarithm base of 2, you get evidential bits (b): you use $\log_2\frac{P(H|DX)}{P(\bar{H}|DX)}$ b.

Let A be, “We chose a bad batch, with $\frac{1}{3}$ defective,” and B be, “We chose a good batch, with $\frac{1}{6}$ defective.” From our prior X, we know that one or the other is true:

$$\bar{A} = B \qquad \bar{B} = A$$

Because our prior is that there are 11 machines, only one of which turns out bad batches, with no further information,

$$p(A|X) = \frac{1}{11}$$

and the evidence in decibels for A is

$$10\log_{10}\frac{1/11}{10/11}\ \text{dB} = -10\ \text{dB}$$

(and the decibels for B, conversely, are 10 dB).

If we draw a broken widget from our box on the first draw, we add $10\log_{10}\frac{P(\text{bad}|AX)}{P(\text{bad}|\bar{A}X)}$ dB to our existing −10 dB for A. The inference engine knows that

$$P(\text{bad}|AX) = \frac{1}{3}$$

and that

$$P(\text{bad}|\bar{A}X) = \frac{1}{6}$$

so this would mean adding 3 dB of evidence for A on a first bad draw.

If we’re sampling from a small batch of widgets, the probabilities for good and bad draws will now update, as we saw in the section on sampling without replacement. If batch sizes are much larger (by, say, at least two orders of magnitude) than test sizes, our inference engine’s probabilities for good and bad don’t appreciably update (pp. 70-1, 94-5). Instead, the inference engine approximates this problem as sampling from a binomial distribution, as previously discussed. Thus, every bad draw will continue to constitute 3 dB of evidence for A.

Similarly, each good draw will constitute −0.97 ≈ −1 dB of evidence for A, that is, about 1 dB of evidence against it (pp. 94-5).
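The decibel bookkeeping above can be reproduced in a few lines. This sketch assumes the binomial approximation described in the text, so every draw contributes a fixed step; the function names are illustrative, and the good-draw probabilities 2/3 and 5/6 are simply the complements of the stated defect rates.

```python
import math

def evidence_db(p):
    """Evidence for a proposition with probability p: 10*log10(p / (1 - p)) dB."""
    return 10.0 * math.log10(p / (1.0 - p))

def step_db(p_datum_given_h, p_datum_given_not_h):
    """Evidence contributed by one datum: 10*log10 of the likelihood ratio."""
    return 10.0 * math.log10(p_datum_given_h / p_datum_given_not_h)

prior_a = evidence_db(1 / 11)        # prior evidence for the bad batch A
bad_step = step_db(1 / 3, 1 / 6)     # each defective widget
good_step = step_db(2 / 3, 5 / 6)    # each working widget

def evidence_for_a(n_bad, n_good):
    """Evidence for A after n_bad defective and n_good working draws."""
    return prior_a + n_bad * bad_step + n_good * good_step

print(round(prior_a, 2), round(bad_step, 2), round(good_step, 2))
print(round(evidence_for_a(4, 0), 2))
```

Because the steps add, four defective draws in a row are enough to move A from −10 dB to slightly above 0 dB, i.e. to slightly better than even odds.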

## Digression on Another Derivation

Because the hypothesis space of our inference engine in that problem was just {A,B}, it could not reason its way to any outside, further hypothesis C. I won’t review all of Jaynes’ work here, but the machinery above can be extended to a larger hypothesis space containing C.

As a brief preview, though, let C be the proposition, “We chose a horrifying batch, with $\frac{99}{100}$ defective.” Give C a starting evidential base of −60 dB (incredibly unlikely!). Now, have our inference engine progressively sample more and more widgets from the box, having all the widgets come up defective. Its evidential judgements go as follows:

The inference engine initially comes to favor the bad batch hypothesis A over the good batch hypothesis B as it samples more and more uniformly broken widgets. But after enough broken widgets, the horrifying batch hypothesis C rises above even B!
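The three-hypothesis run can be sketched by applying Bayes’ theorem over {A, B, C} directly. The assumptions here are mine: C starts at exactly −60 dB, A and B split the remaining prior in the same 1:10 ratio as before, every draw is defective, and the binomial approximation holds so that each hypothesis has a fixed defect rate.

```python
import math

# Prior: C at -60 dB; A and B share the rest in a 1:10 ratio (assumption).
odds_c = 10 ** (-60 / 10)
p_c = odds_c / (1 + odds_c)
priors = {"A": (1 - p_c) / 11, "B": (1 - p_c) * 10 / 11, "C": p_c}

# Probability of a defective widget under each hypothesis.
p_bad = {"A": 1 / 3, "B": 1 / 6, "C": 99 / 100}

def evidence_after_bad_draws(m):
    """Evidence in dB for each hypothesis after m defective draws in a row."""
    joint = {h: priors[h] * p_bad[h] ** m for h in priors}
    result = {}
    for h in joint:
        rest = sum(v for k, v in joint.items() if k != h)
        result[h] = 10.0 * math.log10(joint[h] / rest)  # odds of h against the others
    return result

def leader(m):
    """Hypothesis with the most evidence after m defective draws."""
    e = evidence_after_bad_draws(m)
    return max(e, key=e.get)

for m in (0, 4, 8, 11, 20):
    e = evidence_after_bad_draws(m)
    print(m, {h: round(v, 1) for h, v in e.items()}, leader(m))
```

Under these assumptions A overtakes B at the fourth defective draw, C overtakes B at the eighth, and C overtakes A at the eleventh.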

Notice that everything we’ve seen has ultimately boiled down to product and sum rule manipulations of expressions P(A|B)! There’s no mathematical split between sampling and hypothesis testing, and this suggests that the apparent conceptual split between the two is similarly illusory.

The conceptual viewpoint instead suggested is that all of these manipulations be thought of as logical implications of an inference engine’s observations. That’s where the book’s subtitle, The Logic of Science, comes from, and where Jaynes sources his constant refrain, “probability theory as extended logic.”

## 5. Queer Uses for Probability Theory

## 6. Elementary Parameter Estimation

## 7. The Central, Gaussian, or Normal Distribution

## 8. Sufficiency, Ancillarity, and All That

## 9. Repetitive Experiments: Probability and Frequency

## 10. Physics of ‘Random Experiments’

## 11. Discrete Prior Probabilities: The Entropy Principle

## 12. Ignorance Priors and Transformation Groups

## 13. Decision Theory, Historical Background

## 14. Simple Applications of Decision Theory

## 15. Paradoxes of Probability Theory

## 16. Orthodox Methods: Historical Background

## 17. Principles and Pathology of Orthodox Statistics

## 18. The Ap Distribution and the Rule of Succession

## 19. Physical Measurements

## 20. Model Comparison

## 21. Outliers and Robustness

## 22. Introduction to Communication Theory

$\exp(x) := a^x$, where a is some fixed positive real number besides 1.

The proof is originally due to R. T. Cox, in The Algebra of Probable Inference (1961).

$f^n(x) := (f(x))^n$

Actually, this further requires that every piece of data $D_i$ we convert into evidential bits (or whatever unit) be logically independent of every other piece of data we convert. I.e.,

$$P(D_j|D_k H X) = P(D_j|HX)$$

for all j, k, j≠k (p. 91-2).