Probability Theory: The Logic of Science, Jaynes
Epistemic status: An idiosyncratic walkthrough of the beginning of a much larger textbook.
Probability theory is the study of idealized inference. In particular, it’s the study of a precise formal system that, effectively, generalizes propositional logic to the inductive setting. This formal system is adequate to capture huge swaths of common sense reasoning. Conform to the rules of this formal system, and your inferential power will always match or exceed that offered by received statistical techniques—both in practice and in the theoretical idealization. Or so Jaynes argues, anyways.
There’s so much here. I’ll focus on Chapters 1-4, but will excerpt key bits from throughout.
Contents and Notes
1. Plausible Reasoning
This is a book about mathematically constructing an inference engine.
Our robot is going to reason about propositions. As already indicated above, we shall denote various propositions by italicized capital letters, $\{A, B, C, \text{etc.}\}$, and for the time being we must require that any proposition used must have, to the robot, an unambiguous meaning and must be of the simple, definite logical type that must be either true or false… We do not require that the truth or falsity of such an ‘Aristotelian proposition’ be ascertainable by any feasible investigation; indeed, our inability to do this is usually just the reason why we need the robot’s help.
To state these ideas more formally, we introduce some notation of the usual symbolic logic, or Boolean algebra, so called because George Boole (1854) introduced a notation similar to the following (p. 9).
The logical product $AB$ means the conjunction $A \wedge B$.
The logical sum $A + B$ means the disjunction $A \vee B$.
Both operations are commutative in Boolean algebra, reflecting the fact that the order of the conjuncts or disjuncts in a conjunction or disjunction doesn’t change that sentence’s truth-value:
$$AB = BA, \qquad A + B = B + A$$
A proposition’s negation is denoted by a barred letter:
$$\bar{A} \equiv \neg A$$
The Basic Desiderata
Jaynes has four intuitive desiderata he wants out of his inference engine:
To each proposition about which it reasons, our robot must assign some degree of plausibility, based on the evidence we have given it; and whenever it receives new evidence it must revise these assessments to take that new evidence into account. In order that these plausibility assignments can be stored and modified in the circuits of its brain, they must be associated with some definite physical quantity, such as voltage or pulse duration or a binary coded number, etc. -- however our engineers want to design the details. For present purposes, this means that there will have to be some kind of association between degrees of plausibility and real numbers:
(I) Degrees of plausibility are represented by real numbers.
Desideratum (I) is practically forced on us by the requirement that the robot’s brain must operate by the carrying out of some definite physical process. However, it will appear… that it is also required theoretically; we do not see the possibility of any consistent theory without a property that is equivalent functionally to desideratum (I) (p. 17).
$(A|B)$ stands for some real number, the “conditional plausibility that $A$ is true, given that $B$ is true” (p. 17).
Also, the engine ought to reason in:
(II) Qualitative correspondence with common sense.
Finally we want to give our robot another desirable property for which honest people strive without always attaining: that it always reasons consistently. By this we mean just the three common colloquial meanings of the word ‘consistent’:
(IIIa) If a conclusion can be reasoned out in more than one way, then every possible way must lead to the same result.
(IIIb) The robot always takes into account all of the evidence it has relevant to a question. It does not arbitrarily ignore some of the information, basing its conclusions only on what remains. In other words, the robot is nonideological.
(IIIc) The robot always represents equivalent states of knowledge by equivalent plausibility assignments. That is, if in two problems the robot’s state of knowledge is the same (except perhaps for the labeling of the propositions), then it must assign the same plausibilities to both.
Desiderata (I), (II), and (IIIa) are the basic ‘structural’ requirements on the inner workings of our robot’s brain, while (IIIb) and (IIIc) are ‘interface’ conditions which show how the robot’s behavior should relate to the outer world.
At this point, most students are surprised to learn that our search for desiderata is at an end. The above conditions, it turns out, uniquely determine the rules by which our robot must reason; i.e. there is only one set of mathematical operations for manipulating plausibilities which has all these properties (pp. 18-9).
2. The Quantitative Rules
The Product Rule
Going off of just the above desiderata, we’re going to build up the machinery of logical products. It follows from (I) that the plausibility of the logical product of propositions $A$ and $B$, conditional on some known background proposition $C$, denoted $(AB|C)$, is a real number.
In order for $AB$ to be a true proposition, it is necessary that $B$ is true. Thus the plausibility $(B|C)$ should be involved. In addition, if $B$ is true, it is further necessary that $A$ should be true; so the plausibility $(A|BC)$ is also needed. But if $B$ is false, then of course $AB$ is false independently of whatever one knows about $A$, as expressed by $(A|\bar{B}C)$; if the robot reasons first about $B$, then the plausibility of $A$ will be relevant only if $B$ is true. Thus, if the robot has $(B|C)$ and $(A|BC)$ it will not need $(A|C)$. That would tell it nothing about $AB$ that it did not have already.
Similarly, $(A|B)$ and $(B|A)$ are not needed; whatever plausibility $A$ or $B$ might have in the absence of information $C$ could not be relevant to judgements of a case in which the robot knows that $C$ is true...
Of course, since the logical product is commutative, $AB = BA$, we could interchange $A$ and $B$ in the above statements; i.e. knowledge of $(A|C)$ and $(B|AC)$ would serve equally well to determine $(AB|C) = (BA|C)$. That the robot must obtain the same value for $(AB|C)$ from either procedure is one of our conditions of consistency, desideratum (IIIa).
We can state this in a more definite form. $(AB|C)$ will be some function of $(B|C)$ and $(A|BC)$:
$$(AB|C) = F[(B|C), (A|BC)]$$
(pp. 24-5, notation converted).
If our background knowledge is changed from $C$ to $C'$, such that one of the conjuncts becomes a little more likely, the conjunction must also become a little more likely. So $F(x, y)$, with $x \equiv (B|C)$, $y \equiv (A|BC)$, must be a continuous monotonic increasing function of both $x$ and $y$ (p. 26).
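Expanding out the plausibility $(ABC|D)$ with the help of this function in two ways, we get
$$(ABC|D) = F[(BC|D), (A|BCD)] = F\{F[(C|D), (B|CD)], (A|BCD)\}$$
$$(ABC|D) = F[(C|D), (AB|CD)] = F\{(C|D), F[(B|CD), (A|BCD)]\}$$
Let $x \equiv (C|D)$, $y \equiv (B|CD)$, and $z \equiv (A|BCD)$ here. This establishes another constraint on our function $F$:
$$F[F(x, y), z] = F[x, F(y, z)]$$
This equation flows from the associativity of Boolean algebra together with desideratum (IIIa) (p. 26-7).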
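After further manipulations, Jaynes fixes another equation
$$w[F(x, y)] = w(x)\,w(y)$$
where the function $w$ is defined by
$$w(x) \equiv \exp\left\{\int^{x} \frac{dx}{H(x)}\right\}$$
with the function $H$ being arbitrary and the omitted lower limit of integration being some constant (p. 28). “By its construction… $w(x)$ must be a positive continuous monotonic function, increasing or decreasing according to the sign of $H(x)$; at this stage it is otherwise arbitrary” (p. 29).
Applying $w$ to all expressions in the equation $(AB|C) = F[(B|C), (A|BC)]$, we get
$$w(AB|C) = w\{F[(B|C), (A|BC)]\}$$
And using what we’ve just proven about $w$ and $F$, we derive the product rule:
$$w(AB|C) = w(A|BC)\,w(B|C) = w(B|AC)\,w(A|C)$$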
The requirements of qualitative correspondence with common sense impose further conditions on the function $w(x)$. For example, [in the product rule,] suppose that $A$ is certain, given $C$. Then in the ‘logical environment’ produced by knowledge of $C$, the propositions $AB$ and $B$ are the same, in the sense that one is true if and only if the other is true. By our most primitive axiom of all… propositions with the same truth value must have equal plausibility:
$$(AB|C) = (B|C)$$
and also we will have
$$(A|BC) = (A|C)$$
because if $A$ is already certain given $C$ (i.e. $C$ implies $A$), then, given any other information $B$ which does not contradict $C$, it is still certain. In this case...
$$w(B|C) = w(A|C)\,w(B|C)$$
and this must hold no matter how plausible or implausible $B$ is to the robot. So our function $w(x)$ must have the property that
$$w(A|C) = 1$$
Now suppose that $A$ is impossible, given $C$. Then the proposition $AB$ is also impossible given $C$:
$$(AB|C) = (A|C)$$
and if $A$ is already impossible given $C$ (i.e. $C$ implies $\bar{A}$), then, given any further information $B$ which does not contradict $C$, $A$ would still be impossible:
$$(A|BC) = (A|C)$$
In this case...
$$w(A|C) = w(A|C)\,w(B|C)$$
and again this equation must hold no matter what plausibility $B$ might have. There are only two possible values of $w(A|C)$ that could satisfy this condition: it could be zero or $+\infty$…
In summary, qualitative correspondence with common sense requires that $w(x)$ be a positive continuous monotonic function. It may be either increasing or decreasing. If it is increasing, it must range from zero for impossibility up to one for certainty. If it is decreasing, it must range from $+\infty$ for impossibility down to one for certainty. Thus far, our conditions say nothing about how it varies between these limits (p. 29-30).
It’s easy to translate back and forth between the scales $[0, 1]$ and $[1, \infty]$ by just defining a new function
$$w_1(x) \equiv \frac{1}{w(x)}$$
Thus, without loss of generality, Jaynes henceforth adopts the convention $0 \le w(x) \le 1$ for representing the gamut from known impossibility to known certainty (p. 30).
The Sum Rule
The logical product $A\bar{A}$ is always false, the logical sum $A + \bar{A}$ always true. The plausibility that $A$ is false must depend in some way on the plausibility that it is true. If we define $u \equiv w(A|B)$, $v \equiv w(\bar{A}|B)$, there must exist some functional relation
$$v = S(u)$$
Maybe you can see where Jaynes is going here. He isn’t just going to assume as an axiom that all disjoint, collectively exhaustive probabilities sum to $1$, even though that claim seems plenty intuitively compelling. He’s going to again derive that from the above desiderata.
We now have a functional relation $v = S(u)$, as well as the product rule from earlier. The former will have to end up being consistent with the latter, and so the product rule will help to give this function $S$ some shape. I’ll skip over most of Jaynes’ derivation, up to:
Our results up to this point can be summarized as follows. Associativity of the logical product requires that some monotonic function $w(x)$ of the plausibility $x = (A|B)$ must obey the product rule… Our result… states that this same function must also obey a sum rule:
$$w^m(A|B) + w^m(\bar{A}|B) = 1$$
for some positive $m$. Of course, the product rule itself could be written equally well as
$$w^m(AB|C) = w^m(A|C)\,w^m(B|AC) = w^m(B|C)\,w^m(A|BC)$$
but then we see that the value of $m$ is actually irrelevant; for whatever value is chosen, we can define a new function
$$p(x) \equiv w^m(x)$$
and our rules take the form
$$p(AB|C) = p(A|C)\,p(B|AC) = p(B|C)\,p(A|BC)$$
$$p(A|B) + p(\bar{A}|B) = 1$$
Are further relations needed to yield a complete set of rules for plausible inference, adequate to determine the plausibility of any logic function $f(A_1, \ldots, A_n)$ from those of $\{A_1, \ldots, A_n\}$? We have, in the product rule [and the sum rule], formulas for the plausibility of the conjunction $AB$ and the negation $\bar{A}$… Conjunction and negation are an adequate set of operations, from which all logic functions can be constructed.
Therefore, one would conjecture that our search for basic rules should be finished; it ought to be possible, by repeated applications of the product rule and sum rule, to arrive at the plausibility of any proposition in the Boolean algebra generated by $\{A_1, \ldots, A_n\}$.
To verify this, we seek first a formula for the logical sum $A + B$. Applying the product rule and sum rule repeatedly, we have
$$p(A + B|C) = 1 - p(\bar{A}\bar{B}|C) = 1 - p(\bar{A}|C)\,[1 - p(B|\bar{A}C)] = p(A|C) + p(\bar{A}B|C) = p(A|C) + p(B|C)\,[1 - p(A|BC)]$$
$$p(A + B|C) = p(A|C) + p(B|C) - p(AB|C)$$
[This is called the generalized sum rule] (pp. 33-4, notation converted).
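As a quick sanity check on the derivation just quoted, here is a minimal Python sketch that retraces its steps (sum rule, product rule, sum rule again) with exact fractions and compares the result with $p(A+B|C) = p(A|C) + p(B|C) - p(AB|C)$. The input values and the function name are illustrative assumptions, not anything from Jaynes.

```python
from fractions import Fraction


def logical_sum_via_rules(p_a, p_b, p_a_given_b):
    """Return p(A+B|C) using only the product and sum rules.

    Inputs are p(A|C), p(B|C) and p(A|BC). Assumes p(A|C) != 1 so that
    conditioning on not-A is defined.
    """
    # Product rule with the sum rule: p(not-A, B | C) = p(B|C) * [1 - p(A|BC)]
    p_nota_b = p_b * (1 - p_a_given_b)
    # Product rule rearranged: p(B | not-A, C) = p(not-A, B | C) / p(not-A | C)
    p_b_given_nota = p_nota_b / (1 - p_a)
    # Product rule with the sum rule: p(not-A, not-B | C)
    p_nota_notb = (1 - p_a) * (1 - p_b_given_nota)
    # A+B is the denial of (not-A and not-B), so apply the sum rule once more.
    return 1 - p_nota_notb


# Made-up plausibilities, chosen only for illustration.
p_a, p_b, p_a_given_b = Fraction(2, 5), Fraction(3, 10), Fraction(1, 3)
stepwise = logical_sum_via_rules(p_a, p_b, p_a_given_b)
closed_form = p_a + p_b - p_b * p_a_given_b  # generalized sum rule
print(stepwise, closed_form)
```

Exact `Fraction` arithmetic is used so that the two routes can be compared with `==` rather than a floating-point tolerance.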
Jaynes has hitherto carefully avoided using the term ‘probability,’ the conventional notation for it, or the intuitions behind it. But the machinery developed so far has been demonstrated to have all the requisite properties of probability! So Jaynes will now formally christen these values with the conventional symbolism
$$P(A|B)$$
as well as term them probabilities. That is, the machinery of probability theory is hereby considered set up; all that remains is to demonstrate its inferential power.
3. Elementary Sampling Theory
If on background information $B$ the hypotheses $(H_1, \ldots, H_N)$ are mutually exclusive and exhaustive, and $B$ does not favor any one of them over any other, then
$$P(H_i|B) = \frac{1}{N}, \qquad 1 \le i \le N$$
[From this equation and the generalized sum rule, we can derive] the Bernoulli urn rule: if $B$ specifies that $A$ is true on some subset of $M$ of the $H_i$, and false on the remaining $(N - M)$, then
$$P(A|B) = \frac{M}{N}$$
Essentially all of conventional probability theory as currently taught, plus many important results that are often thought to lie beyond the domain of probability theory, can be derived from [just this] foundation (p. 51, notation converted).
Sampling without Replacement
A central refrain of Jaynes’ is that thou shalt not commit the mind-projection fallacy. In this section he’ll talk about the probabilities of drawing certain balls from urns. The probabilistic properties of these balls and urns… aren’t physical characteristics of the balls, urns, an urn’s being stirred, and/or your hand reaching in at all. All the probabilities discussed are features of what an inference engine has observed about the world beforehand. If the inference engine observed that the top of the urn is all red balls, its probability for a red ball on the next draw is $1$. If another inference engine did not get to make that observation, its probability of a red ball on the next draw is different!
Relatedly, in our formalism, observe that while , . is a proposition, not a number, and so isn’t even a well-formed probability! Probabilities are defined by a hypothesis and a set of past observations, and are undefined without a memory of past observations.
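Let the symbol $R_i$ stand for the proposition, “On the $i$th draw, a red ball comes up,” and $W_i$, the proposition, “On the $i$th draw, a white ball comes up.” Let the symbol $B$ stand for the proposition, “An urn contains $N$ balls, all identical except for color, with $M$ of them red and $N - M$ of them white. We will draw from the urn without replacement, and repeat until a target number $n$ of total draws is reached.” Our inference engine can now generate
$$P(R_1|B) = \frac{M}{N}$$
for the probability that the first ball drawn is red, conditional on observing only that background setup proposition $B$.
What is the probability of two reds in a row coming up, conditional only on $B$? By the product rule:
$$P(R_1 R_2|B) = P(R_1|B)\,P(R_2|R_1 B)$$
Combining this with our previous equation:
$$P(R_1 R_2|B) = \frac{M}{N}\,P(R_2|R_1 B)$$
As for the second factor, the new background observation $R_1$ asserts that one ball, and one red ball specifically, has been removed. Thus:
$$P(R_1 R_2|B) = \frac{M}{N} \cdot \frac{M - 1}{N - 1}$$
Continuing in this way, the probability for red on the first $r$ consecutive draws is
$$P(R_1 \cdots R_r|B) = \frac{M(M-1)\cdots(M-r+1)}{N(N-1)\cdots(N-r+1)} = \frac{M!\,(N-r)!}{(M-r)!\,N!}, \qquad r \le M$$
The probability for white on the first $w$ draws is similar but for the interchange of $M$ and $(N - M)$:
$$P(W_1 \cdots W_w|B) = \frac{(N-M)!\,(N-w)!}{(N-M-w)!\,N!}$$
Then, the probability for white on draws $(r+1, r+2, \ldots, r+w)$ given that we got red on the first $r$ draws [is]:
$$P(W_{r+1} \cdots W_{r+w}|R_1 \cdots R_r B) = \frac{(N-M)!\,(N-r-w)!}{(N-M-w)!\,(N-r)!}$$
and so, by the product rule, the probability for obtaining $r$ red followed by $w = n - r$ white in $n$ draws [is],
$$P(R_1 \cdots R_r W_{r+1} \cdots W_n|B) = \frac{M!\,(N-M)!\,(N-n)!}{(M-r)!\,(N-M-w)!\,N!}$$
(pp. 53-4, notation converted).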
These fractions can quickly become visually cluttered, depending on how general the proposition you’re asking about (left of the conditional bar) is. Remember that these busy fractions are just reduced products of many simpler fractions, and any strange extra terms appearing in the numerator and denominator are there in order to cancel out terms over on the denominator or numerator, respectively.
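To make that cancellation concrete, here is a minimal Python sketch that multiplies the simple per-draw fractions one at a time and compares the result with the closed form $\frac{M!\,(N-M)!\,(N-n)!}{(M-r)!\,(N-M-w)!\,N!}$ for $r$ reds followed by $w = n - r$ whites. The function names and the example urn are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial


def red_then_white_stepwise(N, M, r, w):
    """Multiply the per-draw fractions for r reds then w whites.

    Assumes r <= M, w <= N - M (so r + w <= N); no replacement.
    """
    prob = Fraction(1)
    red, white, total = M, N - M, N
    for _ in range(r):
        prob *= Fraction(red, total)  # chance the next ball is red
        red -= 1
        total -= 1
    for _ in range(w):
        prob *= Fraction(white, total)  # chance the next ball is white
        white -= 1
        total -= 1
    return prob


def red_then_white_closed(N, M, r, w):
    """Closed-form factorial expression for the same probability."""
    n = r + w
    numerator = factorial(M) * factorial(N - M) * factorial(N - n)
    denominator = factorial(M - r) * factorial(N - M - w) * factorial(N)
    return Fraction(numerator, denominator)


# Hypothetical urn: 10 balls, 4 red; ask for 2 reds followed by 3 whites.
print(red_then_white_stepwise(10, 4, 2, 3))
print(red_then_white_closed(10, 4, 2, 3))
```

Both calls reduce to the same fraction, which is the point of the remark above: the factorial form is only a compact way of writing the product of the simple fractions.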
If a variable quantity $X$ can take on the particular values $(x_1, \ldots, x_n)$ in $n$ mutually exclusive and exhaustive situations, and the robot assigns corresponding probabilities $(p_1, \ldots, p_n)$ to them, then the quantity
$$\langle X \rangle = E(X) = \sum_{i=1}^{n} p_i x_i$$
is called the expectation… It is a weighted average of the possible values, weighted according to their probabilities.
When the fraction $F = M/N$ of red balls is known, then the Bernoulli urn rule applies and $P(R_1|B) = F$. When $F$ is unknown, the probability for red is the expectation of $F$:
$$P(R_1|B) = \langle F \rangle \equiv E(F)$$
(p. 67, notation converted).
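A small Python sketch of the expectation $\sum_i p_i x_i$, applied to an urn whose red fraction is unknown. The three candidate fractions and their probabilities are made-up numbers for illustration; the helper name `expectation` is likewise an assumption.

```python
from fractions import Fraction


def expectation(values, probs):
    """Probability-weighted average of the possible values."""
    if len(values) != len(probs):
        raise ValueError("values and probs must have the same length")
    if sum(probs) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * x for x, p in zip(values, probs))


# Hypothetical state of knowledge about the red fraction F = M/N.
possible_fractions = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
probabilities = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]

# With F unknown, the probability of red on the first draw is E(F).
p_red_first_draw = expectation(possible_fractions, probabilities)
print(p_red_first_draw)
```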
Sampling with Replacement
Suppose the inference engine draws one ball from the urn, examines it, and then returns it. Exactly how it returns the ball will logically determine much about what its next draw will look like. Setting that ball exactly on top of the heap means that the next draw will be the same ball, with probability $1$. Leaving the ball somewhere in the top half of the heap and then sampling from that top half means the probability of a same-color draw must be elevated somewhat.
The procedure the inference engine adopts is to embed the ball in the urn and then vigorously shake it. The inference engine remembers that it drew a red or white on its previous draw. Now, however, that background proposition (right of the conditional bar) doesn’t logically fix much of anything about what will be drawn next. That replacement procedure logically decorrelated the inference engine’s remembered observations from its upcoming observation. So, as a best approximation, the inference engine declares its past observations irrelevant to the question of what it will next draw. In symbols, supposing that it drew white last time,
$$P(R_{k+1}|W_k B') = P(R_{k+1}|B') = \frac{M}{N}$$
where $B'$ represents the above reasoning and known problem setup.
This is not just a repetition of what we learned [earlier]; what is new here is that the result now holds whatever information the robot may have about what happened in the other trials. This leads us to write the probability for drawing exactly $r$ red balls in $n$ trials, regardless of order, as
$$\binom{n}{r}\left(\frac{M}{N}\right)^{r}\left(\frac{N - M}{N}\right)^{n - r}$$
which is [the binomial distribution]. Randomized sampling with replacement from an urn with finite $N$ has approximately the same effect as passage to the limit $N \to \infty$ without replacement (p. 75, notation converted).
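As a sketch of that formula, the following Python function evaluates $\binom{n}{r}(M/N)^r((N-M)/N)^{n-r}$ exactly and checks that the probabilities over all possible $r$ sum to one. The function name and the example urn are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def binomial_probability(r, n, M, N):
    """Probability of exactly r reds in n randomized draws with replacement,
    from an urn holding N balls of which M are red."""
    f = Fraction(M, N)  # probability of red on any single draw
    return comb(n, r) * f**r * (1 - f) ** (n - r)


# Hypothetical urn: one red ball in three, sampled three times.
distribution = [binomial_probability(r, 3, 1, 3) for r in range(4)]
print(distribution)
print(sum(distribution))
```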
4. Elementary Hypothesis Testing
Call the proposition expressing everything our inference engine has observed or otherwise holds in its head $X$. Call the new data revealed to the robot in the context of the problem at hand $D$.
Any probability $P(A|X)$ that is conditional on $X$ alone is called a prior probability. But we caution that the term ‘prior’ is another of those terms from the distant past that can be inappropriate and misleading today. In the first place, it does not necessarily mean ‘earlier in time’. Indeed, the very concept of time is not in our general theory (although we may of course introduce it in a particular problem). The distinction is a purely logical one; any additional information beyond the immediate data $D$ of the current problem is by definition ‘prior information’.
There is no single universal rule for assigning priors; the conversion of verbal prior information into numerical prior probabilities is an open-ended problem of logical analysis, to which we shall return many times. At present, four fairly general principles are known (group invariance, maximum entropy, marginalization, and coding theory) which have led to successful solutions to many different kinds of problems (pp. 87-8).
Let $H$ stand for a hypothesis to be tested. Then, by the product rule,
$$P(H|DX) = P(H|X)\,\frac{P(D|HX)}{P(D|X)}$$
The term $P(H|X)$ is then our prior for the hypothesis. Left of the equals sign, the term $P(H|DX)$ is called our posterior for the hypothesis, because it conditions on our data-at-hand $D$ and our inference engine’s background memory $X$. The term $P(D|HX)$ is called the likelihood; in other usages, it is also termed the ‘sampling distribution’ (p. 89). The overall equation, of course, whatever propositions are fed into it, is called Bayes’ theorem.
Testing Binary Hypotheses with Binary Data
The simplest nontrivial problem of hypothesis testing is the one where we have only two hypotheses to test and only two possible data values. Surprisingly, this turns out to be a realistic and valuable model of many important inference and decision problems (p. 90).
Bayes’ theorem also happily applies in the case where we are testing for $H$’s falsity:
$$P(\bar{H}|DX) = P(\bar{H}|X)\,\frac{P(D|\bar{H}X)}{P(D|X)}$$
The ratio of these two posteriors, for $H$ and for $\bar{H}$, both conditional on $D$ and $X$, is
$$O(H|DX) \equiv \frac{P(H|DX)}{P(\bar{H}|DX)} = \frac{P(H|X)}{P(\bar{H}|X)} \cdot \frac{P(D|HX)}{P(D|\bar{H}X)}$$
and is called the odds on $H$, conditional on $D$ and $X$ (p. 90).
Taking the logarithm of the odds on a proposition turns the product above into a sum, so evidence adds up cleanly in this problem. Which logarithm base (and unit coefficient) we choose fixes our evidential unit. Jaynes likes evidential decibels (dB), as he thinks these are easy to intuitively interpret. Decibels are the unit you get with a logarithm base (and unit coefficient) of $10$, meaning you use the expression $10 \log_{10} O(H|DX)$ dB. With a logarithm base of $2$, you get evidential bits (b): you use $\log_2 O(H|DX)$ b.
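A minimal Python sketch of these unit conversions: probability to odds, odds to evidential decibels ($10\log_{10}$ of the odds) or bits ($\log_2$ of the odds), and back again. All function names are assumptions made for illustration.

```python
import math


def odds(p):
    """Odds p / (1 - p) on a proposition with probability p (0 < p < 1)."""
    return p / (1 - p)


def evidence_db(p):
    """Evidence in decibels: 10 * log10 of the odds."""
    return 10 * math.log10(odds(p))


def evidence_bits(p):
    """Evidence in bits: log2 of the odds."""
    return math.log2(odds(p))


def probability_from_db(e):
    """Invert evidence_db: recover the probability from decibels."""
    o = 10 ** (e / 10)
    return o / (1 + o)


for p in (0.5, 0.9, 0.99):
    print(p, evidence_db(p), evidence_bits(p))
```

Even odds sit at zero on either scale, and each additional factor of ten in the odds adds 10 dB.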
We shall describe a [concrete] problem of industrial quality control (although it could be phrased equally well as a problem of cryptography, chemical analysis, interpretation of a physics experiment, judging two economic theories, etc.). Following the example of Good (1950), we assume numbers which are not very realistic in order to elucidate some points of principle. Let the prior information consist of the following [proposition]:
“We have $11$ automatic machines turning out widgets, which pour out of the machines into boxes. This example corresponds to a very early stage in the development of the widgets, because ten of the machines produce one in six defective. The $11$th machine is even worse; it makes one in three defective. The output of each machine has been collected in an unlabeled box and stored in the warehouse.”
We choose one of the boxes and test a few widgets, classifying them as ‘good’ or ‘bad’. Our job is to decide whether we chose a box from the bad machine or not; that is, whether we are going to accept this batch or reject it (p. 95).
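Let $A$ be, “We chose a bad batch, with $1/3$ defective,” and $B$ be, “We chose a good batch, with $1/6$ defective.” From our prior $X$, we know that one or the other is true:
$$\bar{A} = B, \qquad \bar{B} = A$$
Because our prior is that there are $11$ machines with no further information,
$$P(A|X) = \frac{1}{11}, \qquad P(B|X) = \frac{10}{11}$$
and the evidence in decibels for $A$ is
$$e(A|X) = 10 \log_{10} \frac{P(A|X)}{P(\bar{A}|X)} = 10 \log_{10} \frac{1/11}{10/11} = -10 \text{ dB}$$
(and the decibels for $B$, conversely, are $+10$ dB).
Evidently, in this problem the only properties of $X$ that will be relevant for the calculation are just these numbers, $\pm 10$ dB. Any other kind of prior information which led to the same numbers would give us just the same mathematical problem from this point on (p. 94).
If we draw a broken widget from our box on the first draw, we add $10 \log_{10} \frac{P(\text{bad}|AX)}{P(\text{bad}|\bar{A}X)}$ dB to our existing $-10$ dB for $A$. The inference engine knows that
$$P(\text{bad}|AX) = \frac{1}{3}, \qquad P(\text{bad}|\bar{A}X) = \frac{1}{6}$$
so this would mean adding $10 \log_{10} 2 \approx 3$ dB of evidence for $A$ on a first bad draw.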
If we’re sampling from a small batch of widgets, the probabilities for good and bad draws will now update, as we saw in the section on sampling without replacement. If batch sizes are much larger (by, say, at least two orders of magnitude) than test sizes, our inference engine’s probabilities for good and bad don’t appreciably update (pp. 70-1, 94-5). Instead, the inference engine approximates this problem as sampling from a binomial distribution, as previously discussed. Thus, every bad draw will continue to constitute $3$ dB of evidence for $A$.
Similarly, each good draw contributes $10 \log_{10} \frac{2/3}{5/6} \approx -0.97$ dB, i.e. roughly $1$ dB of evidence against $A$ (pp. 94-5).
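The bookkeeping in the widget problem can be sketched in a few lines of Python. This assumes the numbers stated in the problem (a prior of 1 in 11 for the bad batch, defect rates of 1/3 and 1/6) and the binomial approximation in which every draw contributes the same increment; the names are hypothetical.

```python
import math

PRIOR_A = 1 / 11          # prior probability of the bad batch
P_BAD_GIVEN_A = 1 / 3     # defect rate of the bad machine
P_BAD_GIVEN_NOT_A = 1 / 6  # defect rate of a good machine

# Evidence increments in dB, constant under the binomial approximation.
DB_PER_BAD = 10 * math.log10(P_BAD_GIVEN_A / P_BAD_GIVEN_NOT_A)
DB_PER_GOOD = 10 * math.log10((1 - P_BAD_GIVEN_A) / (1 - P_BAD_GIVEN_NOT_A))


def evidence_for_bad_batch(n_bad, n_good):
    """Evidence in dB for the bad-batch hypothesis after the given draws."""
    prior_db = 10 * math.log10(PRIOR_A / (1 - PRIOR_A))
    return prior_db + n_bad * DB_PER_BAD + n_good * DB_PER_GOOD


print(DB_PER_BAD, DB_PER_GOOD)
for n_bad in range(6):
    print(n_bad, evidence_for_bad_batch(n_bad, 0))
```

Starting from about $-10$ dB and gaining about 3 dB per defective widget, the evidence for the bad batch crosses zero on the fourth consecutive bad draw.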
Digression on Another Derivation
Because the hypothesis space of our inference engine in that problem was just $\{A, B\}$, it could not reason its way to any outside, further hypothesis $C$. I won’t review all of Jaynes’ work here, but the machinery above can be extended to a larger hypothesis space containing $C$.
As a brief preview, though, let $C$ be the proposition, “We chose a horrifying batch, with $99/100$ defective.” Give $C$ a starting evidential base of $-60$ dB: incredibly unlikely! Now, have our inference engine progressively sample more and more widgets from the box, having all the widgets come up defective. Its evidential judgements go as follows:
The inference engine initially comes to favor the bad batch hypothesis $A$ over the good batch hypothesis $B$ as it samples more and more uniformly broken widgets. But after enough broken widgets, the horrifying batch hypothesis $C$ rises above even $A$!
Whenever the hypotheses [in a discrete hypothesis space] are separated by $10$ dB or more, then multiple hypothesis testing reduces approximately to testing each hypothesis against a single alternative.
In summary, the role of our new hypothesis $C$ was only to be held in abeyance until needed, like a fire extinguisher. In a normal testing situation it is ‘dead’, playing no part in the inference because its probability is and remains far below that of the other hypotheses. But a dead hypothesis can be resurrected to life by very unexpected data (pp. 104-5).
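The resurrection of a ‘dead’ hypothesis can be traced numerically. The sketch below assumes the figures Jaynes uses for this example: a horrifying batch with 99 in 100 defective and a prior of $10^{-6}$ (about $-60$ dB), with the remaining prior probability split 1 to 10 between the bad and good batches. It also assumes the binomial approximation and that every sampled widget is defective. The function and variable names are hypothetical.

```python
import math

PRIOR_C = 1e-6  # about -60 dB for the horrifying batch
HYPOTHESES = {
    "A": {"prior": (1 / 11) * (1 - PRIOR_C), "defect_rate": 1 / 3},
    "B": {"prior": (10 / 11) * (1 - PRIOR_C), "defect_rate": 1 / 6},
    "C": {"prior": PRIOR_C, "defect_rate": 99 / 100},
}


def evidence_after_bad_draws(m):
    """Evidence in dB for each hypothesis after m defective draws in a row."""
    # Unnormalized posterior weight: prior times likelihood of m bad draws.
    weights = {
        name: h["prior"] * h["defect_rate"] ** m
        for name, h in HYPOTHESES.items()
    }
    evidence = {}
    for name, weight in weights.items():
        rest = sum(v for k, v in weights.items() if k != name)
        evidence[name] = 10 * math.log10(weight / rest)  # dB of posterior odds
    return evidence


def leading_hypothesis(m):
    """Name of the hypothesis with the most evidence after m bad draws."""
    evidence = evidence_after_bad_draws(m)
    return max(evidence, key=evidence.get)


for m in range(0, 16):
    ev = evidence_after_bad_draws(m)
    print(m, leading_hypothesis(m), {k: round(v, 1) for k, v in ev.items()})
```

Computing each hypothesis’s odds against the combined alternatives keeps all three hypotheses in play at once, instead of reducing the problem to a binary test.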
Notice that everything we’ve seen has ultimately boiled down to product and sum rule manipulations of expressions $P(A|B)$! There’s no mathematical split between sampling and hypothesis testing, and this suggests that the apparent conceptual split between the two is similarly illusory.
The conceptual viewpoint instead suggested is that all of these manipulations be thought of as logical implications of an inference engine’s observations. That’s where the book’s subtitle, The Logic of Science, comes from, and where Jaynes sources his constant refrain, “probability theory as extended logic.”
5. Queer Uses for Probability Theory
6. Elementary Parameter Estimation
Probability theory as extended logic is an exact mathematical system. That is, results derived from correct application of our rules without approximation have the property of exact results in any other area of mathematics: you can subject them to arbitrary extreme conditions and they continue to make sense (pp. 153-4).
It is not surprising that the binomial prior is more informative about the unsampled balls than are the data of a small sample; but actually it is more informative about them than are any amount of data; even after sampling $99$% of the population, we are no wiser about the remaining $1$%.
So what is the invisible strange property of the binomial prior? It is in some sense so ‘loose’ that it destroys the logical link between different members of the population. But on meditation we see that this is just what was implied by our scenario of the urn being filled by monkeys tossing in balls in such a way that each ball had independently the probability $g$ of being red. Given that filling mechanism, then knowing that any given ball is in fact red, gives one no information whatsoever about any other ball. That is, $P(R_1 R_2|I) = P(R_1|I)\,P(R_2|I)$. This logical independence in the prior is preserved in the posterior distribution (p. 162).
Prior information can tell us whether some hypothesis provides a possible mechanism for the observed facts, consistent with the known laws of physics. If [the hypothesis] does not, then the fact that it accounts well for the data may give it a high likelihood, but it cannot give it any credence. A fantasy that invokes the labors of hordes of little invisible elves and pixies running about to generate the data would have just as high a likelihood; but it would still have no credence (p. 196).
7. The Central, Gaussian, or Normal Distribution
In probability theory, there seems to be a central, universal distribution
$$\varphi(x) \equiv \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{x^2}{2}\right\}$$
toward which all others gravitate under a very wide variety of operations, and which, once attained, remains stable under an even wider variety of operations...
This distribution is called the Gaussian, or normal, distribution, for historical reasons discussed below. Both names are inappropriate and misleading today; all the correct connotations would be conveyed if we called it, simply, the central distribution of probability theory (pp. 199-200, notation converted).
The most ubiquitous reason for using the Gaussian sampling distribution is not that the error frequencies are known to be—or assumed to be—Gaussian, but rather because those frequencies are unknown. One sees what a totally different outlook this is than that of Feller and Barnard; ‘normality’ was not an assumption of physical fact at all. It was a valid description of our state of knowledge. In most cases, had we done anything different, we would be making an unjustified, gratuitous assumption (violating one of our Chapter 1 desiderata of rationality) (p. 210).
The term ‘central limit theorem’… was introduced by George Pólya (1920), with the intention that the adjective ‘central’ was to modify the noun ‘theorem’; i.e. it is the limit theorem which is central to probability theory. Almost universally, students today think that ‘central’ modifies ‘limit’, so that it is instead a theorem about a ‘central limit’, whatever that means...
Our suggested terminology takes advantage of this; looked at in this way, the terms ‘central distribution’ and ‘central limit theorem’ both convey the right connotations to one hearing them for the first time. One can read ‘central limit’ as meaning a limit towards a central distribution, and will be invoking just the right intuitive picture (p. 242).
8. Sufficiency, Ancillarity, and All That
9. Repetitive Experiments: Probability and Frequency
10. Physics of ‘Random Experiments’
11. Discrete Prior Probabilities: The Entropy Principle
Shannon’s theorem: The only function $H(p_1, \ldots, p_n)$ satisfying the conditions we have imposed on a reasonable measure of ‘amount of uncertainty’ is
$$H(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_b p_i$$
[where the $p_i$ terms are the prior probabilities of their index propositions $A_i$, and you have your choice of log base $b$.] Accepting this interpretation, it follows that the distribution $(p_1, \ldots, p_n)$ which maximizes $H$, subject to constraints imposed by the available information, will represent the ‘most honest’ description of what the robot knows about the propositions $(A_1, \ldots, A_n)$.
The function $H$ is called the entropy, or, better, the information entropy of the distribution $\{p_i\}$.
We have seen the mathematical expression $p \log p$ appearing incidentally in several previous chapters, generally in connection with the multinomial distribution; now it has acquired new meaning as a fundamental measure of how uniform a probability distribution is (pp. 348-51, notation converted).
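For a concrete feel for entropy as a measure of uniformity, here is a short Python sketch of $H = -\sum_i p_i \log p_i$. The function name, the default base of 2, and the example distributions are illustrative assumptions; terms with $p_i = 0$ are skipped, following the usual convention that $0 \log 0 = 0$.

```python
import math


def entropy(probs, base=2):
    """Information entropy -sum(p * log(p)) of a discrete distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Skip zero-probability terms, treating 0 * log(0) as 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

for name, dist in (("uniform", uniform), ("skewed", skewed), ("certain", certain)):
    print(name, entropy(dist))
```

With no constraints beyond normalization, the uniform distribution has the largest entropy of the three, and a distribution that is certain of one outcome has none.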
12. Ignorance Priors and Transformation Groups
13. Decision Theory, Historical Background
14. Simple Applications of Decision Theory
15. Paradoxes of Probability Theory
16. Orthodox Methods: Historical Background
17. Principles and Pathology of Orthodox Statistics
18. The $A_p$ Distribution and the Rule of Succession
19. Physical Measurements
20. Model Comparison
21. Outliers and Robustness
22. Introduction to Communication Theory
, where is some fixed positive real number besides .
The proof is originally due to R. T. Cox, in The Algebra of Probable Inference (1961).
Jaynes here wants to define the factorial of a negative integer to be infinite, as this will obviate the need for some restrictions and these equations will continue to yield meaningful results even when $r > M$, or similar.
Derivation of the Hypergeometric and Binomial Distributions
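What is the robot’s probability for drawing exactly $r$ red balls in $n$ draws, regardless of order? Different orders of appearance of red and white balls are mutually exclusive possibilities, so we must sum over all of them; but since each term is equal to
$$\frac{M!\,(N-M)!\,(N-n)!}{(M-r)!\,(N-M-n+r)!\,N!}$$
we merely multiply it by the binomial coefficient
$$\binom{n}{r} \equiv \frac{n!}{r!\,(n-r)!}$$
which represents the number of possible orders of drawing $r$ red balls in $n$ draws (p. 54-5, notation converted).
Let $A$ be the proposition, “Exactly $r$ red balls are drawn in $n$ draws.” We now define a function
$$h(r|N, M, n) \equiv P(A|B) = \frac{\binom{M}{r}\binom{N-M}{n-r}}{\binom{N}{n}}$$
called the hypergeometric distribution (p. 55-6).
The [complexity] of the hypergeometric distribution arises because it is taking into account the changing contents of the urn; knowing the result of any draw changes the probability for red for any other draw. But if the number $N$ of balls in the urn is very large compared with the number drawn, $N \gg n$, then this probability changes very little, and in the limit $N \to \infty$ we should have a simpler result, free of such dependencies. To verify this, we write the hypergeometric distribution [as]
$$h(r|N, M, n) = \frac{\left[\frac{1}{N^r}\binom{M}{r}\right]\left[\frac{1}{N^{n-r}}\binom{N-M}{n-r}\right]}{\left[\frac{1}{N^n}\binom{N}{n}\right]}$$
The first factor [expands to]
$$\frac{1}{N^r}\binom{M}{r} = \frac{1}{r!}\,\frac{M}{N}\left(\frac{M}{N} - \frac{1}{N}\right)\left(\frac{M}{N} - \frac{2}{N}\right)\cdots\left(\frac{M}{N} - \frac{r-1}{N}\right)$$
and in the limit $N \to \infty$, $M \to \infty$, $M/N \to f$, we have
$$\frac{1}{N^r}\binom{M}{r} \to \frac{f^r}{r!}$$
In principle, we should, of course, take the limit of the product… not the product of the limits. But… we have defined the factors so that each has its own independent limit, so the result is the same; the hypergeometric distribution goes into
$$h(r|N, M, n) \to b(r|n, f) \equiv \binom{n}{r} f^r (1 - f)^{n-r}$$
called the binomial distribution (p. 69-70, notation converted).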
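The limit can be watched numerically. This Python sketch compares the hypergeometric probabilities $\binom{M}{r}\binom{N-M}{n-r}/\binom{N}{n}$ with the binomial probabilities $\binom{n}{r}f^r(1-f)^{n-r}$ as the urn grows while the red fraction $f = M/N$ stays fixed. The function names, the fraction 0.3, and the sample size of 5 are assumptions chosen for illustration.

```python
from math import comb


def hypergeometric(r, N, M, n):
    """Probability of exactly r reds in n draws without replacement."""
    return comb(M, r) * comb(N - M, n - r) / comb(N, n)


def binomial(r, n, f):
    """Probability of exactly r reds in n independent draws with P(red) = f."""
    return comb(n, r) * f**r * (1 - f) ** (n - r)


def max_gap(N, M, n):
    """Largest absolute difference between the two distributions over r."""
    f = M / N
    return max(abs(hypergeometric(r, N, M, n) - binomial(r, n, f))
               for r in range(n + 1))


# Keep the red fraction at 0.3 and the sample size at 5; grow the urn.
for N in (20, 2000, 200000):
    print(N, max_gap(N, 3 * N // 10, 5))
```

The gap shrinks as $N$ grows relative to $n$, which is the sense in which drawing a few balls from a very large urn without replacement behaves like binomial sampling.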
Actually, this further requires that every piece of data we convert into evidential bits (or whatever unit) be logically independent of every other piece of data we convert. I.e.,
$$P(D_i|D_j H X) = P(D_i|H X)$$
for all $i$, $j \neq i$, and each hypothesis $H$ under test (p. 91-2).