Insights from “All of Statistics”: Probability
This post is something like an abridged summary. It’s definitely not a self-contained introduction to the field; I’ve included all important subjects as sections mostly to remind myself, but I skip over many of them and make no effort to introduce every concept I write about. The primary purpose is learning by writing; beyond that, it may be useful for people already familiar with statistics to refresh their knowledge.
I’m also including more of my own takes than I’ve done in my previous posts. For example, if I sound dismissive of Frequentism, this is not the book’s fault.
My verdict on the book’s quality is that it’s one of the weakest ones in MIRI’s guide (but the competition is tough). This is mostly based on my recurring feeling that ‘this could have been explained better’. Nonetheless, I’m not familiar with a better book on statistics.
The book is structured in three parts: (1) Probability, (2) Statistical Inference, and (3) Statistical Models and Methods. Since this post was getting too long, I’ve decided to structure my summary in the same way: this post covers probability, the other post covers Statistical Inference, and there may be a third post in the future covering Statistical Models and Methods.
1. Probability spaces and distributions
The general notion of a probability space includes $\sigma$-algebras, but the book gets by without them. Thus, a probability space is either discrete with a probability mass function (pmf) or continuous with a probability density function (pdf). I denote both by writing $f$. In both cases, $f$ determines a distribution $P$ that assigns probability mass to subsets of the sample space $\Omega$.
1.3. Finite Sample Spaces
1.4. Conditional Probability
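Given two events $A, B$, the formula everyone learns in high school is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
In university courses, it may also be mentioned that $P(\cdot \mid B)$ is a probability distribution over $B$ for any nonempty set $B$ (with $P(B) > 0$).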
I prefer to take that as the starting point, i.e., “evaluate event $A$ in the probability distribution over $B$” rather than “plug $A$ and $B$ into the formula above”. This seems to be what people do when they’re not trying to do math. If I hear that a die came up either $3$ or $4$ or $5$ and consider the probability that it’s at least a $5$, I don’t think “let’s take the intersection of $A = \{5, 6\}$ and $B = \{3, 4, 5\}$; let’s compute $P(A \cap B) = \frac16$; let’s compute $P(B) = \frac12$; let’s divide to get $\frac13$”. Instead, I think something like, “it can’t be 6, there are three remaining (equally probable) possibilities, so it’s $\frac13$”. Because of this, I like the notation $P_B(A)$ rather than $P(A \mid B)$.
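The “restrict, then evaluate” view can be sketched in a few lines of Python. This is only an illustration: the function name `conditional_probability` and the specific die events (came up 3, 4 or 5; at least a 5) are my own assumptions, chosen to make the numbers concrete.

```python
from fractions import Fraction


def conditional_probability(event, given, pmf):
    """P(event | given), computed by renormalising the pmf over `given`."""
    mass_given = sum(pmf[w] for w in given)
    if mass_given == 0:
        raise ValueError("cannot condition on an event of probability zero")
    # Step 1: restrict the distribution to `given` and renormalise it.
    restricted = {w: pmf[w] / mass_given for w in given}
    # Step 2: evaluate `event` in the restricted distribution.
    return sum(p for w, p in restricted.items() if w in event)


die = {face: Fraction(1, 6) for face in range(1, 7)}
at_least_five = {5, 6}          # illustrative event A
came_up_3_4_or_5 = {3, 4, 5}    # illustrative event B
p = conditional_probability(at_least_five, came_up_3_4_or_5, die)
print(p)
```

Note that the intersection never has to be formed explicitly: outcomes outside `given` simply no longer exist in the restricted distribution.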
1.5. Joint Distributions
1.6. Marginal Distributions
1.7. Bayes’ Theorem
See this amazing guide.
2. Random Variables
A random variable (RV) is a way to measure properties of events that we care about. If $\Omega$ denotes all results of throwing five dice, a random variable may count the sum of results, the maximum result, the minimum result, or anything else we are interested in.
The traditional formalism defines a RV as a function $X : \Omega \to \mathbb{R}$.
In advanced probability theory, the sample spaces soon slide into the background. To see why, consider a random variable $X$ that measures how often patients get surgery. In this case, the sample space of $X$ is the set of all possible data sets that we could have obtained by collecting data. Trying to define and deal with this space is impractical, so instead, one generally talks in terms of RVs and their distributions.
To define this formally, let $X : \Omega \to \mathbb{R}$ with probability measure $P$ on $\Omega$. Then, we get a new probability space $\mathbb{R}$ with probability measure $P_X$ defined by $P_X(A) = P(X^{-1}(A))$ for any $A \subseteq \mathbb{R}$. This works regardless of whether $P$ is given by a pmf or pdf. In fact, by the time one gets to continuous spaces, the sample space has already disappeared.
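As a concrete sketch of how a RV moves probability mass from the sample space onto the real numbers, the five-dice example can be enumerated exactly. The helper name `pushforward` and the choice of fair dice are assumptions of this sketch, not something the book prescribes.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: all results of throwing five dice (assumed fair), with uniform pmf.
omega = list(product(range(1, 7), repeat=5))
P = {w: Fraction(1, len(omega)) for w in omega}


def pushforward(X, P):
    """Distribution of the RV X: each value x receives the mass of its preimage."""
    dist = defaultdict(Fraction)
    for w, p in P.items():
        dist[X(w)] += p
    return dict(dist)


dist_sum = pushforward(sum, P)  # RV: sum of the five results
dist_max = pushforward(max, P)  # RV: maximum result
dist_min = pushforward(min, P)  # RV: minimum result
print(dist_max[6], dist_sum[5])
```

Once `dist_sum` has been computed, questions about the sum can be answered without ever looking at `omega` again, which is the sense in which the sample space slides into the background.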
There are two related points I’ve been thinking about while working through the book. The first is a formal detail that most people wouldn’t care about (perhaps because I am an extreme type 1 mathematician). Given the formal definition of a RV as a function $X : \Omega \to \mathbb{R}$, the RV doesn’t ‘know’ its distribution. It also doesn’t know enough to determine the value of statistical functionals like $\mathbb{E}[X]$. Now consider the notation $X \sim \text{Bernoulli}(p)$. This looks like a statement about $X$, but it doesn’t determine what function $X$ is; in fact, the only constraint it puts on the sample space is that $|\Omega| \ge 2$. In this formalism, $X \sim \text{Bernoulli}(p)$ is really a statement about the distribution of $X$.
To resolve this, I’ve transitioned to thinking of any random variable as a pair $(X, \mathcal{P})$, where $X$ is the usual function and $\mathcal{P}$ the entire (discrete or continuous or general) probability space. Everyone already talks about RVs as if they knew their distributions, so this formalism simply adopts that convention. Now, $X \sim \text{Bernoulli}(p)$ really is a statement about $X$. The world is well again.
The second point is that the distribution does not fully characterize a RV, and this begins to matter as soon as random variables are combined. To demonstrate this, suppose we have two RVs $X, Y$. Let $f_X$ be the pdf of $X$ and $f_Y$ be the pdf of $Y$. What is the pdf of $X + Y$ (the RV defined by $(X+Y)(\omega) = X(\omega) + Y(\omega)$)? This question is impossible to answer without knowing how $X$ and $Y$ map elements from their probability space onto $\mathbb{R}$: For example, suppose $X, Y \sim \mathcal{N}(0, 1)$:
if $Y = X$, then $X + Y$ is continuous with pdf $f(x) = \frac{1}{2\sqrt{2\pi}} e^{-x^2/8}$
if $Y = -X$ (this does not violate the assumption that $X$ and $Y$ are equally distributed), then $X + Y$ is discrete with pmf $f(0) = 1$ and $f(x) = 0$ for $x \neq 0$
if $X$ and $Y$ are independent, then $X + Y$ is continuous with pdf $f(x) = \frac{1}{2\sqrt{\pi}} e^{-x^2/4}$
In practice, one either has the assumption that $X$ and $Y$ are independent (which is defined by the condition that, for any sets $A, B \subseteq \mathbb{R}$, the events $X^{-1}(A)$ and $Y^{-1}(B)$ are independent), or one constructs $Y$ as a function of $X$ as above. However, those are only a small subset of all possible relationships that two RVs can have. In general, the distribution of $X + Y$ can take arbitrary forms. Thus, one needs to know both the distribution and the mapping to characterize a RV.
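A small simulation makes the three cases tangible. It is only a sketch under the assumption that both RVs are standard normal (so that $Y = -X$ has the same distribution as $X$); the sample size and seed are arbitrary.

```python
import random

random.seed(0)
N = 100_000


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


xs = [random.gauss(0, 1) for _ in range(N)]
ys_independent = [random.gauss(0, 1) for _ in range(N)]

var_same = variance([x + x for x in xs])          # Y = X: the sum is 2X
var_negated = variance([x + (-x) for x in xs])    # Y = -X: the sum is constant 0
var_independent = variance([x + y for x, y in zip(xs, ys_independent)])

print(var_same, var_negated, var_independent)
```

All three choices of $Y$ have the same marginal distribution, yet the variance of the sum comes out near 4, exactly 0, and near 2 respectively.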
2.2. Important Distributions
I think it’s an underappreciated aspect of probability theory how much it builds on top of itself. So many of the important distributions are derived out of prior distributions.
The simplest possible distribution is given by $P(X = 1) = p$ and $P(X = 0) = 1 - p$. We write $X \sim \text{Bernoulli}(p)$. As with all of these, we don’t specify the sample space $\Omega$.
A Binomial distribution is a Bernoulli distribution sampled $n$ times where $X$ counts the number of samples that came up 1. I call them “hits” and the others “misses”. We write $X \sim \text{Binomial}(n, p)$ and have the formula
$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}.$$
It’s not too difficult to reason why this formula has this exact form.
Instead of repeating Bernoulli experiments a fixed number of times, we now repeat them until we miss once. We write $X \sim \text{Geometric}(p)$ and have the formula
$$P(X = k) = p^k (1 - p).$$
Unfortunately, people couldn’t agree on whether we should count the number of hits or the number of samples (which is always one more), so there are two forms with slightly different formulas. In the above, I count the number of hits.
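Instead of repeating Bernoulli experiments a fixed number of times, or until we get one miss, we now repeat them infinitely often but decrease the probability of hitting so that the expected number of hits remains constant. Our hit probability becomes $p_n = \frac{\lambda}{n}$ where $\lambda$ is the expected number of hits. We write $X \sim \text{Poisson}(\lambda)$ and have the formula
$$P(X = k) = \lim_{n \to \infty} \binom{n}{k} \left(\frac{\lambda}{n}\right)^k \left(1 - \frac{\lambda}{n}\right)^{n-k} = e^{-\lambda} \frac{\lambda^k}{k!}.$$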
Computing this limit is not trivial but doable.
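The limit can at least be checked numerically. The sketch below compares the Binomial pmf with hit probability $\lambda / n$ against the Poisson pmf for growing $n$; the values $\lambda = 3$ and $k = 2$ are arbitrary choices for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k hits in n Bernoulli(p) samples."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k hits under Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam, k = 3.0, 2
for n in (10, 100, 10_000):
    # The hit probability shrinks as n grows, keeping the expected hits at lam.
    print(n, binomial_pmf(k, n, lam / n), poisson_pmf(k, lam))
```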
These are all discrete distributions. The first two are finite, and the latter two countably infinite. The step to continuous distributions is a bigger change.
Instead of repeating Bernoulli experiments a fixed number of times, or until we get a miss, or infinitely often while holding the expected number of hits constant, we now repeat it infinitely often without holding the expected number of hits constant. If $p$ remains the parameter of our Bernoulli distribution, we get
$$P(X = k) \approx \frac{1}{\sqrt{2\pi np(1-p)}} \exp\left(-\frac{(k - np)^2}{2np(1-p)}\right)$$
for $k$ close to $np$.
The mean and variance of a binomial distribution are $np$ and $np(1-p)$, respectively. That means one parameter is enough to control them both, and we can see both of those terms in the formula above. However, it turns out you can also tweak this distribution so that they come apart.
For a proper normal distribution, we write $X \sim \mathcal{N}(\mu, \sigma^2)$ and have the equation
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
and this is, by many measures, the most important distribution. Unfortunately, it’s also one of the hardest to compute probabilities with analytically.
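The uniform distribution has constant pdf $f(x) = \frac{1}{b-a}$ in $[a, b]$, and $f(x) = 0$ for $x \notin [a, b]$. We write $X \sim \text{Uniform}(a, b)$.
The beta distribution, written $X \sim \text{Beta}(\alpha, \beta)$, has pdf $f(x) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}$ on $[0, 1]$, where $\Gamma$ is the Gamma function. I’m yet to need the explicit formula for $\Gamma$.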
Among other things, the beta distribution can describe
the distribution of the $k$-th highest number across $n$ RVs sampled with the uniform distribution
if $X \sim \text{Bernoulli}(p)$ with $p$ unknown, the pdf of $p$ after doing a Bayesian update on samples of $X$, provided that the prior of $p$ was either uniform or itself a Beta distribution.
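The Bayesian-update claim can be checked numerically. The sketch below assumes a uniform prior and an invented data set of 7 hits and 3 misses, computes the posterior on a grid by brute force, and compares it with the pdf of $\text{Beta}(\text{hits} + 1, \text{misses} + 1)$. The function names are my own.

```python
import math


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, written via the Gamma function."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * x ** (a - 1) * (1 - x) ** (b - 1)


def grid_posterior(hits, misses, grid_size=10_000):
    """Posterior density of p on a midpoint grid, starting from a uniform prior."""
    width = 1 / grid_size
    points = [(i + 0.5) * width for i in range(grid_size)]
    # Bayes: posterior is proportional to prior (constant 1) times likelihood.
    unnormalised = [p**hits * (1 - p) ** misses for p in points]
    total = sum(unnormalised) * width
    return points, [u / total for u in unnormalised]


hits, misses = 7, 3  # hypothetical observations
points, density = grid_posterior(hits, misses)
max_gap = max(
    abs(d - beta_pdf(p, hits + 1, misses + 1)) for p, d in zip(points, density)
)
print(max_gap)
```

The two densities agree up to the discretisation error of the grid.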
2.3. Statistical Functionals
A statistical functional is an operator that takes the distribution of one or more random variable(s) as input. In this case, we consider only statistical functionals that return a single real number that provides some information about the distribution. Statistical functionals aren’t functions because you can’t define their domain, but I believe they are what’s called class functions in set theory.
You might think that only the expected value and variance are important, but in fact, there are a surprising number of relevant statistical functionals. For most of them, closed formulas expressing their results in terms of the distribution’s parameters are known (and in many cases, deriving them isn’t too difficult).
A possible definition for the expected value is
$$\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) \cdot P(\{\omega\}).$$
(This also has a continuous version). Of course, as I’ve gone over before, no one thinks in terms of ‘events from the sample space and their probability’, but only ‘possible outcomes of $X$ and their probability’. Thus, we instead have the formula
$$\mathbb{E}[X] = \sum_{x \in X(\Omega)} x \cdot f_X(x).$$
One can write $\sum_x$ instead of $\sum_{x \in X(\Omega)}$, but I like this version as a reminder that we’re looking at image points.
An advantage of this formula is the following result:
$$\mathbb{E}[g(X)] = \sum_{x \in X(\Omega)} g(x) \cdot f_X(x).$$
Again, this works because we already think in terms of the function $f_X$ that gives the probability for results. For the original expectation, we ask “for each outcome $x$, multiply $x$ with the probability of $x$”; now, we ask “for each outcome $x$, multiply the transformed outcome $g(x)$ with the probability of $x$”.
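In code, both formulas are the same one-liner over the pmf, with the transformation as an optional argument. A minimal sketch, assuming a fair die as the example distribution:

```python
from fractions import Fraction

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)]: sum of g(x) times the probability of x over the image points x."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expectation(die_pmf)                          # E[X] = 7/2
mean_square = expectation(die_pmf, lambda x: x**2)   # E[X^2] = 91/6
print(mean, mean_square)
```

The pmf of the transformed variable never has to be worked out; only the outcomes are transformed.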
One often writes $\mu_X$ or just $\mu$ for $\mathbb{E}[X]$.
The equation $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$ holds whenever $X$ and $Y$ are independent. On the other hand, the equation $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$ always holds, and I find that people (in everyday life) tend to make arguments that directly contradict this rule. One example is the argument that we should be indifferent between two actions because most of the relevant variables are unknown. (The known ones already increase the expected value of one action over the other, unless you specifically suspect the unknown ones to skew into the other direction.) A related argument is that high error bars/a possibility of failure implies that taking an action isn’t worth it.
One thing I would do if I were teaching a class on probability is to ask students how the ‘variance’ of a RV should be defined before revealing the textbook definition. One might reasonably guess $\mathbb{E}[|X - \mu|]$, but mathematicians have an undying preference for squaring over taking absolute values, so it’s $\mathbb{V}[X] = \mathbb{E}[(X - \mu)^2]$ instead. One can prove that $\mathbb{V}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.
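To see the competing definitions side by side, here is an exact computation for a fair die (an arbitrary example of mine). It evaluates the absolute-deviation guess, the textbook variance, and the shortcut formula.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}  # fair die, illustrative choice

mu = sum(x * p for x, p in pmf.items())
# The "reasonable guess": expected absolute deviation from the mean.
mean_abs_deviation = sum(abs(x - mu) * p for x, p in pmf.items())
# The textbook definition: expected squared deviation from the mean.
var_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())
# The shortcut: E[X^2] - E[X]^2.
var_shortcut = sum(x**2 * p for x, p in pmf.items()) - mu**2

print(mean_abs_deviation, var_definition, var_shortcut)
```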
2.3.3. Standard Deviation
This is just the square root of the variance. One often writes $\sigma_X$ or just $\sigma$ for $\sqrt{\mathbb{V}[X]}$.
The median is the number at the center of the distribution, i.e., the $m$ such that $F(m) = \frac12$, where $F$ is the cdf of $X$. Since such an $m$ may not exist if $X$ is discrete, we define it as $\inf\{x : F(x) > \frac12\}$. This means we round up, so that the “median” for $\text{Bernoulli}(\frac12)$ is 1.
The mode is the most likely single point of the distribution. It’s easy to see that an arbitrarily small change to the distribution can have an arbitrarily large impact on the mode, making it much less informative than the mean or median. It’s defined as $\operatorname{argmax}_x f(x)$ where $f$ is the pdf or pmf. There are distributions for which no closed formula for the mode is known.
One often sees the term $X - \mu_X$, which is the RV obtained by shifting $X$ so that it now centers around $0$ and is hence called “centered”. The Covariance of two RVs $X$ and $Y$ is defined as the expected product of two points from their centered distributions, i.e.,
$$\text{Cov}(X, Y) = \mathbb{E}[(X - \mu_X)(Y - \mu_Y)].$$
The Covariance tells us whether two RVs are positively or negatively correlated. However, it does not tell us how much they’re correlated because the Covariance increases with the variance of either variable. (For example, $\text{Cov}(2X, Y) = 2\,\text{Cov}(X, Y)$, but the thing we mean by ‘correlation’ should stay the same if we rescale one of the RVs.) We can obtain the correlation by normalizing the Covariance, i.e.,
$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
Thus, the Covariance is like the inner product $\langle x, y \rangle$ on a vector space, and the Correlation like $\frac{\langle x, y \rangle}{\|x\| \, \|y\|}$. The correlation ranges between $-1$ and $1$.
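The rescaling argument is easy to verify on a toy example. The sketch below treats two short, made-up lists as the equally likely joint outcomes of two RVs and shows that doubling one of them doubles the covariance while the correlation stays put.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Expected product of the centered values, with all pairs equally likely."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def correlation(xs, ys):
    """Covariance normalised by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # made-up data
ys = [2.0, 1.0, 4.0, 3.0, 6.0]   # made-up data
doubled = [2 * x for x in xs]

print(covariance(xs, ys), covariance(doubled, ys))
print(correlation(xs, ys), correlation(doubled, ys))
```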
The skewness measures the asymmetry of a distribution. Symmetrical distributions such as $\mathcal{N}(\mu, \sigma^2)$ or the uniform distribution have zero skewness.
As I understand it, the entropy measures the expected amount of information obtained by sampling the distribution once. The entropy is zero if and only if the RV is a constant. For a Bernoulli variable, the entropy is maximal for $p = \frac12$, where it equals 1 (bit).
In the discrete case with pmf $f$, we have
$$H(X) = -\sum_{x} f(x) \log f(x),$$
where the base of the logarithm determines whether we measure the information in bits or something else. As of now, I don’t fully understand how/why this definition measures information. Non-integer numbers of bits are weird.
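A direct implementation of the discrete formula, as a sketch. Terms with zero probability are skipped, following the usual convention that they contribute nothing; the function names are my own.

```python
import math


def entropy(probabilities, base=2):
    """Entropy of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log(p, base) for p in probabilities if p > 0)


def bernoulli_entropy(p):
    return entropy([p, 1 - p])


for p in (0.5, 0.9, 0.99, 1.0):
    print(p, bernoulli_entropy(p))
```

With base 2 the result is in bits: a fair coin yields one bit per sample, a heavily biased coin yields less, and a constant yields none.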
The $k$-th moment of a RV $X$ is defined as $\mathbb{E}[X^k]$.
The $k$-th central moment is defined as $\mathbb{E}[(X - \mu)^k]$.
The $k$-th standardized moment is defined as $\mathbb{E}\left[\left(\frac{X - \mu}{\sigma}\right)^k\right]$.
The expectation is the first moment, the variance is the second central moment, and the skewness is the third standardized moment. I still don’t quite know how to think about moments.
2.4. Independence of RVs
2.5. Multivariate Distributions
2.5.1. Marginal Distribution
2.6. Conditional Distributions
2.7. Transformations of RVs
2.8. Conditional Expectation and Variance
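Given RVs $X$ and $Y$, the function $g(Y)$ defined by $g(y) = \mathbb{E}[X \mid Y = y]$ is a RV since
$$g(Y) : \Omega \xrightarrow{\;Y\;} \mathbb{R} \xrightarrow{\;g\;} \mathbb{R}.$$
Thus, it makes sense to compute $\mathbb{E}[g(Y)]$ or $\mathbb{V}[g(Y)]$. Most people don’t insist on writing it this way, so one finds just $\mathbb{E}[X \mid Y]$ instead of $g(Y)$ (yuck!). We have the formula
$$\mathbb{E}[\mathbb{E}[X \mid Y]] = \mathbb{E}[X],$$
which feels like an extension of the law of total probability. Furthermore, we can define $\mathbb{V}[X \mid Y]$ by $y \mapsto \mathbb{V}[X \mid Y = y]$. In this case, we have the formula
$$\mathbb{V}[X] = \mathbb{E}[\mathbb{V}[X \mid Y]] + \mathbb{V}[\mathbb{E}[X \mid Y]].$$
I’m assuming this would make sense if one were to spend enough time thinking about it.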
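Both formulas can be checked exactly on a small joint distribution. The joint pmf below is invented purely for illustration. The conditional mean and conditional variance are treated as functions of $y$, weighted by $P(Y = y)$.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint pmf of (X, Y); keys are (x, y) pairs.
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 6),
    (3, 1): Fraction(1, 3),
}


def mean(pmf):
    return sum(v * p for v, p in pmf.items())


def var(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())


marginal_x, marginal_y = defaultdict(Fraction), defaultdict(Fraction)
for (x, y), p in joint.items():
    marginal_x[x] += p
    marginal_y[y] += p


def conditional_x(y):
    """pmf of X given Y = y."""
    return {x: p / marginal_y[y] for (x, yy), p in joint.items() if yy == y}


cond_mean = {y: mean(conditional_x(y)) for y in marginal_y}
cond_var = {y: var(conditional_x(y)) for y in marginal_y}

# Expectation and variance of the RVs "conditional mean" and "conditional variance".
expected_cond_mean = sum(cond_mean[y] * q for y, q in marginal_y.items())
var_of_cond_mean = sum(
    (cond_mean[y] - expected_cond_mean) ** 2 * q for y, q in marginal_y.items()
)
expected_cond_var = sum(cond_var[y] * q for y, q in marginal_y.items())

print(expected_cond_mean, mean(marginal_x))
print(expected_cond_var + var_of_cond_mean, var(marginal_x))
```

One way to read the variance formula: the total spread of $X$ splits into the average spread within each value of $Y$ plus the spread between the conditional means.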
Markov’s inequality: the probability that any nonnegative RV ends up being at least $t$ times its mean can at most be $\frac1t$. (If it were more likely than that, this alone would imply that the mean is greater than itself.) In symbols, $P(X \ge t\mu) \le \frac1t$. The other inequalities are much more complicated.
As mentioned before, Jensen’s inequality says that $\mathbb{E}[g(X)] \ge g(\mathbb{E}[X])$ for $g$ convex; the opposite is true for $g$ concave. The only functions both convex and concave are affine linear functions, so the linearity of the expectation, $\mathbb{E}[aX + b] = a\mathbb{E}[X] + b$, can be considered a special case of Jensen’s inequality.
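It turns out my vector space analogy is more than an analogy. The Cauchy-Schwarz inequality says that, if $\langle \cdot, \cdot \rangle$ is an inner product on a vector space, then
$$|\langle x, y \rangle| \le \sqrt{\langle x, x \rangle \langle y, y \rangle}.$$
Take RVs as vectors, define $\langle X, Y \rangle = \mathbb{E}[XY]$ (easy to verify that this is an inner product), and we obtain
$$|\mathbb{E}[XY]| \le \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]},$$
which is identical to $|\text{Cov}(X, Y)| \le \sigma_X \sigma_Y$ if $X$ and $Y$ are centered. The book presents this inequality with $\mathbb{E}|XY|$ (which is at least as large as $|\mathbb{E}[XY]|$ because $|\cdot|$ is convex) on the left side. I don’t know how one shows that this stronger inequality also holds.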
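One way to see why the stronger form holds is to apply the same inner-product inequality to the RVs $|X|$ and $|Y|$ instead of $X$ and $Y$. This is a sketch that is not taken from the book’s text, and it assumes $\mathbb{E}[X^2]$ and $\mathbb{E}[Y^2]$ are finite.

```latex
\mathbb{E}|XY|
  = \mathbb{E}\big[\,|X| \cdot |Y|\,\big]
  = \big\langle |X|, |Y| \big\rangle
  \le \sqrt{\big\langle |X|, |X| \big\rangle \, \big\langle |Y|, |Y| \big\rangle}
  = \sqrt{\mathbb{E}[X^2] \, \mathbb{E}[Y^2]}
```

The last step uses $|X|^2 = X^2$, so taking absolute values changes the left side but not the right side.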
3. Convergence of Random Variables
The limiting behavior of sequences of RVs turns out to be extremely important in statistics because it tells us what we can know in the limit of infinite data. We have three definitions of decreasing strength (meaning that each implies the next) for convergence. All three translate what it means for a sequence of RVs to converge to a statement about what it means for numbers to converge.
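$X_n \xrightarrow{\text{qm}} X$ iff $\mathbb{E}[(X_n - X)^2] \to 0$. (Convergence in quadratic mean.)
$X_n \xrightarrow{P} X$ iff $P(|X_n - X| > \epsilon) \to 0$ for every $\epsilon > 0$. (Convergence in probability.)
$X_n \rightsquigarrow X$ iff $F_n(t) \to F(t)$ at all points $t$ at which $F$ is continuous (where $F_n$ is the cdf of $X_n$ and $F$ is the cdf of $X$). (Convergence in distribution.)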
Convergence in distribution doesn’t imply convergence in probability: here is an example where it matters that a RV isn’t determined by its distribution. For $X$ symmetrical around $0$, the constant sequence $X, X, X, \dots$ “converges” in distribution to $-X$, but it sure doesn’t converge in probability.
Convergence in probability doesn’t imply convergence in quadratic mean: you can construct a RV with increasingly slim chances for increasingly large profits. Let $X_n$ be the RV for the game “flip $n$ coins, win $2^n$ dollars if all come up heads” and we get a counter-example (where each $X_n$ has expected value 1, but $X_n$ converges in probability to the RV that is constant 0).
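The numbers behind this counter-example can be written down exactly. The sketch below assumes the game pays $2^n$ dollars when all $n$ fair coins come up heads and nothing otherwise; the function name is my own.

```python
from fractions import Fraction


def coin_game_summary(n):
    """Win probability, mean, and E[(X_n - 0)^2] for the n-coin game."""
    win_probability = Fraction(1, 2**n)
    payout = 2**n
    mean = payout * win_probability
    second_moment = payout**2 * win_probability
    return win_probability, mean, second_moment


for n in (1, 5, 10, 20):
    print(n, coin_game_summary(n))
```

The chance of seeing anything other than 0 vanishes, which gives convergence in probability to 0. The mean stays at 1 and the expected squared distance from 0 grows like $2^n$, so there is no convergence in quadratic mean.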
The two important theorems are called the Law of Large Numbers and the Central Limit theorem. Let $X_1, X_2, \dots$ be a sequence of i.i.d. RVs. Both theorems are making a statement about the “sample mean”, $\overline{X}_n = \frac1n \sum_{i=1}^n X_i$, of the distribution.
3.2.1. The Weak Law of Large Numbers
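This theorem says that $\overline{X}_n \xrightarrow{P} \mu$, where $\mu$ on the right is the RV that’s always $\mu$, and $\mu = \mathbb{E}[X_i]$ (since the $X_i$ are identically distributed, they all have the same mean). I.e., if one averages across more and more samples, the sample mean concentrates ever more tightly around $\mu$.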
3.2.2. The Central Limit theorem
I used to be confused about this theorem, thinking that it said the following:
$$\overline{X}_n \rightsquigarrow \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)$$
This equation (which is not the central limit theorem) appears to say that $\overline{X}_n$ converges in distribution toward a normal distribution that has increasingly small variance. However, the normal distribution with increasingly small variance just converges toward the constant distribution; also, we already know from the Law of Large Numbers that $\overline{X}_n$ converges in probability, and hence also in distribution, toward $\mu$.
… in fact, the law as stated is not just wrong but ill-defined because $\mathcal{N}(\mu, \frac{\sigma^2}{n})$ is not a fixed distribution (it depends on $n$), so $\overline{X}_n$ “converges” toward a moving target.
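The proper Central Limit theorem is as follows: if $\mathbb{V}[X_i] = \sigma^2$ is finite, then
$$\sqrt{n}\,(\overline{X}_n - \mu) \rightsquigarrow \mathcal{N}(0, \sigma^2).$$
That is, the distribution of $\overline{X}_n - \mu$ does converge toward a constant, but if we scale it up by $\sqrt{n}$ (and hence scale up its variance by $n$), then it converges toward a normal distribution (with non-moving variance).
In practice, the useful part of the Central Limit Theorem is that we can approximate $\overline{X}_n$ by $\mathcal{N}(\mu, \frac{\sigma^2}{n})$ even if $n$ is a constant. (Otherwise, it’s unclear why we should care about $\sqrt{n}\,(\overline{X}_n - \mu)$, and we already know the limiting behavior of $\overline{X}_n$.) This technically doesn’t follow from the theorem as-is since it makes no statements about how fast $\sqrt{n}\,(\overline{X}_n - \mu)$ converges, but there is also a quantitative version. And for a constant $n$, the formulations $\overline{X}_n \approx \mathcal{N}(\mu, \frac{\sigma^2}{n})$ and $\overline{X}_n - \mu \approx \mathcal{N}(0, \frac{\sigma^2}{n})$ and $\sqrt{n}\,(\overline{X}_n - \mu) \approx \mathcal{N}(0, \sigma^2)$ are indeed equivalent.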
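A simulation shows both behaviours at once. This sketch assumes i.i.d. $\text{Uniform}(0, 1)$ samples (so $\mu = \frac12$ and $\sigma^2 = \frac{1}{12}$); the number of replicates and the seed are arbitrary.

```python
import math
import random

random.seed(1)
MU, SIGMA2 = 0.5, 1 / 12  # mean and variance of Uniform(0, 1)


def variance(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def sample_mean(n):
    return sum(random.random() for _ in range(n)) / n


def simulate(n, replicates=2000):
    """Empirical variance of the sample mean and of its sqrt(n)-scaled deviation."""
    means = [sample_mean(n) for _ in range(replicates)]
    scaled = [math.sqrt(n) * (m - MU) for m in means]
    return variance(means), variance(scaled)


results = {n: simulate(n) for n in (10, 100, 1000)}
for n, (var_mean, var_scaled) in results.items():
    print(n, var_mean, var_scaled, SIGMA2)
```

The variance of the sample mean shrinks roughly like $\frac{\sigma^2}{n}$, while the variance of the scaled deviation stays near $\sigma^2$. This only checks the variance, not the shape of the limiting distribution.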
While convergence in distribution doesn’t imply convergence in probability in general, it does do so if the limiting RV is a constant. Thus, the Weak Law of Large Numbers would immediately follow from the Central Limit theorem, except that the Weak Law of Large Numbers also applies in cases where the $X_i$ have infinite variance. An example of a RV with finite mean but infinite variance is the following game: “flip a coin until it gets tails; receive $c^k$ dollars where $k$ is the number of heads”, for any fixed $c \in [\sqrt{2}, 2)$ such as $c = 1.5$.
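To see the finite mean and infinite variance of such a game concretely, one can look at partial sums. The sketch below assumes a fair coin, so that $k$ heads followed by tails has probability $2^{-(k+1)}$, and uses the payout base $c = 1.5$ as an illustrative choice.

```python
def partial_moments(c, terms):
    """Partial sums of E[X] and E[X^2] for payout c**k with P(k heads) = 2**-(k+1)."""
    # c**k / 2**(k+1) is written as (c/2)**k / 2 to keep the floats small.
    mean = sum((c / 2) ** k / 2 for k in range(terms))
    second_moment = sum((c * c / 2) ** k / 2 for k in range(terms))
    return mean, second_moment


for terms in (10, 50, 200):
    print(terms, partial_moments(1.5, terms))
```

The mean is a geometric series with ratio $\frac{c}{2}$, which is below 1, so it converges (to 2 for $c = 1.5$). The second moment has ratio $\frac{c^2}{2}$, which is at least 1, so it grows without bound.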
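A consequence of the ‘practical’ version of the central limit theorem is that the binomial distribution can be approximated by a normal distribution. Let $X \sim \text{Binomial}(n, p)$. Then $X = \sum_{i=1}^n X_i$ with $X_i \sim \text{Bernoulli}(p)$. Right now, $X$ just sums up these variables without dividing by $n$, but it’s trivial to verify that $X = n\overline{X}_n$. Thus, since $\overline{X}_n \approx \mathcal{N}(p, \frac{p(1-p)}{n})$ (here I’ve used that $\mathbb{E}[X_i] = p$ and $\mathbb{V}[X_i] = p(1-p)$ for Bernoulli variables), we have $X \approx \mathcal{N}(np, np(1-p))$.
Here’s an example. Note that, if $Z \sim \mathcal{N}(0, 1)$ (it’s common practice to write everything in terms of a standard normal RV and call that Z), then $\mu + \sigma Z \sim \mathcal{N}(\mu, \sigma^2)$. Thus, if we want to know the probability that at most 4900 out of 10000 coins come up heads, we can compute
$$P(X \le 4900) \approx P(5000 + 50Z \le 4900) = P(Z \le -2),$$
which gives $\approx 0.0228$ according to WolframAlpha.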
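The same number can be reproduced without WolframAlpha, and compared against the exact binomial probability. This is a sketch assuming fair coins; the standard normal cdf is expressed via `math.erf`.

```python
import math


def standard_normal_cdf(z):
    """P(Z <= z) for a standard normal Z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


n, p, cutoff = 10_000, 0.5, 4900
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
normal_estimate = standard_normal_cdf((cutoff - mu) / sigma)

# Exact value for a fair coin: count the outcomes with at most `cutoff` heads.
# Binomial coefficients are built up iteratively with exact integer arithmetic.
count, term = 0, 1
for k in range(cutoff + 1):
    count += term
    term = term * (n - k) // (k + 1)
exact = count / 2**n

print(normal_estimate, exact)
```

The two values agree to about three decimal places. Evaluating the normal cdf at 4900.5 instead of 4900 (a continuity correction) is a common way to shrink the remaining gap.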
3.3 Convergence under Transformation
In which we study what can be said about $g(X_n)$, where $g$ is a function.