Insights from “All of Statistics”: Probability

All of Statistics is yet another book on MIRI’s research guide. This book has also been reviewed by Alex Turner.

This post is something like an abridged summary. It’s definitely not a self-contained introduction to the field; I’ve included all important subjects as sections mostly to remind myself, but I skip over many of them and make no effort to introduce every concept I write about. The primary purpose is learning by writing; beyond that, it may be useful for people already familiar with statistics to refresh their knowledge.

I’m also including more of my own takes than I’ve done in my previous posts. For example, if I sound dismissive of Frequentism, this is not the book’s fault.

My verdict on the book’s quality is that it’s one of the weakest ones on MIRI’s guide (but the competition is tough). This is mostly based on my recurring feeling that ‘this could have been explained better’. Nonetheless, I’m not familiar with a better book on statistics.

The book is structured in three parts: (1) Probability, (2) Statistical Inference, and (3) Statistical Models and Methods. Since this post was getting too long, I’ve decided to structure my summary in the same way: this post covers probability, the other post covers Statistical Inference, and there may be a third post in the future covering Statistical Models and Methods.

1. Probability spaces and distributions

The general notion of a probability space includes σ-algebras, but the book gets by without them. Thus, a probability space is either discrete with a probability mass function (pmf) or continuous with a probability density function (pdf). I denote both by writing f. In both cases, f determines a distribution P that assigns probability mass to subsets of the sample space Ω.

1.1. Events

1.2. Independence

1.3. Finite Sample Spaces

1.4. Conditional Probability

Given two events A,B⊂Ω, the formula everyone learns in high school is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$$

In university courses, it may also be mentioned that P(⋅|B) is itself a probability distribution for any event B⊂Ω with P(B)>0.

I prefer to take that as the starting point, i.e., “evaluate event A in the probability distribution over B” rather than “plug A into the formula above”. This seems to be what people do when they’re not trying to do math. If I hear that a die came up either 2 or 3 or 5 and consider the probability that it’s at least a 5, I don’t think “let’s take the intersection of {5,6} and {2,3,5}; let’s compute P({5}); let’s compute P({2,3,5}); let’s divide $\frac{1/6}{1/2}$ to get $\frac{1}{3}$”. Instead, I think something like, “it can’t be 6, there are three remaining (equally probable) possibilities, so it’s $\frac{1}{3}$”. Because of this, I like the notation $P_B(A)$ rather than P(A|B).
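The two ways of thinking can be compared in a small sketch. The helper names `conditional` and `prob` are made up for this illustration; the die and the events are the ones from the paragraph above.

```python
from fractions import Fraction

def prob(pmf, event):
    """P(event) under a discrete pmf given as {outcome: probability}."""
    return sum(p for outcome, p in pmf.items() if outcome in event)

def conditional(pmf, event_b):
    """Return the distribution P_B: restrict the pmf to B and renormalize."""
    mass_b = prob(pmf, event_b)
    return {o: p / mass_b for o, p in pmf.items() if o in event_b}

die = {face: Fraction(1, 6) for face in range(1, 7)}
b = {2, 3, 5}
at_least_five = {5, 6}

# "Evaluate A in the distribution over B".
via_restriction = prob(conditional(die, b), at_least_five)
# "Plug A into the formula".
via_formula = prob(die, at_least_five & b) / prob(die, b)

print(via_restriction, via_formula)  # 1/3 1/3
```

Both routes give the same number; the first one just never mentions the outcome 6.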

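1.5. Joint Distributions

1.6. Marginal Distributions

1.7. Bayes’ Theorem

See this amazing guide.

2. Random Variables

2.1. Fundamentals

A random variable (RV) is a way to measure properties of events that we care about. If $\Omega=\{1,2,3,4,5,6\}^5$ denotes all results of throwing five dice, a random variable may count the sum of results, the maximum result, the minimum result, or anything else we are interested in.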

The traditional formalism defines a RV as a function X:Ω→R.

In advanced probability theory, the sample spaces soon slide into the background. To see why, consider a random variable X that measures how often patients get surgery. In this case, the probability space Ω of X is the set of all possible data sets that we could have obtained by collecting data. Trying to define and deal with this space is impractical, so instead, one generally talks in terms of RVs and their distributions.

To define this formally, let X:Ω→R with probability measure P on Ω. Then, we get a new probability space Ω′:=R (which contains the image X(Ω)) with probability measure P′ defined by $P'(A) := P(X^{-1}(A))$ for any A⊆R. This works regardless of whether P is given by a pmf or pdf. In fact, by the time one gets to continuous spaces, the sample space has already disappeared.
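As a rough illustration of this construction, here is a sketch that builds P′ for the sum of two fair dice (two rather than five, only to keep the sample space small). The function name `pushforward` is an assumption of this sketch, not the book’s terminology.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: all ordered results of throwing two fair dice.
omega = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in omega}

def X(w):
    """The RV 'sum of the results'."""
    return sum(w)

def pushforward(P, X):
    """P'({x}) = P(X^{-1}({x})): add up the mass of every sample point mapped to x."""
    P_prime = defaultdict(Fraction)
    for w, mass in P.items():
        P_prime[X(w)] += mass
    return dict(P_prime)

P_prime = pushforward(P, X)
print(P_prime[7])  # 1/6
```

After this step, questions about X can be answered from `P_prime` alone, without looking at `omega` again.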

There are two related points I’ve been thinking about while working through the book. The first is a formal detail that most people wouldn’t care about (perhaps because I am an extreme type 1 mathematician). Given the formal definition of a RV as a function X:Ω→R, the RV doesn’t ‘know’ its distribution. It also doesn’t know enough to determine the value of statistical functionals like E(X). Now consider the notation X∼Bernoulli(p). This looks like a statement about X, but it doesn’t determine what function X is—in fact, the only constraint it puts on the sample space Ω is that |Ω|>1. In this formalism, X∼Bernoulli(p) is really a statement about the distribution of X.

To resolve this, I’ve transitioned to thinking of any random variable X as a pair (P,X′), where X′:Ω→R is the usual function and P the entire (discrete or continuous or general) probability space. Everyone already talks about RVs as if they knew their distributions, so this formalism simply adopts that convention. Now, X∼Bernoulli(p) really is a statement about X. The world is well again.

The second point is that the distribution does not fully characterize a RV, and this begins to matter as soon as random variables are combined. To demonstrate this, suppose we have two RVs X,Y. Let $f_X$ be the pdf of X and $f_Y$ be the pdf of Y. What is the pdf of X+Y (the RV defined by ω↦X(ω)+Y(ω))? This question is impossible to answer without knowing how X and Y map elements from their probability space onto R. For example, suppose X,Y∼N(0,1):

if Y=X, then X+Y is continuous with pdf $f_{X+Y}(x)=\frac{1}{\sqrt{8\pi}}e^{-\frac{1}{8}x^2}$

if Y=−X (this does not violate the assumption that X and Y are equally distributed), then X+Y is discrete with pmf $f_{X+Y}(0)=1$ and $f_{X+Y}(x)=0$ for x≠0

if X and Y are independent, then X+Y is continuous with pdf $f_{X+Y}(x)=\frac{1}{\sqrt{4\pi}}e^{-\frac{1}{4}x^2}$

In practice, one either has the assumption that X and Y are independent (which is defined by the condition that, for any sets A,B⊂R, the events X−1(A) and Y−1(B) are independent), or one constructs Y as a function of X as above. However, those are only a small subset of all possible relationships that two RVs can have. In general, the distribution of X+Y can take arbitrary forms. Thus, one needs to know both the distribution and the mapping to characterize a RV.
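The three cases can also be checked by simulation. The sketch below draws standard normal samples and compares the variance of X+Y under each coupling; the sample size and seed are arbitrary choices.

```python
import random

def variance(values):
    """Plain (population) variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
n = 100_000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys_independent = [rng.gauss(0, 1) for _ in range(n)]

var_same = variance([x + x for x in xs])             # Y = X: theory says 4
var_negated = variance([x + (-x) for x in xs])       # Y = -X: theory says 0
var_independent = variance(
    [x + y for x, y in zip(xs, ys_independent)]      # independent: theory says 2
)
print(var_same, var_negated, var_independent)
```

In all three runs, X and Y individually have the same N(0,1) distribution; only the mapping differs.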

2.2. Important Distributions

I think it’s an underappreciated aspect of probability theory how much it builds on top of itself. So many of the important distributions are derived out of prior distributions.

2.2.1. Bernoulli

The simplest possible distribution is given by Pr(X=1)=p and Pr(X=0)=1−p. We write X∼Bernoulli(p). As with all of these, we don’t specify the sample space Ω.

2.2.2. Binomial

A Binomial distribution is a Bernoulli distribution sampled n times where X counts the number of samples that came up 1. I call them “hits” and the others “misses”. We write X∼Binomial(n,p) and have the formula

$$P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}$$

It’s not too difficult to reason why this formula has this exact form.

2.2.3. Geometric

Instead of repeating Bernoulli experiments a fixed number of times, we now repeat them until we miss once. We write X∼Geom(p) and have the formula

$$P(X=k)=p^k(1-p)$$

Unfortunately, people couldn’t agree on whether we should count the number of hits or the number of samples (which is always one more), so there are two forms with slightly different formulas. In the above, I count the number of hits.

2.2.4. Poisson

Instead of repeating Bernoulli experiments a fixed number of times, or until we get one miss, we now repeat them infinitely often but decrease the probability of hitting so that the expected number of hits remains constant. Our hit probability p becomes $\frac{\lambda}{n}$ where λ is the expected number of hits. We write X∼Poisson(λ) and have the formula

$$P(X=k)=\lim_{n\to\infty}\binom{n}{k}\left(\frac{\lambda}{n}\right)^k\left(1-\frac{\lambda}{n}\right)^{n-k}=e^{-\lambda}\frac{\lambda^k}{k!}$$

Computing this limit is not trivial but doable.
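Short of computing the limit by hand, one can at least watch it happen numerically. This sketch evaluates the binomial pmf with p=λ/n for growing n and compares it to the Poisson formula; the values λ=3 and k=2 are arbitrary.

```python
import math

def binomial_pmf(n, p, k):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam, k = 3.0, 2
limit = poisson_pmf(lam, k)
errors = []
for n in (10, 100, 10_000):
    value = binomial_pmf(n, lam / n, k)
    errors.append(abs(value - limit))
    print(n, value)
print("Poisson", limit)
```

The gap between the two shrinks as n grows, as the formula above predicts.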

These are all discrete distributions. The first two are finite, and the latter two countably infinite. The step to continuous distributions is a bigger change.

2.2.5. Normal

Instead of repeating Bernoulli experiments a fixed number of times, or until we get a miss, or infinitely often while holding the expected number of hits constant, we now repeat them infinitely often without holding the expected number of hits constant. If p remains the parameter of our Bernoulli distribution, we get, for large n,

$$P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}\approx\frac{1}{\sqrt{2\pi np(1-p)}}\,e^{-\frac{(k-np)^2}{2np(1-p)}}$$

for k close to np.^{[1]}

The mean and variance of a binomial distribution are np and np(1−p), respectively. That means one parameter is enough to control them both, and we can see both of those terms in the formula above. However, it turns out you can also tweak this distribution so that they come apart.

For a proper normal distribution, we write $X\sim N(\mu,\sigma^2)$ and have the equation

$$f(x)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$$

and this is, by many measures, the most important distribution. Unfortunately, it’s also one of the hardest to compute probabilities with analytically.

2.2.6. Uniform

With constant pdf $f(x)=\frac{1}{b-a}$ in [a,b], and f(x)≡0 for x∉[a,b]. We write X∼Uniform(a,b).

2.2.7. Beta

With pdf $f(x)=\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}$ where Γ is the Gamma function. I have yet to need the explicit formula for Γ.

Among other things, the beta distribution can describe

the distribution of the k-th highest number across n RVs sampled with the uniform distribution

if X∼Bernoulli(p) with p unknown, the pdf of p after doing a Bayesian update on samples of X, provided that the prior of p was either uniform or itself a Beta distribution.
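For the Bayesian use case, the update itself is just bookkeeping: a Beta(α,β) prior combined with h hits and m misses gives a Beta(α+h, β+m) posterior. A minimal sketch, with `update_beta` and `beta_mean` as names invented for this example:

```python
def update_beta(alpha, beta, samples):
    """Posterior Beta parameters after observing Bernoulli samples (1 = hit, 0 = miss)."""
    hits = sum(samples)
    misses = len(samples) - hits
    return alpha + hits, beta + misses

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1, 1)  # Beta(1, 1) is the uniform distribution on [0, 1]
posterior = update_beta(*prior, samples=[1, 1, 0, 1])
print(posterior, beta_mean(*posterior))
```

Starting from the uniform prior, three hits and one miss lead to Beta(4, 2).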

2.2.8. Exponential

2.2.9. Gamma

2.2.10. Cauchy

2.3. Statistical Functionals

A statistical functional is an operator from the distribution of one or more random variable(s) to R. In this case, we consider only statistical functionals that return a single real number that provides some information about the distribution. Statistical functionals aren’t functions because you can’t define their domain, but I believe they are what’s called class functions in set theory.

You might think that only the expected value and variance are important, but in fact, there are a surprising number of relevant statistical functionals. For most of them, closed formulas expressing their results in terms of the distribution’s parameters are known (and in many cases, deriving them isn’t too difficult).

2.3.1. Expectation

A possible definition for the expected value is

$$E(X):=\sum_{\omega\in\Omega}P(\omega)\cdot X(\omega)$$

(This also has a continuous version). Of course, as I’ve gone over before, no one thinks in terms of ‘events from the sample space and their probability’, but only ‘possible outcomes of X and their probability’. Thus, we instead have the formula

$$E(X)=\begin{cases}\int_{X(\Omega)}x\,f(x)\,dx & X\text{ is continuous}\\[4pt] \sum_{x\in X(\Omega)}x\,f(x) & X\text{ is discrete}\end{cases}$$

One can write R instead of X(Ω), but I like this version as a reminder that we’re looking at image points.

An advantage of this formula is the following result:

$$E(g(X))=\begin{cases}\int_{X(\Omega)}g(x)\,f(x)\,dx & X\text{ is continuous}\\[4pt] \sum_{x\in X(\Omega)}g(x)\,f(x) & X\text{ is discrete}\end{cases}$$

Again, this works because we already think in terms of the function that gives the probability for results. For the original expectation, we ask “for each outcome x, multiply x with the probability of x”; now, we ask “for each outcome x, multiply the transformed outcome g(x) with the probability of x”.
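To see that this shortcut agrees with the long way round, the sketch below computes E(g(X)) for one fair die twice: once by weighting g(x) with f(x), and once by first building the pmf of the new RV g(X). The choice g(x)=(x−3)² is arbitrary.

```python
from collections import defaultdict
from fractions import Fraction

f = {x: Fraction(1, 6) for x in range(1, 7)}  # pmf of one fair die

def g(x):
    return (x - 3) ** 2

# Route 1: for each outcome x, multiply g(x) with the probability of x.
direct = sum(g(x) * p for x, p in f.items())

# Route 2: build the pmf of g(X) first, then take its ordinary expectation.
f_g = defaultdict(Fraction)
for x, p in f.items():
    f_g[g(x)] += p
via_new_pmf = sum(y * p for y, p in f_g.items())

print(direct, via_new_pmf)  # 19/6 19/6
```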

One often writes $\mu_X$ or just μ for E(X).

The equation E(XY)=E(X)E(Y) holds whenever X and Y are independent. On the other hand, the equation E(X+Y)=E(X)+E(Y) always holds, and I find that people (in everyday life) tend to make arguments that directly contradict this rule. One example is the argument that we should be indifferent between two actions because most of the relevant variables are unknown. (The known ones already increase the expected value of one action over the other, unless you specifically suspect the unknown ones to skew into the other direction.) A related argument is that high error bars/a possibility of failure implies that taking an action isn’t worth it.

2.3.2. Variance

One thing I would do if I were teaching a class on probability is to ask students how the ‘variance’ of a RV should be defined before revealing the textbook definition. One might reasonably guess E(|X−E(X)|)=E(|X−μ|), but mathematicians have an immortal preference for squaring over taking absolute values, so it’s $E((X-\mu)^2)$ instead.^{[2]} One can prove that $V(X)=E(X^2)-E(X)^2$.

2.3.3. Standard Deviation

This is just the square root of the variance. One often writes $\sigma_X$ or just σ for $\sqrt{V(X)}$.

2.3.4 Median

The median is the number at the center of the distribution, i.e., the x such that $F(x)=\frac{1}{2}$, where F is the cdf. Since such an x may not exist if the distribution is discrete, we define it as $\inf\{x\in\mathbb{R}\mid F(x)>\frac{1}{2}\}$. This means we round up, so that the “median” for X∼Bernoulli(1/2) is 1.

2.3.5 Mode

The mode is the most likely single point of the distribution. It’s easy to see that an arbitrarily small change to the distribution can have an arbitrarily large impact on the mode, making it much less informative than the mean or median. It’s defined as $\operatorname{argmax}_{x\in\mathbb{R}}f(x)$ where f is the pdf or pmf. There are distributions for which no closed formula for the mode is known.

2.3.6. Covariance

One often sees the term X−μ, which is the RV obtained by shifting X so that it now centers around 0 and is hence called “centered”. The Covariance of two RVs X and Y is defined as the expected product of two points from their centered distributions, i.e.,

$$\operatorname{Cov}(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]$$

2.3.7. Correlation

The Covariance tells us whether two data sets are positively or negatively correlated. However, it does not tell us how much they’re correlated because the Covariance increases with the variance of either variable. (For example, Cov(X,2Y)=2Cov(X,Y), but the thing we mean by ‘correlation’ should stay the same if we rescale one of the RVs.) We can obtain the correlation by normalizing the Covariance, i.e.,

$$\operatorname{Corr}(X,Y)=\frac{\operatorname{Cov}(X,Y)}{\sigma_X\sigma_Y}$$

Thus, the Covariance is like the inner product ⟨x,y⟩ on a vector space, and the Correlation like $\frac{\langle x,y\rangle}{\|x\|\,\|y\|}$. The correlation ranges between −1 and 1.
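The rescaling behavior is easy to check on a toy example. The joint pmf below is made up for illustration; the sketch computes Covariance and Correlation for (X,Y) and for (X,2Y).

```python
import math

# Hypothetical joint pmf of two binary RVs, as {(x, y): probability}.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def expect(joint, h):
    """E(h(X, Y)) under the joint pmf."""
    return sum(p * h(x, y) for (x, y), p in joint.items())

def cov(joint):
    mu_x = expect(joint, lambda x, y: x)
    mu_y = expect(joint, lambda x, y: y)
    return expect(joint, lambda x, y: (x - mu_x) * (y - mu_y))

def corr(joint):
    mu_x = expect(joint, lambda x, y: x)
    mu_y = expect(joint, lambda x, y: y)
    sigma_x = math.sqrt(expect(joint, lambda x, y: (x - mu_x) ** 2))
    sigma_y = math.sqrt(expect(joint, lambda x, y: (y - mu_y) ** 2))
    return cov(joint) / (sigma_x * sigma_y)

# Replace Y by 2Y: same probabilities, rescaled values.
rescaled = {(x, 2 * y): p for (x, y), p in joint.items()}

print(cov(joint), cov(rescaled))
print(corr(joint), corr(rescaled))
```

The Covariance doubles while the Correlation stays put, which is the normalization doing its job.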

2.3.8. Skewness

The skewness measures the asymmetry of a distribution. Symmetrical distributions such as $N(0,\sigma^2)$ or the uniform distribution have zero skewness. It is defined as

$$\kappa(X)=\int\left(\frac{x-\mu}{\sigma}\right)^3f(x)\,dx$$

2.3.9. Entropy

As I understand it, the entropy measures the expected amount of information obtained by sampling the distribution once. The entropy is zero if and only if the RV is a constant. For a Bernoulli variable, the entropy is maximal at p=0.5, where it equals 1 (measured in bits).

In the discrete case with pmf f, we have

$$H(X)=-\sum_{y\in X(\Omega)}f(y)\log f(y)$$

where the base of the logarithm determines whether we measure the information in bits or something else. As of now, I don’t fully understand how/why this definition measures information. Non-discrete numbers of bits are weird.
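A small sketch of the discrete formula with base-2 logarithms, so the result is in bits. Outcomes with probability zero are skipped, following the usual convention that they contribute nothing; the function name `entropy_bits` is just for this example.

```python
import math

def entropy_bits(probabilities):
    """H = -sum f(y) * log2 f(y), skipping outcomes with probability 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])
constant = entropy_bits([1.0])
print(fair_coin, biased_coin, constant)
```

The biased coin lands strictly between the constant (no information) and the fair coin (one full bit), which is where the non-integer numbers of bits come from.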

2.3.10. Moments

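The k-th moment of a RV X is defined as

$$\int_{-\infty}^{\infty}x^k f(x)\,dx$$

The k-th central moment is defined as

$$\int_{-\infty}^{\infty}(x-\mu)^k f(x)\,dx$$

The k-th standardized moment is defined as

$$\int_{-\infty}^{\infty}\left(\frac{x-\mu}{\sigma}\right)^k f(x)\,dx$$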

The expectation is the first moment, the variance is the second central moment, and the skewness is the third standardized moment. I still don’t quite know how to think about moments.

2.4. Independence of RVs

2.5. Multivariate Distributions

2.5.1. Marginal Distribution

2.6. Conditional Distributions

2.7. Transformations of RVs

2.8. Conditional Expectation and Variance

Given RVs X and Y, the function g defined by g(y):=E(X[Y=y]) is a RV since

$$g:\Omega\to\mathbb{R},\qquad g:\omega\mapsto E(X[Y=\omega])$$

Thus, it makes sense to compute E(g) or V(g). Most people don’t insist on writing it this way, so one finds just E(X|Y) instead (yuck!). We have the formula

E(g)"="E(E(X|Y))=E(X)

which feels like an extension of the law of total probability. Furthermore, we can define h by h(y):=V(X[Y=y]). In this case, we have the formula

V(X)=V(g)+E(h)"="V(E(X|Y))+E(V(X|Y))

I’m assuming this would make sense if one were to spend enough time thinking about it.
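Both formulas can at least be checked exactly on a small example. The joint pmf below is invented for this sketch; `g` and `h` are the conditional mean and conditional variance as functions of y, matching the definitions above.

```python
from fractions import Fraction as F

# Hypothetical joint pmf, as {(x, y): probability}.
joint = {(1, 0): F(1, 8), (3, 0): F(3, 8), (2, 1): F(1, 4), (6, 1): F(1, 4)}

# Marginal pmf of Y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0) + p

def cond_moment(y, k):
    """E(X^k) in the distribution conditioned on Y = y."""
    return sum(p * x**k for (x, yy), p in joint.items() if yy == y) / p_y[y]

g = {y: cond_moment(y, 1) for y in p_y}                 # conditional mean
h = {y: cond_moment(y, 2) - g[y] ** 2 for y in p_y}     # conditional variance

E_X = sum(p * x for (x, _), p in joint.items())
V_X = sum(p * x * x for (x, _), p in joint.items()) - E_X**2
E_g = sum(p_y[y] * g[y] for y in p_y)
V_g = sum(p_y[y] * g[y] ** 2 for y in p_y) - E_g**2
E_h = sum(p_y[y] * h[y] for y in p_y)

print(E_g == E_X, V_X == V_g + E_h)  # True True
```

One way to read the second identity: the total variance splits into the variance between the groups Y=y (that is V(g)) plus the average variance within each group (that is E(h)).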

2.9. Inequalities

2.9.1. Markov

The probability that any non-negative RV ends up being at least k times its mean can be at most $\frac{1}{k}$. (If it were more likely than that, this alone would imply that the mean is greater than itself.) In symbols, $P(X\ge k\mu)\le\frac{1}{k}$. The other inequalities are much more complicated.
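A quick sanity check of the bound on a fair die, which is non-negative with mean 3.5. The particular values of k are arbitrary.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}
mu = sum(x * p for x, p in pmf.items())

def tail(pmf, threshold):
    """P(X >= threshold)."""
    return sum(p for x, p in pmf.items() if x >= threshold)

# Each entry: (k, P(X >= k * mu), Markov bound 1/k).
checks = [
    (k, tail(pmf, k * mu), 1 / k)
    for k in (Fraction(1), Fraction(3, 2), Fraction(12, 7))
]
for k, probability, bound in checks:
    print(k, probability, bound)
```

The bound holds in every row but is far from tight here, which is typical for an inequality that uses nothing but the mean.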

2.9.2. Chebyshev

2.9.3. Hoeffding

2.9.4. Jensen

As mentioned before, this says that E(g(X))≥g(E(X)) for g convex, and the opposite inequality holds for g concave. The only functions both convex and concave are affine linear functions, so the linearity of the expectation, E(aX+c)=aE(X)+c, can be considered a special case of Jensen’s inequality.

2.9.5. Cauchy-Schwarz

It turns out my vector space analogy is more than an analogy. The Cauchy-Schwarz inequality says that, if ⟨⋅,⋅⟩ is an inner product on a vector space, then

$$|\langle x,y\rangle|\le\|x\|\,\|y\|$$

Take RVs as vectors, define ⟨X,Y⟩:=E(XY) (easy to verify that this is an inner product), and we obtain

$$|E(XY)|\le\sqrt{E(X^2)\,E(Y^2)}$$

which is identical to $|\operatorname{Cov}(X,Y)|\le\sigma_X\sigma_Y$ if X and Y are centered. The book presents this inequality with E(|XY|) (which is at least as large as |E(XY)| because |x| is convex) on the left side. I don’t know how one shows that this stronger inequality also holds.
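One possible route to the stronger version (a sketch, not necessarily the book’s argument): apply the inequality above to the RVs |X| and |Y| instead of X and Y. They live in the same vector space, and since |XY|=|X||Y| and |X|²=X², this gives

```latex
E(|XY|) = \langle |X|, |Y| \rangle
        \le \sqrt{E(|X|^2)\, E(|Y|^2)}
        = \sqrt{E(X^2)\, E(Y^2)}
```

No absolute value is needed on the left-hand side because E(|XY|) is already non-negative.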

3. Convergence of Random Variables

3.1. Fundamentals

The limiting behavior of sequences of RVs turns out to be extremely important in statistics because it tells us what we can know in the limit of infinite data. We have three definitions of decreasing strength (meaning that each implies the next) for convergence. All three translate what it means for a sequence of RVs to converge to a statement about what it means for numbers to converge.

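$(X_1,\dots,X_n)\xrightarrow{qm}X$ iff $E((X-X_n)^2)\xrightarrow{n\to\infty}0$. (Convergence in quadratic mean.)

$(X_1,\dots,X_n)\xrightarrow{p}X$ iff $\forall\epsilon>0:\Pr(|X-X_n|>\epsilon)\xrightarrow{n\to\infty}0$. (Convergence in probability.)

$(X_1,\dots,X_n)\rightsquigarrow X$ iff $F_n(x)\xrightarrow{n\to\infty}F(x)$ at all points x at which F is continuous (where $F_n$ is the cdf of $X_n$ and F is the cdf of X). (Convergence in distribution.)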

Convergence in distribution doesn’t imply convergence in probability: here is an example where it matters that a RV isn’t determined by its distribution. For X symmetric around 0 (and not constant), (−X,...,−X) “converges” in distribution to X, but it sure doesn’t converge in probability.

Convergence in probability doesn’t imply convergence in quadratic mean: you can construct a RV with increasingly slim chances for increasingly large profits. Let $X_n$ be the RV for the game “flip n coins, win $2^n$ dollars if all come up heads” and we get a counter-example (where each $X_i$ has expected value 1, but $(X_1,\dots,X_n)$ converges in probability to the RV that is constant 0).
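The coin game can be tabulated exactly. In this sketch, `game` is a hypothetical helper that returns the three quantities relevant to the two notions of convergence, with the constant-0 RV as the candidate limit.

```python
from fractions import Fraction

def game(n):
    """Exact facts about X_n: win 2**n dollars with probability 2**-n, else 0."""
    p_win = Fraction(1, 2**n)
    payout = 2**n
    return {
        "P(X_n != 0)": p_win,             # goes to 0: convergence in probability to 0
        "E(X_n)": p_win * payout,         # stays at 1
        "E(X_n^2)": p_win * payout**2,    # equals E((0 - X_n)^2) and grows like 2**n
    }

for n in (1, 5, 10):
    print(n, game(n))
```

The probability of seeing anything other than 0 vanishes, while the quadratic distance to 0 blows up.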

3.2. Theorems

The two important theorems are called the Law of Large Numbers and the Central Limit theorem. Let $(X_1,\dots,X_n)$ be a sequence of i.i.d. RVs. Both theorems are making a statement about the “sample mean”, $\overline{X}_n:=\frac{1}{n}\sum_{i=1}^{n}X_i$, of the distribution.

3.2.1. The Weak Law of Large Numbers

This theorem says that $(\overline{X}_1,\dots,\overline{X}_n)\xrightarrow{p}C_\mu$, where $C_\mu\equiv\mu$ is the RV that’s always μ, and $\mu=E(X_i)$ (since the $X_i$ are identically distributed, they all have the same mean). I.e., if one averages across more and more samples, the variance goes to zero.

3.2.2. The Central Limit theorem

I used to be confused about this theorem, thinking that it said the following:

$$\overline{X}_n\rightsquigarrow Z\quad\text{where}\quad Z\sim N\!\left(\mu,\frac{\sigma^2}{n}\right)$$

This equation (which is not the central limit theorem) appears to say that $\overline{X}_n$ converges in distribution toward a normal distribution that has increasingly small variance. However, the normal distribution with increasingly small variance just converges toward the constant distribution. Also, we already know from the Law of Large Numbers that $\overline{X}_n$ converges in probability, and hence also in distribution, toward $C_\mu$.

… in fact, the law as stated is not just wrong but ill-defined because Z is not a fixed distribution (it depends on n), so $\overline{X}_n$ “converges” toward a moving target.

The proper Central Limit theorem is as follows: if $V(X_i)=\sigma^2\in\mathbb{R}$, then

$$\sqrt{n}\,(\overline{X}_n-\mu)\rightsquigarrow Z\quad\text{where}\quad Z\sim N(0,\sigma^2)$$

That is, the distribution of $\overline{X}_n$ does converge toward a constant, but if we center it and scale it up by $\sqrt{n}$ (and hence scale up its variance by n), then it converges toward a normal distribution (with non-moving variance).

In practice, the useful part of the Central Limit Theorem is that we can approximate $\sqrt{n}\,(\overline{X}_n-\mu)$ by Z even if n is a constant. (Otherwise, it’s unclear why we should care about $\sqrt{n}\,(\overline{X}_n-\mu)$, and we already know the limiting behavior of $\overline{X}_n$.) This technically doesn’t follow from the theorem as-is since it makes no statements about how fast $\sqrt{n}\,(\overline{X}_n-\mu)$ converges, but there is also a quantitative version. And for a constant n, the formulations $\sqrt{n}\,(\overline{X}_n-\mu)\approx N(0,\sigma^2)$ and $\overline{X}_n\approx N(\mu,\frac{\sigma^2}{n})$ and $\sqrt{n}\,\overline{X}_n\approx N(\sqrt{n}\mu,\sigma^2)$ are indeed equivalent.

While convergence in distribution doesn’t imply convergence in probability in general, it does do so if the limiting RV is a constant. Thus, the Weak Law of Large Numbers would immediately follow from the Central Limit theorem, except that the Weak Law of Large Numbers also applies in cases where the $X_i$ have infinite variance. An example of a RV with finite mean but infinite variance is the following game: “flip a coin until it gets tails; receive $\sqrt{2^n}$ dollars where n is the number of heads”.

A consequence of the ‘practical’ version of the central limit theorem is that the binomial distribution can be approximated by a normal distribution. Let X∼Binom(n,p). Then $X=\sum_{i=1}^{n}X_i$ with $X_i\sim\text{Bernoulli}(p)$. Right now, X just sums up these variables without dividing by n, but it’s trivial to verify that $X=n\cdot\overline{X}_n=\sqrt{n}\,(\sqrt{n}\,\overline{X}_n)$. Thus, since $\sqrt{n}\,\overline{X}_n\approx N(\sqrt{n}p,\,p(1-p))$ (here I’ve used that μ=p and σ²=p(1−p) for Bernoulli variables), we have X≈N(np,np(1−p)).

Here’s an example. Note that, if Z∼N(0,1) (it’s common practice to write everything in terms of a standard normal RV and call that Z), then $\frac{X-np}{\sqrt{np(1-p)}}\approx Z$. Thus, if we want to know the probability that at most 4900 out of 10000 (fair) coins come up heads, we can compute

$$P(X\le 4900)\approx P\!\left(Z\le\frac{4900-5000}{\sqrt{10000\cdot\frac{1}{2}\cdot\frac{1}{2}}}\right)=P(Z\le -2)\approx 0.023$$
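As a numerical sketch of this coin example, assuming the coins are fair (p=1/2): the code below evaluates the normal approximation and, for comparison, the exact binomial probability using integer arithmetic. The helper names are made up for this illustration.

```python
import math

def normal_cdf(z):
    """P(Z <= z) for a standard normal Z, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def binomial_cdf_fair(n, k_max):
    """Exact P(X <= k_max) for X ~ Binomial(n, 1/2)."""
    term, total = 1, 0  # term holds C(n, k), starting at k = 0
    for k in range(k_max + 1):
        total += term
        term = term * (n - k) // (k + 1)  # C(n, k+1) from C(n, k), exact division
    return total / 2**n

n, p, k_max = 10_000, 0.5, 4900
z = (k_max - n * p) / math.sqrt(n * p * (1 - p))
approx = normal_cdf(z)
exact = binomial_cdf_fair(n, k_max)
print(z, approx, exact)
```

The two numbers agree to within about a thousandth. Most of the remaining gap comes from approximating a discrete distribution by a continuous one without a continuity correction.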

^{[1]}The mean and variance of a binomial distribution are np and np(1−p), respectively. That means one parameter is enough to control them both, and we can see both of those terms in the formula above. However, it turns out you can also tweak this distribution so that they come apart.

For a proper normal distribution, we write X∼N(μ,σ2) and have the equation

f(x)=1√2πσe−12(x−μσ)2

and this is, by many measures, the most important distribution. Unfortunately, it’s also one of the hardest to compute probabilities with analytically.

## 2.2.6. Uniform

With constant pdf f(x)=1b−a in [a,b], and f(x)≡0 for x∉[a,b]. We write X∼Uniform(a,b)

## 2.2.7. Beta

With pdf f(x)=Γ(α+β)Γ(α)Γ(β)xα−1(1−x)β−1 where Γ is the Gamma function. I’m yet to need the explicit formula for Γ.

Among other things, the beta distribution can describe

the distribution of the k-th highest number across n RVs sampled with the uniform distribution

if X∼Bernoulli(p) with p unknown, the pdf of p after doing a Bayesian update on samples of X, provided that the prior of p was either uniform or itself a Beta distribution.

## 2.2.8. Exponential

## 2.2.9. Gamma

## 2.2.10. Cauchy

## 2.3. Statistical Functionals

A statistical functional is an operator from the distribution of one or more random variable(s) to R. In this case, we consider only statistical functionals that return a single real number that provides some information about the distribution. Statistical functionals aren’t functions because you can’t define their domain, but I believe they are what’s called class functions in set theory.

You might think that only the expected value and variance are important, but in fact, there are a surprising number of relevant statistical functionals. For most of them, closed formulas expressing their results in terms of the distribution’s parameters are known (and in many cases, deriving them isn’t too difficult).

## 2.3.1. Expectation

A possible definition for the expected value is

E(X):=∑ω∈ΩP(ω)⋅X(ω)

(This also has a continuous version). Of course, as I’ve gone over before, no one thinks in terms of ‘events from the sample space and their probability’, but only ‘possible outcomes of X and their probability’. Thus, we instead have the formula

$$E(X) = \begin{cases} \int_{X(\Omega)} x f(x)\,dx & X \text{ is continuous} \\ \sum_{x \in X(\Omega)} x f(x) & X \text{ is discrete} \end{cases}$$

One can write R instead of X(Ω), but I like this version as a reminder that we’re looking at image points.

An advantage of this formula is the following result:

$$E(g(X)) = \begin{cases} \int_{X(\Omega)} g(x) f(x)\,dx & X \text{ is continuous} \\ \sum_{x \in X(\Omega)} g(x) f(x) & X \text{ is discrete} \end{cases}$$

Again, this works because we already think in terms of the function that gives the probability for results. For the original expectation, we ask “for each outcome x, multiply x with the probability of x”; now, we ask “for each outcome x, multiply the transformed outcome g(x) with the probability of x”.
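As a small sketch of the discrete case, here is the "multiply the transformed outcome by the probability of the outcome" recipe for a fair die. The names `die_pmf` and `expectation` are my own, and exact fractions are used so nothing is rounded.

```python
from fractions import Fraction

# pmf of a fair six-sided die: each outcome x in X(Omega) has probability 1/6.
die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E(g(X)) for a discrete RV given as {outcome: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


mean = expectation(die_pmf)                          # E(X)
mean_square = expectation(die_pmf, lambda x: x * x)  # E(X^2)
print(mean, mean_square)
```

Note that $E(X^2)$ is computed without ever working out the distribution of $X^2$; only the pmf of $X$ is needed.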

One often writes $\mu_X$ or just $\mu$ for $E(X)$.

The equation E(XY)=E(X)E(Y) holds whenever X and Y are independent. On the other hand, the equation E(X+Y)=E(X)+E(Y) always holds, and I find that people (in everyday life) tend to make arguments that directly contradict this rule. One example is the argument that we should be indifferent between two actions because most of the relevant variables are unknown. (The known ones already increase the expected value of one action over the other, unless you specifically suspect the unknown ones to skew into the other direction.) A related argument is that high error bars/a possibility of failure implies that taking an action isn’t worth it.
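To make the difference between the two rules concrete, here is a tiny exact check on a made-up pair of dependent RVs: $X$ is the face of a fair die and $Y = 7 - X$ is the opposite face. This is only an illustration under those assumptions.

```python
from fractions import Fraction

outcomes = range(1, 7)   # fair die, each face has probability 1/6
p = Fraction(1, 6)

# X is the face shown; Y = 7 - X is the opposite face, so Y is fully determined by X.
e_x = sum(p * x for x in outcomes)
e_y = sum(p * (7 - x) for x in outcomes)
e_sum = sum(p * (x + (7 - x)) for x in outcomes)
e_prod = sum(p * (x * (7 - x)) for x in outcomes)

print(e_sum == e_x + e_y)   # additivity holds despite the dependence
print(e_prod == e_x * e_y)  # the product rule fails for these dependent RVs
```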

## 2.3.2. Variance

One thing I would do if I were teaching a class on probability is to ask students how the ‘variance’ of a RV should be defined before revealing the textbook definition. One might reasonably guess $E(|X - E(X)|) = E(|X - \mu|)$, but mathematicians have an immortal preference for squaring over taking absolute values, so it’s $E((X-\mu)^2)$ instead.^{[2]}

One can prove that $V(X) = E(X^2) - E(X)^2$.

## 2.3.3. Standard Deviation

This is just the square root of the variance. One often writes $\sigma_X$ or just $\sigma$ for $\sqrt{V(X)}$.
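A quick sanity check on a fair die, as a sketch: it computes the variance once from the definition $E((X-\mu)^2)$ and once via $E(X^2) - E(X)^2$, then takes the square root to get the standard deviation. The helper names are illustrative only.

```python
import math
from fractions import Fraction

die_pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expectation(pmf, g=lambda x: x):
    """E(g(X)) for a discrete RV given as {outcome: probability}."""
    return sum(g(x) * p for x, p in pmf.items())


def variance(pmf):
    """V(X) = E((X - mu)^2), straight from the definition."""
    mu = expectation(pmf)
    return expectation(pmf, lambda x: (x - mu) ** 2)


mu = expectation(die_pmf)
var_definition = variance(die_pmf)
var_shortcut = expectation(die_pmf, lambda x: x * x) - mu ** 2
std_dev = math.sqrt(var_definition)  # sigma is the square root of the variance
print(var_definition, var_shortcut, std_dev)
```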

## 2.3.4. Median

The median is the number at the center of the distribution, i.e., the $x$ such that $F(x) = \frac{1}{2}$. Since such an $x$ may not exist if $F$ is discrete, we define it as $\inf\{x \in \mathbb{R} \mid F(x) > \frac{1}{2}\}$. This means we round up, so that the “median” for $X \sim \text{Bernoulli}(\frac{1}{2})$ is 1.
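Here is a minimal sketch of the round-up median for a discrete pmf, assuming the convention $\inf\{x \mid F(x) > \frac{1}{2}\}$ (for a step-function cdf this is the smallest outcome whose cumulative probability exceeds one half). The function name `median_round_up` is hypothetical.

```python
from fractions import Fraction


def median_round_up(pmf):
    """Smallest outcome x with F(x) > 1/2, for a discrete pmf {outcome: probability}."""
    cumulative = 0
    for x in sorted(pmf):
        cumulative += pmf[x]
        if cumulative > Fraction(1, 2):
            return x
    raise ValueError("probabilities do not sum to more than 1/2")


fair_coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
fair_die = {x: Fraction(1, 6) for x in range(1, 7)}
print(median_round_up(fair_coin), median_round_up(fair_die))
```

For the fair coin, $F(0)$ is exactly one half, so the strict inequality pushes the result up to 1; the fair die behaves the same way at 3 versus 4.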

## 2.3.5. Mode

The mode is the most likely single point of the distribution. It’s easy to see that an arbitrarily small change to the distribution can have an arbitrarily large impact on the mode, making it much less informative than the mean or median. It’s defined as $\operatorname{argmax}_{x \in \mathbb{R}} f(x)$ where $f$ is the pdf or pmf. There are distributions for which no closed formula for the mode is known.

## 2.3.6. Covariance

One often sees the term X−μ, which is the RV obtained by shifting X so that it now centers around 0 and is hence called “centered”. The Covariance of two RVs X and Y is defined as the expected product of two points from their centered distributions, i.e.,

$$\mathrm{Cov}(X,Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

## 2.3.7. Correlation

The Covariance tells us whether two data sets are positively or negatively correlated. However, it does not tell us *how much* they’re correlated because the Covariance increases with the variance of either variable. (For example, $\mathrm{Cov}(X,2Y) = 2\,\mathrm{Cov}(X,Y)$, but the thing we mean by ‘correlation’ should stay the same if we rescale one of the RVs.) We can obtain the correlation by normalizing the Covariance, i.e.,

$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$$

Thus, the Covariance is like the inner product $\langle x,y \rangle$ on a vector space, and the Correlation like $\frac{\langle x,y \rangle}{\|x\|\,\|y\|}$. The correlation ranges between $-1$ and $1$.
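The rescaling point can be checked numerically. The sketch below uses made-up paired observations, treated as an equally weighted joint distribution; doubling one variable doubles the covariance but leaves the correlation alone.

```python
import math

# Hypothetical paired observations, treated as an equally weighted joint distribution.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0]


def mean(values):
    return sum(values) / len(values)


def covariance(a, b):
    """Cov(A, B) = E[(A - mu_A)(B - mu_B)] over equally likely pairs."""
    mu_a, mu_b = mean(a), mean(b)
    return mean([(u - mu_a) * (v - mu_b) for u, v in zip(a, b)])


def correlation(a, b):
    """Corr(A, B) = Cov(A, B) / (sigma_A * sigma_B)."""
    return covariance(a, b) / math.sqrt(covariance(a, a) * covariance(b, b))


doubled_ys = [2 * y for y in ys]
print(covariance(xs, ys), covariance(xs, doubled_ys))    # covariance doubles
print(correlation(xs, ys), correlation(xs, doubled_ys))  # correlation is unchanged
```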

## 2.3.8. Skewness

The skewness measures the asymmetry of a distribution. Symmetrical distributions such as $N(0,\sigma^2)$ or the uniform distribution have zero skewness.

$$\kappa(X) = \int \left(\frac{x-\mu}{\sigma}\right)^3 f(x)\,dx$$

## 2.3.9. Entropy

As I understand it, the entropy measures the expected amount of information obtained by sampling the distribution once. The entropy is zero if and only if the RV is a constant. For a Bernoulli variable, the entropy is maximal at $p = 0.5$, where it equals 1 (measured in bits, i.e., with a base-2 logarithm).

In the discrete case with pmf f, we have

$$H(X) = -\sum_{y \in X(\Omega)} f(y) \log f(y)$$

where the base of the logarithm determines whether we measure the information in bits or something else. As of now, I don’t fully understand how/why this definition measures information. Non-discrete numbers of bits are weird.
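A minimal sketch of the discrete formula in bits, assuming a base-2 logarithm and the usual convention that zero-probability outcomes contribute nothing. The function names are illustrative.

```python
import math


def entropy_bits(probabilities):
    """H(X) = -sum f(y) log2 f(y), skipping zero-probability outcomes (0 log 0 := 0)."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def bernoulli_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    return entropy_bits([p, 1 - p])


for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, bernoulli_entropy(p))
```

The loop shows the shape of the curve: zero at the two constant cases, symmetric around $p = 0.5$, and largest there.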

## 2.3.10. Moments

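The $k$-th moment of a RV $X$ is defined as

$$\int_{-\infty}^{\infty} x^k f(x)\,dx$$

The $k$-th central moment is defined as

$$\int_{-\infty}^{\infty} (x-\mu)^k f(x)\,dx$$

The $k$-th standardized moment is defined as

$$\int_{-\infty}^{\infty} \left(\frac{x-\mu}{\sigma}\right)^k f(x)\,dx$$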

The expectation is the first moment, the variance is the second central moment, and the skewness is the third standardized moment. I still don’t quite know how to think about moments.
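The three definitions are easy to line up in code. This sketch uses the discrete analogue (sums over a pmf instead of integrals over a pdf) on a made-up, right-skewed distribution; all names are my own.

```python
import math


def raw_moment(pmf, k):
    """k-th moment: E(X^k)."""
    return sum(p * x ** k for x, p in pmf.items())


def central_moment(pmf, k):
    """k-th central moment: E((X - mu)^k)."""
    mu = raw_moment(pmf, 1)
    return sum(p * (x - mu) ** k for x, p in pmf.items())


def standardized_moment(pmf, k):
    """k-th standardized moment: E(((X - mu) / sigma)^k)."""
    sigma = math.sqrt(central_moment(pmf, 2))
    return central_moment(pmf, k) / sigma ** k


# A made-up, right-skewed pmf on {0, 1, 10}.
pmf = {0: 0.5, 1: 0.4, 10: 0.1}
print(raw_moment(pmf, 1))           # expectation
print(central_moment(pmf, 2))       # variance
print(standardized_moment(pmf, 3))  # skewness (positive here: long right tail)
```

By construction, the first central moment is always 0 and the second standardized moment is always 1, which is why those two never get their own names.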

## 2.4. Independence of RVs

## 2.5. Multivariate Distributions

## 2.5.1. Marginal Distribution

## 2.6. Conditional Distributions

## 2.7. Transformations of RVs

## 2.8. Conditional Expectation and Variance

Given RVs $X$ and $Y$, the function $g$ defined by $g(y) := E(X[Y=y])$ is a RV since

$$g: \Omega \to \mathbb{R}, \qquad g: \omega \mapsto E(X[Y=\omega])$$

Thus, it makes sense to compute $E(g)$ or $V(g)$. Most people don’t insist on writing it this way, so one finds just $E(X|Y)$ instead (yuck!). We have the formula

$$E(g) \text{ “=” } E(E(X|Y)) = E(X)$$

which feels like an extension of the law of total probability. Furthermore, we can define $h$ by $h(y) := V(X[Y=y])$. In this case, we have the formula

$$V(X) = V(g) + E(h) \text{ “=” } V(E(X|Y)) + E(V(X|Y))$$

I’m assuming this would make sense if one were to spend enough time thinking about it.
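Both formulas can at least be verified exactly on a small example. The joint pmf below is made up; $g$ and $h$ are computed from the conditional distribution of $X$ for each value of $Y$, and fractions keep the arithmetic exact.

```python
from fractions import Fraction

# Made-up joint pmf of (X, Y): keys are (x, y), values are probabilities summing to 1.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (4, 1): Fraction(1, 4),
}


def mean_var(weighted):
    """Mean and variance of a list of (value, probability) pairs."""
    mu = sum(v * p for v, p in weighted)
    return mu, sum((v - mu) ** 2 * p for v, p in weighted)


ys = sorted({y for _, y in joint})
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ys}

# g(y) = E(X | Y = y) and h(y) = V(X | Y = y), from the conditional pmf of X given Y = y.
g, h = {}, {}
for y in ys:
    conditional = [(x, p / p_y[y]) for (x, b), p in joint.items() if b == y]
    g[y], h[y] = mean_var(conditional)

e_x, v_x = mean_var([(x, p) for (x, _), p in joint.items()])
e_g, v_g = mean_var([(g[y], p_y[y]) for y in ys])
e_h = sum(h[y] * p_y[y] for y in ys)

print(e_x == e_g)        # E(E(X|Y)) = E(X)
print(v_x == v_g + e_h)  # V(X) = V(E(X|Y)) + E(V(X|Y))
```

One way to read the second line: the spread of $X$ splits into the spread between the group averages, $V(g)$, plus the average spread within each group, $E(h)$.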

## 2.9. Inequalities

## 2.9.1. Markov

The probability that a non-negative RV ends up being at least $k$ times its mean can be at most $\frac{1}{k}$. (If it were more likely than that, this alone would imply that the mean is greater than itself.) In symbols, $P(X \ge k\mu) \le \frac{1}{k}$. The other inequalities are much more complicated.
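The parenthetical argument can be written out in one line. This is a sketch assuming $X \ge 0$, $\mu > 0$, and $k > 0$; non-negativity is what justifies the first inequality, since dropping the part of $X$ below $k\mu$ can then only make the expectation smaller.

```latex
\mu = E(X)
\;\ge\; E\big(X \cdot \mathbf{1}_{\{X \ge k\mu\}}\big)
\;\ge\; k\mu \cdot P(X \ge k\mu)
\quad\Longrightarrow\quad
P(X \ge k\mu) \le \frac{1}{k}
```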

## 2.9.2. Chebyshev

## 2.9.3. Hoeffding

## 2.9.4. Jensen

As mentioned before, this says that $E(g(X)) \ge g(E(X))$ for $g$ convex; the opposite inequality holds for $g$ concave. The only functions both convex and concave are affine functions, so the linearity of the expectation, $E(aX+c) = aE(X)+c$, can be considered a special case of Jensen’s inequality.

## 2.9.5. Cauchy-Schwarz

It turns out my vector space analogy is more than an analogy. The Cauchy-Schwarz inequality says that, if $\langle \cdot, \cdot \rangle$ is an inner product on a vector space, then

$$|\langle x,y \rangle| \le \|x\|\,\|y\|$$

Take RVs as vectors, define $\langle X,Y \rangle := E(XY)$ (easy to verify that this is an inner product), and we obtain

$$|E(XY)| \le \sqrt{E(X^2)\,E(Y^2)}$$

which is identical to $|\mathrm{Cov}(X,Y)| \le \sigma_X \sigma_Y$ if $X$ and $Y$ are centered. The book presents this inequality with $E(|XY|)$ (which is at least as large as $|E(XY)|$ because $|x|$ is convex) on the left side. I don’t know how one shows that this stronger inequality also holds.
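One possible route to the version with $E(|XY|)$, offered as a sketch rather than as the book's argument: apply the same inner-product inequality to the RVs $|X|$ and $|Y|$, whose second moments are the same as those of $X$ and $Y$.

```latex
E(|XY|) = E(|X|\,|Y|) = \langle |X|, |Y| \rangle
\;\le\; \sqrt{E(|X|^2)\,E(|Y|^2)}
= \sqrt{E(X^2)\,E(Y^2)}
```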

## 3. Convergence of Random Variables

## 3.1. Fundamentals

The limiting behavior of sequences of RVs turns out to be extremely important in statistics because it tells us what we can know in the limit of infinite data. We have three definitions of decreasing strength (meaning that each implies the next) for convergence. All three translate what it means for a sequence of RVs to converge to a statement about what it means for numbers to converge.

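$(X_1,\dots,X_n) \xrightarrow{qm} X$ iff $E((X - X_n)^2) \xrightarrow{n \to \infty} 0$. (Convergence in quadratic mean.)

$(X_1,\dots,X_n) \xrightarrow{p} X$ iff $\forall \epsilon > 0: \Pr(|X - X_n| > \epsilon) \xrightarrow{n \to \infty} 0$. (Convergence in probability.)

$(X_1,\dots,X_n) \rightsquigarrow X$ iff $F_n(x) \xrightarrow{n \to \infty} F(x)$ at all points $x$ at which $F$ is continuous (where $F_n$ is the cdf of $X_n$ and $F$ is the cdf of $X$). (Convergence in distribution.)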

Convergence in distribution doesn’t imply convergence in probability: here is an example where it matters that a RV isn’t determined by its distribution. For $X$ symmetrical around 0 (and not constant), $(-X,\dots,-X)$ “converges” in distribution to $X$, but it sure doesn’t converge in probability.

Convergence in probability doesn’t imply convergence in quadratic mean: you can construct a RV with increasingly slim chances for increasingly large profits. Let $X_n$ be the RV for the game “flip $n$ coins, win $2^n$ dollars if all come up heads” and we get a counter-example (where each $X_i$ has expected value 1, but $(X_1,\dots,X_n)$ converges in probability to the RV that is constant 0).

## 3.2. Theorems

The two important theorems are called the Law of Large Numbers and the Central Limit theorem. Let $(X_1,\dots,X_n)$ be a sequence of i.i.d. RVs. Both theorems make a statement about the “sample mean”, $\bar{X}_n := \frac{1}{n}\sum_{k=1}^{n} X_k$, of the sequence.

## 3.2.1. The Weak Law of Large Numbers

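This theorem says that $(\bar{X}_1,\dots,\bar{X}_n) \xrightarrow{p} C_\mu$, where $C_\mu \equiv \mu$ is the RV that’s always $\mu$, and $\mu = E(X_i)$ (since the $X_i$ are identically distributed, they all have the same mean). I.e., if one averages across more and more samples, the average concentrates around $\mu$ (and, if the $X_i$ have finite variance, the variance of the average goes to zero).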
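A small simulation makes the statement tangible. This is only a sketch: it uses fair coin flips (so $\mu = 0.5$), a fixed seed for reproducibility, and arbitrary sample sizes of my choosing.

```python
import random


def sample_mean_of_coin_flips(n, rng):
    """Average of n i.i.d. Bernoulli(1/2) draws (so mu = 0.5)."""
    return sum(rng.random() < 0.5 for _ in range(n)) / n


rng = random.Random(0)  # fixed seed so the run is reproducible
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_coin_flips(n, rng))
```

A single run proves nothing, of course; the theorem is about the probability of a large deviation shrinking as $n$ grows, and the printed averages merely illustrate that.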

## 3.2.2. The Central Limit theorem

I used to be confused about this theorem, thinking that it said the following:

$$\bar{X}_n \rightsquigarrow Z \quad \text{where } Z \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

This equation (which is not the central limit theorem) appears to say that $\bar{X}_n$ converges in distribution toward a normal distribution that has increasingly small variance. However, the normal distribution with increasingly small variance just converges toward the constant distribution. Also, we already know from the Law of Large Numbers that $\bar{X}_n$ converges in probability, and hence also in distribution, toward $C_\mu$.

… in fact, the law as stated is not just wrong but ill-defined because the distribution of $Z$ depends on $n$, so $\bar{X}_n$ “converges” toward a moving target.

The proper Central Limit theorem is as follows: if $V(X_i) = \sigma^2 \in \mathbb{R}$, then

$$\sqrt{n}\,(\bar{X}_n - \mu) \rightsquigarrow Z \quad \text{where } Z \sim N(0, \sigma^2)$$

That is, the distribution of $\bar{X}_n$ does converge toward a constant, but if we center it and scale it up by $\sqrt{n}$ (and hence scale up its variance by $n$), then it converges toward a normal distribution (with non-moving variance).

In practice, the useful part of the Central Limit Theorem is that we can approximate $\sqrt{n}\,(\bar{X}_n - \mu)$ by $Z$ even if $n$ is a constant. (Otherwise, it’s unclear why we should care about $\sqrt{n}\,\bar{X}_n$, and we already know the limiting behavior of $\bar{X}_n$.) This technically doesn’t follow from the theorem as-is since it makes no statements about how fast $\sqrt{n}\,(\bar{X}_n - \mu)$ converges, but there is also a quantitative version (the Berry–Esseen theorem). And for a constant $n$, the formulations $\sqrt{n}\,(\bar{X}_n - \mu) \approx N(0, \sigma^2)$ and $\bar{X}_n \approx N(\mu, \frac{\sigma^2}{n})$ and $\sqrt{n}\,\bar{X}_n \approx N(\sqrt{n}\mu, \sigma^2)$ are indeed equivalent.
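To see the finite-$n$ approximation at work, here is a simulation sketch. The choices are all mine: Uniform(0, 1) samples (so $\mu = 1/2$ and $\sigma^2 = 1/12$), $n = 50$, 5,000 repetitions, and a fixed seed.

```python
import math
import random
import statistics


def scaled_centered_mean(n, rng):
    """One draw of sqrt(n) * (sample mean - mu) for n i.i.d. Uniform(0, 1) RVs."""
    sample_mean = sum(rng.random() for _ in range(n)) / n
    return math.sqrt(n) * (sample_mean - 0.5)


rng = random.Random(0)      # fixed seed for reproducibility
n, repetitions = 50, 5_000  # arbitrary illustration sizes
draws = [scaled_centered_mean(n, rng) for _ in range(repetitions)]

# Uniform(0, 1) has mu = 1/2 and sigma^2 = 1/12, so the draws should look like N(0, 1/12).
print(statistics.fmean(draws), statistics.pvariance(draws), 1 / 12)

# Fraction of draws within one sigma of 0; a normal distribution gives about 0.683.
sigma = math.sqrt(1 / 12)
within_one_sigma = sum(abs(d) <= sigma for d in draws) / repetitions
print(within_one_sigma)
```

The empirical variance stays near $1/12$ regardless of $n$, which is the "non-moving variance" from the theorem; the one-sigma fraction is a crude check that the shape is roughly normal.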

While convergence in distribution doesn’t imply convergence in probability in general, it does do so if the limiting RV is a constant. Thus, the Weak Law of Large Numbers would immediately follow from the Central Limit theorem, except that the Weak Law of Large Numbers also applies in cases where the $X_i$ have infinite variance. An example of a RV with finite mean but infinite variance is the following game: “flip a coin until it gets tails; receive $\sqrt{2^n}$ dollars where $n$ is the number of heads”.

A consequence of the ‘practical’ version of the central limit theorem is that the binomial distribution can be approximated by a normal distribution. Let $X \sim \text{Binom}(n,p)$. Then $X = \sum_{k=1}^{n} X_k$ with $X_k \sim \text{Bernoulli}(p)$. Right now, $X$ just sums up these variables without dividing by $n$, but it’s trivial to verify that $X = n \cdot \bar{X}_n = \sqrt{n}\,(\sqrt{n}\,\bar{X}_n)$. Thus, since $\sqrt{n}\,\bar{X}_n \approx N(\sqrt{n}\,p,\; p(1-p))$ (here I’ve used that $\mu = p$ and $\sigma^2 = p(1-p)$ for Bernoulli variables), we have $X \approx N(np,\; np(1-p))$.

Here’s an example. Note that, if $Z \sim N(0,1)$ (it’s common practice to write everything in terms of a standard normal RV and call that $Z$), then $\frac{X - np}{\sqrt{np(1-p)}} \approx Z$. Thus, if we want to know the probability that at most 4900 out of 10000 coins come up heads, we can compute

$$\Pr(X \le 4900) = \Pr(X - 5000 \le -100) = \Pr\left(\frac{X - 5000}{50} \le -2\right) \approx \Pr(Z \le -2)$$

which gives ≈0.0228 according to WolframAlpha.
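The same number can be reproduced without WolframAlpha, and compared against the exact binomial probability. This is a sketch using only the standard library; the function names are mine, and the exact sum relies on Python's arbitrary-precision integers.

```python
import math


def standard_normal_cdf(z):
    """Phi(z) = Pr(Z <= z) for Z ~ N(0, 1), via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def binomial_cdf_fair(k, n):
    """Exact Pr(X <= k) for X ~ Binom(n, 1/2), using integer arithmetic."""
    term, total = 1, 0  # term = C(n, j), starting at j = 0
    for j in range(k + 1):
        total += term
        term = term * (n - j) // (j + 1)  # C(n, j + 1) from C(n, j)
    return total / 2 ** n


normal_approx = standard_normal_cdf(-2)
exact = binomial_cdf_fair(4900, 10_000)
print(normal_approx, exact)
```

The two values agree to within a fraction of a percentage point; most of the remaining gap comes from approximating a discrete distribution by a continuous one.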

## 3.3. Convergence under Transformation

In which we study what can be said about $g(\bar{X})$, where $g$ is a function.

^{[1]} I don’t fully understand how this can be made rigorous, but it’s a known theorem so the idea is valid.

^{[2]} The former definition is known as the average absolute deviation, but regular variance is far more dominant.