Insights from “All of Statistics”: Statistical Inference
(This is the second of two posts on the textbook All of Statistics. Click here for post I.)
4. Fundamentals of Statistical Inference
Probability theory is about using distributions to derive information about outcomes. Given the assumption that a random variable $X$ follows a known distribution, we can compute probabilities of outcomes of the form $X \in A$. Statistics is about the opposite: using outcomes to derive information about distributions.
The book exclusively considers the setting where we observe independently sampled data points $x_1, \dots, x_n$ where the distribution of $X$ is unknown. It is often convenient to talk about the RV that generated the $k$-th sample point, the logical notation for which is $X_k$. The book uses upper-case letters for both (e.g., “we observe data $X_1, \dots, X_n$”), not differentiating between the RV that generates the $k$-th point (i.e., $X_k$) and the $k$-th point itself ($x_k$).
On the highest level, one can divide all inference problems into two categories: parametric inference and non-parametric inference. In parametric inference, we assume that we know the family of distributions that $X$ belongs to (Binomial, Poisson, Normal, etc.). In this case, the problem reduces to inferring the parameters that characterize distributions in that family. For example, if we observe data on traffic accidents, we may assume that $X \sim \text{Poisson}(\lambda)$, and all that’s left to estimate is $\lambda$. Conversely, in the context of non-parametric inference, one does not assume that $X$ belongs to any one family and thus has to estimate the distribution directly.
5. Bayesian Parametric Inference
5.1. The concept
Let $\theta$ be the parameter we wish to learn about. Since we don’t know its value, we treat it as a RV $\Theta$ that ranges over several possible values. Furthermore, let $x_1, \dots, x_n$ be our observed data, and let $X_1, \dots, X_n$ be the RVs that generate this data.
Bayes’ theorem is an equation relating $P(A \mid B)$ to $P(B \mid A)$. In our case, we are interested in terms of the form $P(\Theta = \theta \mid X_1 = x_1, \dots, X_n = x_n)$, and we know how to compute terms of the form $P(X_1 = x_1, \dots, X_n = x_n \mid \Theta = \theta)$ because once we fix a value of $\Theta$, the distribution of the $X_i$ is known and we can compute the required probabilities. Thus, Bayes’ theorem gives us a way to reduce a problem of statistical inference to a problem of probability theory.
Thus, we have

$$P(\Theta = \theta \mid X^n = x^n) = \frac{P(X^n = x^n \mid \Theta = \theta) \, P(\Theta = \theta)}{P(X^n = x^n)},$$

where $X^n = x^n$ abbreviates $X_1 = x_1, \dots, X_n = x_n$.
Our prior on $\Theta$ will virtually always be continuous. Let $f(\theta)$ be the pdf of this prior. The $X_i$ may be continuous or discrete. Let $f(x \mid \theta)$ be their pdf or pmf. Our formula becomes

$$f(\theta \mid x^n) = \frac{f(x^n \mid \theta) \, f(\theta)}{f(x^n)}.$$

Note that the denominator does not depend on $\theta$: once we observe the data, how likely it was to appear has no impact on how we weigh different values of $\theta$ against each other. Thus, we have

$$f(\theta \mid x^n) \propto f(x^n \mid \theta) \, f(\theta).$$

Suppose we begin by computing the right side. If $f(x \mid \theta)$ is a pmf, the result is an incorrectly scaled pdf (since we multiply a density with a product of masses, which is just a scalar). If $f(x \mid \theta)$ is a pdf, the result will be the product of a density with a product of other densities. If you think of a density as a mass multiplied by an infinitesimal, then it’s a term with an infinitesimal to the power of $n + 1$. However, one can ignore this fact entirely, compute the integral of the [non-density-function obtained by evaluating the right side above], then scale that non-density-function by the inverse of that integral, and the result will be the correct density function (I really need to study nonstandard analysis at some point). Since computing $f(x^n)$ directly is usually harder than computing this integral, this is the standard approach.
Using the fact that the $X_i$ are independently sampled, we can further rewrite the formula as

$$f(\theta \mid x^n) \propto f(\theta) \prod_{i=1}^n f(x_i \mid \theta).$$
The only open question is how to choose the prior distribution on $\Theta$. The philosophically correct thing to do is to use one’s prior beliefs about $\Theta$, but this has the issue of making methods depend on the person carrying them out. There are various methods suggested in the literature that construct analytical priors, but there is no consensus as to which one is best.
5.2. A discrete Example
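Let $X_1, \dots, X_n \sim \text{Bernoulli}(\theta)$ with $\theta$ unknown. Note that here, $\theta$ is a probability, so any reasonable prior should put zero density on values outside of $[0, 1]$. For now, assume our prior on $\Theta$ is uniform, i.e., $f(\theta) = 1$ for $\theta \in [0, 1]$.
Define $s = \sum_{i=1}^n x_i$. We have

$$f(\theta \mid x_1, \dots, x_n) \propto f(\theta) \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^s (1 - \theta)^{n - s}.$$

Usually, we would now have to compute the integral $\int_0^1 \theta^s (1 - \theta)^{n - s} \, d\theta$. However, in this case, our posterior has a Beta distribution, $\text{Beta}(s + 1, n - s + 1)$. This means we already know that the integral sums up to $\frac{\Gamma(s+1)\,\Gamma(n-s+1)}{\Gamma(n+2)}$, which means that our posterior pdf is

$$f(\theta \mid x_1, \dots, x_n) = \frac{\Gamma(n+2)}{\Gamma(s+1)\,\Gamma(n-s+1)} \, \theta^s (1 - \theta)^{n - s}.$$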
Here is a plot of (black) and (red) for and :
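The "compute the integral, then rescale" recipe from 5.1 can be checked against the Beta shortcut numerically. The following is a minimal sketch; the data values ($n = 10$, $s = 7$) and the function names are assumptions made for illustration, not taken from the book.

```python
import math

def unnormalized_posterior(theta, s, n):
    """Flat prior times Bernoulli likelihood: theta^s * (1 - theta)^(n - s)."""
    return theta**s * (1 - theta) ** (n - s)

def normalizing_integral(s, n, steps=100_000):
    """Midpoint-rule approximation of the integral over [0, 1]."""
    h = 1.0 / steps
    return h * sum(unnormalized_posterior((i + 0.5) * h, s, n) for i in range(steps))

def beta_function(a, b):
    """B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

n, s = 10, 7  # hypothetical data: 7 successes in 10 trials
numeric = normalizing_integral(s, n)
exact = beta_function(s + 1, n - s + 1)
print(numeric, exact)  # the two values should agree closely
```

Dividing `unnormalized_posterior` by either value gives the posterior pdf; the numeric route is the one that still works when the posterior is not a known distribution.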
5.3. A continuous Example
Let , and let be samples of . Assume we have prior for . Let and suppose that . Then,
which means we need to scale by , and our posterior pdf is
Here is a plot of (black) and (red):
The book mentions three distinct use cases for statistical inference:
Point-estimation: provide a best guess for a specific object, like the parameter of interest.
Confidence intervals: give an interval such that we can say with a given certainty that the parameter is in this interval. This is often reported as ‘the margin of error’. If a pollster reports that 27% of people like something and the margin of error is 3 points, this probably means that the true proportion lies in $[0.24, 0.30]$ with 95% confidence. There is nothing special about 95%, but it’s the convention.
Hypothesis testing, which we’ll cover in chapter 7.
To obtain a point estimate, one may take the mean, median, or mode of the posterior distribution. To obtain a confidence interval, suppose we want certainty $1 - \alpha$. Then, we need only find numbers $a < b$ such that

$$\int_a^b f(\theta \mid x_1, \dots, x_n) \, d\theta = 1 - \alpha.$$

If we want, we can always choose $a$ and $b$ such that our point estimate of $\theta$ is exactly in the middle of the interval $[a, b]$.
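As a rough sketch of how these summaries can be read off a posterior, the code below discretizes the posterior from the discrete example (flat prior, Bernoulli data) on a grid and extracts the mean, median, mode, and an interval. The data values, the grid approach, and the choice of an equal-tailed interval are assumptions for illustration; other choices of $a$ and $b$ are equally valid.

```python
def posterior_grid(s, n, steps=10_000):
    """Discretize the posterior theta^s (1-theta)^(n-s) on midpoints of [0, 1]."""
    h = 1.0 / steps
    thetas = [(i + 0.5) * h for i in range(steps)]
    weights = [t**s * (1 - t) ** (n - s) for t in thetas]
    total = sum(weights)
    probs = [w / total for w in weights]  # probability mass per grid cell
    return thetas, probs

def summarize(thetas, probs, certainty=0.95):
    """Return posterior mean, median, mode, and an equal-tailed interval (a, b)."""
    mean = sum(t * p for t, p in zip(thetas, probs))
    mode = thetas[max(range(len(probs)), key=probs.__getitem__)]
    tail = (1 - certainty) / 2
    cumulative, a, median, b = 0.0, None, None, None
    for t, p in zip(thetas, probs):
        cumulative += p
        if a is None and cumulative >= tail:
            a = t
        if median is None and cumulative >= 0.5:
            median = t
        if b is None and cumulative >= 1 - tail:
            b = t
    return mean, median, mode, (a, b)

thetas, probs = posterior_grid(s=7, n=10)  # hypothetical data: 7 successes in 10 trials
mean, median, mode, (a, b) = summarize(thetas, probs)
print(mean, median, mode, (a, b))
```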
6. Frequentist Methods
6.1. The concept
The prevalent school of thought in statistics dictates that the unknown parameter $\theta$ is not a RV but a constant. Frequentists therefore reject Bayesian methods, as these make probability statements about $\theta$. Instead, Frequentist methods deal in terms of what are called estimators. An estimator is an arbitrary function from the data that is meant to estimate a certain quantity (such as a parameter of interest). Then, there are rigorous analytical methods to evaluate estimators.
There are a bunch of commonly used estimators that are applicable to many problems. However, for any specific problem, one may always construct a nonstandard estimator.
In this chapter, we use $\Theta$ to denote a space of possible values for the unknown parameter, and we denote the parameter itself as $\theta$. (We don’t need a symbol for the variable in the pdf of $\theta$ because it doesn’t have a pdf.) Also, we can no longer condition on $\theta$ having a certain value, so we write $f(x; \theta)$ instead of $f(x \mid \theta)$, and think of $f$ as denoting a family of probability densities with $f(\cdot\,; \theta)$ being one of them, rather than as one density that we can update on the event $\Theta = \theta$.
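Suppose we observe samples $x_1, \dots, x_n$ from some distribution parametrized by $\theta$. Let $\hat\theta$ be an estimator of $\theta$. Since $\hat\theta$ is a function of the data and we can view each data point $x_i$ as generated by a random variable $X_i$, we can view $\hat\theta$ itself as a random variable that is some function of the $X_i$. For example, if we wish to estimate the mean of the distribution, then $\hat\theta = \frac{1}{n} \sum_{i=1}^n X_i$ is the obvious choice. This means we can compute

$$\mathbb{E}_\theta\big[(\hat\theta - \theta)^2\big],$$

which is called the Mean Squared Error. Note that computing the Mean Squared Error will yield a term involving $\theta$, something like $\frac{\theta(1 - \theta)}{n}$ (which is what one gets for the sample mean of $\text{Bernoulli}(\theta)$ samples).
If $\mathbb{E}_\theta[\hat\theta] = \theta$, the Mean Squared Error equals the variance $\mathbb{V}_\theta(\hat\theta)$. In this case,

$$\text{bias}(\hat\theta) = \mathbb{E}_\theta[\hat\theta] - \theta$$

is 0, and we call the estimator unbiased.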
At first glance, being unbiased sounds like a good thing. It took me a while to understand exactly why, from a Bayesian perspective, bias is desirable:
The obvious Bayesian point estimate for $\theta$ is the mean of the posterior distribution.
For this estimate to be unbiased, it would have to be the case that, if we condition on $\theta = 5$, in expectation (that’s expectation over the generated data given that $\theta = 5$) the posterior distribution of $\Theta$ centers around 5.
However, a proper Bayesian update (even on data that is typical for the case that $\theta = 5$) only moves the distribution some portion of the way toward the evidence, not the entire way.
Thus, the posterior distribution’s mean will be closer to 5 than that of the prior distribution, but it won’t be all the way there.
In other words, Bayes estimators are systematically biased toward the prior distribution, whereas Frequentist methods do not include a prior and hence can be unbiased. A recurring theme is that Frequentist methods tend to look like (or even be identical to) Bayesian methods with flat priors.
6.3. An Example
Consider again the case of a continuous RV $X \sim \text{Uniform}(0, \theta)$ with $\theta$ unknown, where we have samples $x_1, \dots, x_n$ of $X$. In this case, any function that maps the data to an estimated value of $\theta$ is a valid estimator.
Here are two estimators that make intuitive sense:
$\hat\theta_1 = \max\{x_1, \dots, x_n\}$ (if $n$ is large, then the maximum should be close to the largest sample)
$\hat\theta_2 = 2 \cdot \frac{1}{n} \sum_{i=1}^n x_i$ (the maximum should be about twice the mean)
In this case, the first estimator is biased ($\hat\theta_1$ is, in fact, guaranteed to be smaller than $\theta$), whereas the second estimator is unbiased. Which estimator is better?
At first, this seems completely unclear. You could now do two things. One is to compute the Mean Squared Error (this can be done exactly). The other is to take a look at what the Bayesian estimator was doing. In particular, we can see that the posterior distribution is solely determined by the maximum of the $x_i$. This suggests that $\hat\theta_1$ is better, which is indeed true. (It can be scaled by $\frac{n+1}{n}$ to make it unbiased, but even the biased version is superior if $n$ is large.)
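A quick simulation can make the comparison concrete. This is a sketch under assumed settings ($\theta = 1$, $n = 20$, a fixed seed); the function name is hypothetical. For reference, the exact Mean Squared Errors are $\frac{2\theta^2}{(n+1)(n+2)}$ for the maximum and $\frac{\theta^2}{3n}$ for twice the mean.

```python
import random

def simulate_mse(theta, n, trials=20_000, seed=0):
    """Estimate the MSE of three estimators of theta for Uniform(0, theta) samples."""
    rng = random.Random(seed)
    se_max = se_mean = se_scaled = 0.0
    for _ in range(trials):
        xs = [rng.uniform(0, theta) for _ in range(n)]
        est_max = max(xs)                   # biased: always below theta
        est_mean = 2 * sum(xs) / n          # unbiased: twice the sample mean
        est_scaled = (n + 1) / n * est_max  # maximum rescaled to be unbiased
        se_max += (est_max - theta) ** 2
        se_mean += (est_mean - theta) ** 2
        se_scaled += (est_scaled - theta) ** 2
    return se_max / trials, se_mean / trials, se_scaled / trials

mse_max, mse_mean, mse_scaled = simulate_mse(theta=1.0, n=20)
print(mse_max, mse_mean, mse_scaled)
```

With these settings the biased maximum should come out well ahead of twice the mean, and the rescaled maximum ahead of both.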
6.4. Commonly used Estimators
6.4.1. Maximum Likelihood Estimator
The Maximum Likelihood Estimator is the function outputting the value for $\theta$ such that the probability mass or density for the observed data is maximized, i.e.,

$$\hat\theta_{\text{MLE}} = \arg\max_{\theta \in \Theta} \prod_{i=1}^n f(x_i; \theta).$$

Thus, the Maximum Likelihood estimator outputs the mode of the posterior distribution obtained by doing a Bayesian Update with a flat prior. In the example from 6.3, the Maximum Likelihood estimator is just $\max\{x_1, \dots, x_n\}$.
In the case where $\theta$ denotes a probability, we have $\Theta = [0, 1]$ and a flat prior is possible. In the case where $\Theta = \mathbb{R}$, a flat prior is mathematically impossible in standard analysis because $\int_{\mathbb{R}} c \, d\theta$ does not converge. It also “implies” that $P(\theta \in [a, b]) = 0$ since that interval is infinitely smaller than $\mathbb{R}$ and our prior is uniform.
6.4.2. Empirical Distribution Function
In non-parametric inference, the empirical distribution function is an estimate for the distribution that generated the data. It simply puts probability mass $\frac{1}{n}$ on every data point $x_i$. Let $\hat f_n$ be this function. It’s pretty easy to see that, for any probability mass function $g$, we have

$$\prod_{i=1}^n \hat f_n(x_i) \geq \prod_{i=1}^n g(x_i).$$
Thus, the empirical distribution function is the maximum likelihood estimator applied to the probability distribution rather than to a parameter.
6.4.3. Method of Moments
The method of moments is a method for parametric inference that is not a special case of Bayesian Inference. Recall that a statistical functional provides a single number that encodes some property of a RV. Suppose that we can estimate this number for an observed data set. Then, we can estimate the value of a parameter in two steps:
Estimate the statistical functional for the data set
Choose $\theta$ such that the statistical functional of $f(\cdot\,; \theta)$ equals that number
If we have several parameters to estimate, just do the same with several statistical functionals. This leads to a system with $k$ equations and $k$ unknowns.
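The statistical functionals we use are the moments, hence the name. Recall that the $k$-th moment is defined as

$$\alpha_k = \mathbb{E}_\theta[X^k] = \int x^k f(x; \theta) \, dx.$$

We estimate the $k$-th moment of a data set in the obvious way:

$$\hat\alpha_k = \frac{1}{n} \sum_{i=1}^n x_i^k.$$

Since the value of the $\alpha_k$’s depends on the pdf $f$, which depends on our parameters, computing the $\alpha_k$ yields a term that includes those parameters, whereas computing the $\hat\alpha_k$ just yields numbers. Thus, if we have two parameters, then

$$\alpha_1 = \hat\alpha_1, \qquad \alpha_2 = \hat\alpha_2$$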
is actually a 2 by 2 system of equations, with the variables being our two parameters.
The method of moments is not super accurate, but has the advantage of being computationally cheap.
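In the example from 6.3, $\alpha_1 = \frac{\theta}{2}$ and $\hat\alpha_1 = \frac{1}{n} \sum_{i=1}^n x_i$. Thus, the method of moments estimator is just $2 \cdot \frac{1}{n} \sum_{i=1}^n x_i$.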
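The two-step recipe can be sketched in a few lines. The uniform case follows the example from 6.3; the Normal case is an added illustration (not from the text above) of a two-parameter system, using the standard facts $\alpha_1 = \mu$ and $\alpha_2 = \sigma^2 + \mu^2$. The data values and function names are made up.

```python
def sample_moment(xs, k):
    """Estimate of the k-th moment: the average of x^k over the data."""
    return sum(x**k for x in xs) / len(xs)

def mom_uniform(xs):
    """Uniform(0, theta): the first moment is theta / 2, so theta = 2 * first moment."""
    return 2 * sample_moment(xs, 1)

def mom_normal(xs):
    """Normal(mu, sigma^2): solve alpha_1 = mu and alpha_2 = sigma^2 + mu^2."""
    a1 = sample_moment(xs, 1)
    a2 = sample_moment(xs, 2)
    return a1, a2 - a1**2  # (estimate of mu, estimate of sigma^2)

xs = [1.0, 2.0, 3.0, 6.0]  # hypothetical data
print(mom_uniform(xs))  # 6.0
print(mom_normal(xs))   # (3.0, 3.5)
```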
7. Hypothesis Testing and $p$-Values
Suppose we want to learn about some parameter $\theta$. In this case, think of $\theta$ as measuring some effect we care about, say the difference in [frequency of trait X among people with trait Y] and [frequency of trait X among all people]. Let $\Theta$ be the parameter space of $\theta$, and let $\Theta_0 \subseteq \Theta$ be the subset in which “there is no effect”. The claim that there is no effect, i.e., that $\theta \in \Theta_0$, is called the null hypothesis, and the goal of a test is to refute the null hypothesis.
Let $x_1, \dots, x_n$ be our data. Let $T$ be a function from the data to $\mathbb{R}$. Formally, this is the same as an estimator, but here we call it the test statistic. $T$ should be such that it yields a high number if there is an effect and a low number if there is not (or vice versa).
Once we evaluate $T$, we get some number $t$ that is somewhere on the less-effect-y more-effect-y axis. Let $R_t$ be the subset of results that look at least as effect-y as $t$. Here’s a visualization:
We define the $p$-value of this test as the maximum chance that the result landed in $R_t$ even though the null hypothesis was true, i.e.,

$$p = \sup_{\theta \in \Theta_0} P_\theta\big(T(X_1, \dots, X_n) \in R_t\big).$$

And that’s it; that’s a $p$-value. This is all well-defined since, once a $\theta$ is fixed, $T(X_1, \dots, X_n)$ is just a RV, and since $R_t$ is a subset of the form $[t, \infty)$ the condition can be rewritten as $T \geq t$.
In the ideal case, a $p$-value of 0.002 means “under the assumption that the effect we are testing for doesn’t exist (as defined by our choice of $\Theta_0$), there was at most a 0.2% chance for the data to show as much of an effect or more as we have observed”. Note that this is not a probability statement about $\theta$. (Frequentists don’t want to make probability statements about $\theta$ since it’s a constant.) The next chapter will work through an example of how to use $p$-values (which is also an example of where they ‘fail’, and it will illustrate why saying ‘it’s not a probability statement about $\theta$’ is more than a technicality).
In classical Hypothesis Testing, one defines the region $R$ ahead of time, evaluates $T$ on the data, and reports a binary ‘null-hypothesis rejected’ if $T \in R$ and ‘null-hypothesis not rejected’ if $T \notin R$. This is much worse since now the choice of $R$ is arbitrary, and one can always find two points $t_1, t_2$ on $\mathbb{R}$ such that $t_1$ suggests a 0.00000000000000001% stronger effect than $t_2$, yet $t_1 \in R$ and $t_2 \notin R$. Compared to this, reporting a $p$-value is amazingly informative.
8.1. … of $p$-values
Suppose we have a not-necessarily-fair coin with unknown probability $\theta$ of getting heads. We want to refute the null hypothesis ‘the coin is fair’ defined by the one-point parameter subspace $\Theta_0 = \{0.5\}$.
Here are two possible experiments to test this hypothesis.
8.1.1. EXPERIMENT ONE
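Experiment one is: “flip the coin twelve times”. Our sample statistic is the RV $T$ that counts the number of heads. Obviously, $T \sim \text{Binomial}(12, \theta)$. We obtain the result $T = 3$. The region (of all outcomes that are at least as extreme as the observed one) is $\{0, 1, 2, 3\}$. The $p$-value is thus

$$P_{0.5}(T \leq 3) = \sum_{k=0}^{3} \binom{12}{k} \, 0.5^{12} = \frac{299}{4096} \approx 0.073.$$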
8.1.2. EXPERIMENT TWO
Experiment two is: “flip the coin until it lands on heads three times”. Our sample statistic is the RV $T$ that counts the number of tosses that came up tails. We have $T \sim \text{NegativeBinomial}(3, \theta)$. (The negative binomial distribution hasn’t made it into the book or into my previous post; it’s a generalization of the Geometric distribution that counts to $r$ misses rather than to 1.) We obtain the result $T = 9$ (i.e., nine tails, three heads). The region (of all outcomes that are at least as extreme as the observed one) is $\{9, 10, 11, \dots\}$. The $p$-value is thus

$$P_{0.5}(T \geq 9) = \sum_{k=9}^{\infty} \binom{k+2}{2} \, 0.5^{k+3} = \frac{67}{2048} \approx 0.0327.$$
8.1.3. THE FAILURE
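The first experiment yielded a $p$-value of 0.073, whereas the second one yielded a $p$-value of 0.0327. To see why this is an issue, consider a Bayesian update. Let $f$ be an arbitrary prior distribution over $\Theta$ (now uppercase since we treat it as a RV). In experiment one, we have twelve samples of $X \sim \text{Bernoulli}(\Theta)$ with 3 heads and 9 tails, hence

$$f(\theta \mid x_1, \dots, x_{12}) \propto f(\theta) \, \theta^3 (1 - \theta)^9.$$

In experiment two, we have twelve samples of $X \sim \text{Bernoulli}(\Theta)$ with 3 heads and 9 tails, hence

$$f(\theta \mid x_1, \dots, x_{12}) \propto f(\theta) \, \theta^3 (1 - \theta)^9.$$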
Thus, since both experiments have observed the same data, the posterior distribution is exactly the same in both cases (provided that both use the same prior). Nonetheless, the experiments have yielded different $p$-values, and what’s worse, the expected $p$-value is also different! Given a fixed $\theta$, the second experiment systematically produces lower $p$-values, both in terms of their mean and in terms of their median.
Despite having heard lots of people rail against $p$-values in the past, I was pretty shocked by this result. I had previously thought that $p$-values are bad because they provide bad incentives and give misleading impressions. I didn’t know that they have an in-principle arbitrary component.
I don’t think I understand why $p$-values are lower in the second experiment, though. What properties of the distribution are responsible? If anyone knows, please explain.
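Both $p$-values can be reproduced exactly with rational arithmetic. The sketch below (function names are hypothetical) also checks that the two likelihoods differ only by a constant factor, which is why the Bayesian update cannot tell the experiments apart.

```python
from fractions import Fraction
from math import comb

HALF = Fraction(1, 2)

def p_value_binomial(flips, heads, p=HALF):
    """P(at most `heads` heads in a fixed number of flips)."""
    return sum(comb(flips, k) * p**k * (1 - p) ** (flips - k) for k in range(heads + 1))

def p_value_negative_binomial(heads_needed, tails, p=HALF):
    """P(at least `tails` tails before the heads_needed-th head).

    Equivalent event: the first tails + heads_needed - 1 flips contain
    at most heads_needed - 1 heads.
    """
    flips = tails + heads_needed - 1
    return sum(comb(flips, k) * p**k * (1 - p) ** (flips - k) for k in range(heads_needed))

def likelihood_ratio(theta):
    """Binomial likelihood of (3 heads in 12 flips) over negative binomial likelihood of (9 tails before 3rd head)."""
    binomial = comb(12, 3) * theta**3 * (1 - theta) ** 9
    negative_binomial = comb(11, 2) * theta**3 * (1 - theta) ** 9
    return binomial / negative_binomial

p_one = p_value_binomial(12, 3)
p_two = p_value_negative_binomial(3, 9)
print(p_one, float(p_one))  # 299/4096, about 0.073
print(p_two, float(p_two))  # 67/2048, about 0.0327
print(likelihood_ratio(Fraction(1, 3)), likelihood_ratio(Fraction(3, 4)))  # 4 and 4
```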
8.2. … of Bayesian inference
Bayesian Inference relies on Bayes’ theorem, which is a theorem. Can it really fail? We shall see. The book mentions two cases meant to showcase weaknesses of Bayesian inference.
The first one is very silly. It comes down to the following experiment where we have a googol biased coins: “randomize a number $k$ between $1$ and $10^{100}$, report the result of a flip by the $k$-th biased coin; repeat $n$ times”. Then, the book observes that, for any realistic $n$, using Bayesian inference on the unknown probabilities of the biased coins will fail since most of them are never observed. Of course, this has zero to do with Bayesian methods and everything with what we choose to update; if one estimates the total probability of getting heads, Bayesian inference does wonderfully.
The second one is much more interesting. Let be a continuous, integrable function such that . In general, is not a pdf, but is.
Suppose we know and want to estimate . One way to obtain is by computing the integral, but this may be difficult. Instead, suppose we are given points that are independently sampled according to .
Let be any prior on . We have
This expression with the prior excluded is larger the smaller is. Thus, this experiment will update toward lower values regardless of the observed data points .
… wait, what? What happened?
Here is what I think happened. The function fully determines the value of . Since we know , one cannot model as a random variable. In fact must be 1 since , hence . If anything is a logically uncertain variable à la Logical Induction. Since probability theory cannot handle logical uncertainty, the method fails.
However, it is interesting that there is an estimator that does a fine job approximating . This estimator is , where the are estimates of the probability distribution of the points based on the observed data . (The book doesn’t explain how this is done.) Thus, the concept of ‘estimator’ appears to include methods that deal with logical uncertainty.
Alas, it appears that the second example highlights a legitimate failure of Bayesian inference.
9. Statistical Decision Theory
So far, we’ve evaluated estimators by computing their Mean Squared Error. One can also do this for Bayesian methods. In the discrete example (5.2.), the posterior distribution has mean $\frac{s+1}{n+2}$ (no computations are needed to derive this result since the mean of a Beta distribution is known), so $\hat\theta = \frac{s+1}{n+2}$ is just another estimator that happens to have been derived with Bayesian methods.
However, recall that computing the Mean Squared Error yields a term involving the true parameter $\theta$. This can be considered a function of $\theta$, and thus, one can imagine two estimators whose functions cross paths, such that each of them yields the lower error for some values of $\theta$. In such a case, it’s not immediately obvious which one is preferable. Statistical Decision Theory is about making this decision.
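Let $\hat\theta$ be an estimator. As mentioned, its mean squared error $R(\theta, \hat\theta) = \mathbb{E}_\theta[(\hat\theta - \theta)^2]$ is a function of $\theta$. If we want to compare it to the mean squared error of other estimators, we need a one-point summary of this function. Here are two ways of doing this:

$$\bar R(\hat\theta) = \sup_{\theta \in \Theta} R(\theta, \hat\theta) \qquad \text{(maximum risk)}$$

$$r(f, \hat\theta) = \int R(\theta, \hat\theta) \, f(\theta) \, d\theta \qquad \text{(Bayes risk)}$$

I.e., we can either optimize for the worst case and minimize the maximum error across all possible values of $\theta$, or we can weigh the errors depending on which $\theta$ is more probable. Of course, this again requires a prior $f$ on $\theta$, hence why it’s called the Bayes risk.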
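To see two risk functions cross, one can compare the estimators $\frac{s}{n}$ and $\frac{s+1}{n+2}$ for Bernoulli data. The sketch below computes each Mean Squared Error exactly by summing over the Binomial distribution of $s$, then approximates both one-point summaries on a grid. The sample size $n = 10$, the flat prior, the grid resolution, and the function names are all assumptions for illustration.

```python
from math import comb

def risk(estimator, theta, n):
    """Exact mean squared error at theta, summing over s ~ Binomial(n, theta)."""
    return sum(
        comb(n, s) * theta**s * (1 - theta) ** (n - s) * (estimator(s, n) - theta) ** 2
        for s in range(n + 1)
    )

def mle(s, n):
    return s / n

def posterior_mean(s, n):
    return (s + 1) / (n + 2)  # mean of the Beta posterior under a flat prior

def max_risk(estimator, n, grid=200):
    """Worst-case risk over a grid of theta values in [0, 1]."""
    return max(risk(estimator, i / grid, n) for i in range(grid + 1))

def bayes_risk_flat_prior(estimator, n, grid=200):
    """Risk averaged over a flat prior on [0, 1] (midpoint rule)."""
    return sum(risk(estimator, (i + 0.5) / grid, n) for i in range(grid)) / grid

n = 10
for name, est in [("s/n", mle), ("(s+1)/(n+2)", posterior_mean)]:
    print(name, max_risk(est, n), bayes_risk_flat_prior(est, n))
# The risk functions cross: s/n wins near theta = 0, (s+1)/(n+2) wins near theta = 0.5.
print(risk(mle, 0.0, n), risk(posterior_mean, 0.0, n))
print(risk(mle, 0.5, n), risk(posterior_mean, 0.5, n))
```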
One can prove that:
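If $f$ is a proper prior, the estimator that minimizes the Bayes risk $r(f, \hat\theta)$ is the function that maps data to the mean of the posterior distribution, i.e., the estimator given by $\hat\theta(x_1, \dots, x_n) = \mathbb{E}[\Theta \mid x_1, \dots, x_n]$.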
If is a proper prior and one replaces the Mean Squared Error with the absolute loss, i.e., instead of , then it’s the median of the posterior distribution. (This was quite surprising to me. I always thought that, if one has values , then, surely, (the mean) is a better summary of this data than 3 (the median). However, if one defines ‘better’ in terms of ‘lower mean absolute distance to one of the elements’, then is better! This quickly becomes obvious if you think about it; once you’re at the median element, any step to either side increases the distance to more than half of all points and decreases the distance to fewer than half, by the same amount.)
If $f$ is a proper prior and one replaces the Mean Squared Error with the binary error (0 if $\theta$ is hit exactly and 1 otherwise), then it’s the mode of the posterior distribution, provided the distribution of $\Theta$ is discrete; if it’s continuous, the loss is 1 with probability 1 regardless of the estimator.
Thus, Bayesian methods are optimal to minimize the Bayes risk.
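The mean-versus-median point is easy to verify on a toy data set. The values below are made up for illustration (they are not the ones from the parenthetical above); the helper names are hypothetical.

```python
from statistics import mean, median

def mean_absolute_distance(values, point):
    return sum(abs(v - point) for v in values) / len(values)

def mean_squared_distance(values, point):
    return sum((v - point) ** 2 for v in values) / len(values)

values = [1, 2, 3, 4, 100]  # hypothetical data with one outlier
m, med = mean(values), median(values)  # 22 and 3

# The median wins under absolute distance ...
print(mean_absolute_distance(values, med), mean_absolute_distance(values, m))  # 20.2 31.2
# ... while the mean wins under squared distance.
print(mean_squared_distance(values, med), mean_squared_distance(values, m))    # 1883.0 1522.0
```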
Suppose a Bayes estimator $\hat\theta$ for prior $f$ yields an error that does not depend on $\theta$, i.e., an error function that is a constant. In this case, the one-point summary is just that constant, which means that the maximum risk equals the Bayes risk, $\bar R(\hat\theta) = r(f, \hat\theta)$. Thus, one can also view minimization of the maximum risk $\bar R$ as a special case of applying Bayesian methods. In this case, $f$ is called the least favorable prior.
10. Other Concepts
10.1. Score and Fisher Information
Consider a parametric inference problem. Let $\theta$ be the parameter and $f(x; \theta)$ the density function. This time, we consider $f$ to be a function of two variables $x$ and $\theta$.
The score is defined as the derivative of $\log f$ with respect to $\theta$, i.e.,

$$s(x; \theta) = \frac{\partial \log f(x; \theta)}{\partial \theta}.$$

The score is again a function of two variables. However, if we fix a $\theta$, then the function defined by $X \mapsto s(X; \theta)$ is a RV. For any point $x$, $s(x; \theta)$ measures how increasing $\theta$ affects the density that $f$ assigns to $x$.
Suppose that $\mathbb{E}_\theta[s(X; \theta)] > 0$. That would mean that increasing $\theta$ increases the overall density of $f$. However, this is impossible: $f(\cdot\,; \theta)$ is still a valid pdf, so the overall density cannot change. Thus, the expectation of the score is always zero. This can also be proven formally.
On the other hand, consider the variance of the score, $\mathbb{V}_\theta(s(X; \theta))$. We can imagine several cases:
If $f$ does not depend on $\theta$ at all, then the score is zero everywhere. In this case, its variance is also zero.
If $f$ can take drastically different forms depending on the value of $\theta$, then the score should be high in some places and low in others. In this case, the variance is high.
Thus, the variance of the score is a measure for how much the parameter $\theta$ impacts the distribution. Equivalently (I believe, I don’t know information theory), it’s a measure of how much information [samples of the distribution] reveal about $\theta$. Note that evaluating this variance for a specific $\theta$ yields a number; if $\theta$ is left as a parameter, it yields a term that includes $\theta$. The variance of the score is also called the Fisher Information, and it’s denoted as a function $I$ (of $\theta$). Wikipedia lists the Fisher information for all important distributions; for example, if , then the Fisher information is .
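As a small worked instance, here is the score of a $\text{Bernoulli}(p)$ variable, chosen as an assumption for illustration because its two outcomes allow exact computation. Its log-mass is $x \log p + (1 - x) \log(1 - p)$, so the score is $\frac{x}{p} - \frac{1 - x}{1 - p}$. The function names are hypothetical.

```python
from fractions import Fraction

def score(x, p):
    """Derivative of log f(x; p) = x log p + (1 - x) log(1 - p) with respect to p."""
    return Fraction(x) / p - Fraction(1 - x) / (1 - p)

def expected_score(p):
    """Expectation over x in {0, 1} with P(x = 1) = p."""
    return p * score(1, p) + (1 - p) * score(0, p)

def fisher_information(p):
    """Variance of the score (the mean is zero, so this is E[score^2])."""
    mu = expected_score(p)
    return p * (score(1, p) - mu) ** 2 + (1 - p) * (score(0, p) - mu) ** 2

p = Fraction(3, 10)
print(expected_score(p))      # 0
print(fisher_information(p))  # 100/21, which equals 1 / (p * (1 - p))
```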
One can prove that
If is the maximum likelihood estimator of , then the Fisher Information takes a role similar to the variance of and one can use this to prove theorems about the limit distribution of . Furthermore, in the context of Bayesian Inference, there is something called Jeffreys’ prior, which says that the prior on a parameter should be proportional to the root of its Fisher Information. This does not seem to make sense to me, but perhaps I don’t understand it.
10.2. The Bootstrap
Which is a fairly complicated method for estimating the variance of statistics.
10.3. The Jackknife
Same as above.
Which is a Monte Carlo method to estimate things that could in theory be computed but may be hard to compute, such as the marginal distribution of when the distribution of a vector is known.
10.5. Multiparameter Models for Parametric Inference
In which (who’d have guessed) parametric inference with multiple parameters is studied.
10.6. Various tests
As in ‘specific ways to do hypothesis testing’. There are also Bayesian methods here.
10.7. Admissibility
Which is a property of estimators. An estimator is inadmissible if there is another estimator that performs at least as well for all $\theta$, and better for at least one $\theta$. It’s admissible if it’s not inadmissible. Bayes’ estimators with proper priors are always admissible.
10.7.1. Stein’s Paradox
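Suppose we sample $X_i \sim N(\theta_i, 1)$ for $i = 1, \dots, k$ and want to estimate the $\theta_i$. Stein’s paradox says that the estimator $\hat\theta_i = X_i$ is inadmissible if $k \geq 3$ (and admissible for $k = 1$ and $k = 2$).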
10.8. Plug-in Estimators
The empirical distribution function gives a general way to estimate the value of any statistical functional on an unknown distribution, provided we have access to sample points $x_1, \dots, x_n$. This way is, of course, to simply compute the statistical functional of the empirical distribution function derived from $x_1, \dots, x_n$.
In the case of the mean, this leads to the obvious estimate $\bar x = \frac{1}{n} \sum_{i=1}^n x_i$ that is used in the method of moments. For the variance, it leads to the estimate $\frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2$. Interestingly, this estimate is biased. The unbiased estimate is called the sample variance and it’s defined by $\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2$. The reason for this is that any term of the form $(x_i - \bar x)^2$ is, in expectation, slightly lower than it should be because $x_i$ has contributed to the estimate $\bar x$, thus causing $\bar x$ to be closer to $x_i$ in expectation than the true mean. For example, suppose we have just two data points $x_1$ and $x_2$. If they are both below the mean, the estimate $\bar x$ will be too small, so the terms $(x_1 - \bar x)^2$ and $(x_2 - \bar x)^2$ will be too small as well. If they are both above the mean, the estimate $\bar x$ will be too large, and the terms $(x_1 - \bar x)^2$ and $(x_2 - \bar x)^2$ will again be too small. Thus, regardless of the direction into which the noise skews the data points, the plug-in estimate for the variance will underestimate the true variance, making it biased. If one does the computation, it turns out that the effect of this bias is, in expectation, precisely the factor $\frac{n-1}{n}$.
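The factor $\frac{n-1}{n}$ can be confirmed exactly on a tiny example by enumerating every possible sample. The setup below (fair coin flips coded as 0 and 1, so the true variance is $\frac{1}{4}$, with $n = 3$) is an assumption chosen to keep the enumeration small; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def plug_in_variance(xs):
    """Variance of the empirical distribution: divide by n."""
    n = len(xs)
    x_bar = Fraction(sum(xs), n)
    return sum((x - x_bar) ** 2 for x in xs) / n

def sample_variance(xs):
    """Unbiased version: divide by n - 1."""
    n = len(xs)
    x_bar = Fraction(sum(xs), n)
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def expected_value(estimator, values, n):
    """Average the estimator over all equally likely samples of size n."""
    samples = list(product(values, repeat=n))
    return sum(estimator(xs) for xs in samples) / len(samples)

n = 3
print(expected_value(plug_in_variance, (0, 1), n))  # 1/6 = (n-1)/n * 1/4
print(expected_value(sample_variance, (0, 1), n))   # 1/4, the true variance
```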