The trouble with Bayes (draft)

Prerequisites

This post requires some knowledge of Bayesian and Frequentist statistics, as well as probability. It is intended to explain one of the more advanced concepts in statistical theory—Bayesian non-consistency—to non-statisticians, and although the level required is much lower than what is needed to read some of the original papers on the topic [1], considerable background is still assumed.

The Bayesian dream

Bayesian methods are enjoying a well-deserved growth in popularity in the sciences. However, most practitioners of Bayesian inference, including most statisticians, see it as a practical tool. Bayesian inference has many desirable properties for a data analysis procedure: it allows for intuitive treatment of complex statistical models, including models with non-iid data, random effects, high-dimensional regularization, covariance estimation, outliers, and missing data. Problems which have been the subject of Ph.D. theses and entire careers in the Frequentist school, such as mixture models and the multi-armed bandit problem, can be satisfactorily handled by introductory-level Bayesian statistics.

A more extreme point of view, the flavor of subjective Bayes best exemplified by Jaynes’ famous book [2] and by a sizable contingent of philosophers of science, elevates Bayesian reasoning to the methodology for probabilistic reasoning, in every domain, for every problem. One merely needs to encode one’s beliefs as a prior distribution, and Bayesian inference will yield the optimal decision or inference.

To a philosophical Bayesian, the epistemological grounding of most statistics (including “pragmatic Bayes”) is abysmal. The practice of data analysis is either dictated by arbitrary tradition and protocol, or left to users creatively employing a “toolbox” of methods justified by a patchwork of incompatible theoretical principles: the minimax principle, invariance, asymptotics, maximum likelihood, or *gasp* “Bayesian optimality.” The result: a million possible methods exist for any given problem, and a million interpretations exist for any data set, all depending on how one frames the problem. Given one million different interpretations of the data, which one should *you* believe?

Why the ambiguity? Take the textbook problem of determining whether a coin is fair or weighted, based on the data obtained from, say, flipping it 10 times. Keep in mind, a principled approach to statistics decides the rule for decision-making before you see the data. So, what rule would you use for your decision? One rule is, “declare it’s weighted if either 10/10 flips are heads or 0/10 flips are heads.” Another rule is, “always declare it to be weighted.” Or, “always declare it to be fair.” All in all, there are 11 possible outcomes (supposing we only care about the total number of heads) and therefore 2^11 possible decision rules. We can probably rule out most of them as nonsensical, like “declare it to be weighted if 5/10 are heads, and fair otherwise,” since 5/10 seems like the fairest outcome possible. But among the remaining possibilities, there is no obvious way to choose the “best” rule. After all, the performance of the rule, defined as the probability you will make the correct conclusion from the data, depends on the unknown state of the world, i.e. the true probability of flipping heads for that particular coin.

The Bayesian approach “cuts” the Gordian knot of choosing the best rule by assuming a prior distribution over the unknown state of the world. Under this prior distribution, one can compute the average performance of any decision rule, and choose the best one. For example, suppose your prior is that with probability 99.9999%, the coin is fair. Then the best decision rule would be to “always declare it to be fair!”
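To make this concrete, here is a minimal sketch in Python. The three candidate rules are the ones from above (the rule names are mine), and since the text only specifies a 99.9999% prior on “fair”, the weighted alternative of p = 0.7 is an arbitrary stand-in of my own. The sketch shows that a rule’s performance depends on the unknown heads probability, and that averaging over a prior singles out a “best” rule.

```python
from math import comb

N = 10  # number of flips

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A decision rule maps the number of heads (0..10) to a declaration.
rules = {
    "extremes only": lambda k: "weighted" if k in (0, N) else "fair",
    "always weighted": lambda k: "weighted",
    "always fair": lambda k: "fair",
}

def prob_correct(rule, p):
    """Probability the rule declares correctly when the true heads probability is p."""
    truth = "fair" if p == 0.5 else "weighted"
    return sum(binom_pmf(k, N, p) for k in range(N + 1) if rule(k) == truth)

# The performance of each rule depends on the unknown state of the world p:
for p in (0.5, 0.7, 0.9):
    print(p, {name: round(prob_correct(r, p), 3) for name, r in rules.items()})

# A prior resolves the ambiguity.  Here: 99.9999% fair, otherwise p = 0.7
# (the weighted alternative is an arbitrary stand-in of mine).
prior = {0.5: 0.999999, 0.7: 0.000001}
bayes_avg = {name: sum(w * prob_correct(r, p) for p, w in prior.items())
             for name, r in rules.items()}
print("best rule under this prior:", max(bayes_avg, key=bayes_avg.get))
```

Under this particular prior, “always declare it to be fair” indeed comes out on top, because the tiny prior weight on a weighted coin cannot compensate for the occasional false alarm of any other rule.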

The Bayesian approach gives you the optimal decision rule for the problem, as soon as you come up with a model for the data and a prior for your model. But when you are looking at data analysis problems in the real world (as opposed to a probability textbook), the choice of model is rarely unambiguous. Hence, for me, the standard Bayesian approach does not go far enough—if there are a million models you could choose from, you still get a million different conclusions as a Bayesian.

Hence, one could argue that a “pragmatic” Bayesian who thinks up a new model for every problem is just as epistemologically suspect as any Frequentist. Only in the strongest form of subjective Bayesianism can one escape this ambiguity. The subjective Bayesian dream is to start out in life with a single model. A single prior. For the entire world. This “world prior” would contain the entirety of one’s own life experience, and the grand total of human knowledge. Surely, writing out this prior is impossible. But the point is that a true Bayesian must behave (at least approximately) as if they were driven by such a universal prior. In principle, having such a universal prior (at least conceptually) solves the problem of choosing models and priors for particular problems: the priors and models you choose are determined by the posterior of your universal prior. For example, why did you decide on a linear model for your economics data? It’s because according to your universal posterior, your particular economic data is well described by such a model with high probability.

The main practical consequence of the universal prior is that your inferences in one problem should be consistent with your inferences in another, related problem. Even if the subjective Bayesian never writes out a “grand model”, their integrated approach to data analysis across related problems still distinguishes them from the piecemeal practice of frequentists, who tend to treat each data analysis problem as if it occurred in an isolated universe. (So I claim, though I cannot point to any real example of such a subjective Bayesian.)

Yet even if the subjective Bayesian ideal could be realized, many philosophers of science (e.g. Deborah Mayo) would consider it just as ambiguous as non-Bayesian approaches: even if you have an unambiguous procedure for forming personal priors, your priors are still going to differ from mine. I don’t consider this a defect, since my worldview necessarily does differ from yours. My ultimate goal is to make the best decision for myself. That said, such egocentrism, even if rationally motivated, may indeed be poorly suited for a collaborative enterprise like science.

For me, the far more troublesome objection to the “Bayesian dream” is the question, “How would you actually go about constructing this prior that represents all of your beliefs?” Looking in the Bayesian literature, one does not find any convincing example of a user of Bayesian inference managing to encode all (or even a tiny portion) of their beliefs in the form of a prior—in fact, for the most part, we see alarmingly little thought or justification being put into the construction of priors.

Nevertheless, I myself remained one of these “hardcore Bayesians”, at least from a philosophical point of view, from the time I started learning about statistics. My faith in the “Bayesian dream” persisted even after spending three years in the Ph.D. program at Stanford (a department with a heavy bias towards Frequentism) and even after I personally started doing research on frequentist methods. (I see frequentist inference as a poor man’s approximation to the ideal Bayesian inference.) Though I was aware of the Bayesian non-consistency results, I largely dismissed them as mathematical pathologies. And while we were still a long way from achieving universal inference, I held the optimistic view that improved technology and theory might one day finally make the “Bayesian dream” achievable. However, I could not find a way to ignore one particular example on Wasserman’s blog [3], due to its relevance to very practical problems in causal inference. Eventually I thought of an even simpler counterexample, which devastated my faith in the possibility of constructing a universal prior. Perhaps a fellow Bayesian can find a solution to this quagmire, but I am not holding my breath.

The root of the problem is the extreme degree of ignorance we have about our world, how surprising many true scientific discoveries turn out to be, and the relative ease with which we accept these surprises. If we consider this behavior rational (which I do), then the subjective Bayesian is obligated to construct a prior which captures it. Yet the diversity of possible surprises the prior must be able to accommodate makes it practically impossible (if not mathematically impossible) to construct. The alternative is to reject all possibility of surprise, and refuse to update any faster than a universal prior would (extremely slowly), which strikes me as a rather poor epistemological policy.

In the rest of the post, I’ll motivate my example, sketch out a few mathematical details (explaining them as best I can to a general audience), then discuss the implications.

Introduction: Cancer classification

Biology and medicine are currently adapting to the wealth of information we can obtain from high-throughput assays: technologies which can rapidly read the DNA of an individual and measure the concentrations of messenger RNA, metabolites, and proteins. In the early days of this “large-scale” approach to biology, which began with the Human Genome Project, some optimists had hoped that such an unprecedented torrent of raw data would allow scientists to quickly “crack the genetic code.” By now, any such optimism has been washed away by the overwhelming complexity and uncertainty of human biology—a complexity which has been made clearer than ever by the flood of data—and replaced with a sober appreciation that in the new “big data” paradigm, making a discovery is a much easier task than understanding it.

Enter the application of machine learning to this large-scale biological data. Scientists take these massive datasets containing patient outcomes, demographic characteristics, and high-dimensional genetic, neurological, and metabolic data, and analyze them using algorithms like support vector machines, logistic regression and decision trees to learn predictive models to relate key biological variables, “biomarkers”, to outcomes of interest.

To give a specific example, take a look at this abstract from the Shipp et al. paper on predicting survival outcomes for cancer patients [4]:

Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50% of patients. Prognostic models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70% versus 12%). The model also effectively delineated patients within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell–receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention.

The term “supervised learning” refers to any algorithm for learning a model that predicts some outcome Y (which could be either categorical or numeric) from covariates or features X. In this particular paper, the authors used a relatively simple linear model (which they called “weighted voting”) for prediction.

A linear model is fairly easy to interpret: it produces a single “score” variable via a weighted average of a number of predictor variables, then predicts the outcome (say “survival” or “no survival”) based on a rule like, “Predict survival if the score is larger than 0.” Yet far more advanced machine learning models have been developed, including “deep neural networks”, which are winning all of the image recognition and machine translation competitions at the moment. These “deep neural networks” are especially notorious for being difficult to interpret. Along with similarly complicated models, neural networks are often called “black box models”: although you can get miraculously accurate answers out of the “box”, peering inside won’t give you much of a clue as to how it actually works.
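As a toy illustration of the generic “linear score plus threshold” structure (not the Shipp et al. weighted-voting scheme itself; the weights and expression values below are made up), a minimal sketch:

```python
import numpy as np

# Toy "linear score + threshold" classifier.  The weights and expression values
# are made up; the Shipp et al. weighted-voting scheme chooses its weights differently.
rng = np.random.default_rng(0)

n_genes = 5
weights = rng.normal(size=n_genes)        # one weight per gene (stand-in values)
X = rng.normal(size=(10, n_genes))        # expression levels for 10 hypothetical patients

scores = X @ weights                      # a single score per patient: weighted sum
predictions = np.where(scores > 0, "survival", "no survival")
print(predictions)
```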

Now it is time for the first thought experiment. Suppose a follow-up paper to the Shipp paper reports dramatically improved prediction for survival outcomes of lymphoma patients. The authors of this follow-up paper trained their model on a “training sample” of 500 patients, then used it to predict the five-year outcome of chemotherapy patients, on a “test sample” of 1000 patients. It correctly predicts the outcome (“survival” vs “no survival”) on 990 of the 1000 patients.

Question 1: what is your opinion on the predictive accuracy of this model on the population of chemotherapy patients? Suppose that publication bias is not an issue (the authors of this paper designed the study in advance and committed to publishing) and suppose that the test sample of 1000 patients is “representative” of the entire population of chemotherapy patients.

Question 2: does your judgment depend on the complexity of the model they used? What if the authors used an extremely complex and counterintuitive model, and cannot even offer any justification or explanation for why it works? (Nevertheless, their peers have independently confirmed the predictive accuracy of the model.)

A Frequentist approach

The Frequentist answer to the thought experiment is as follows. The accuracy of the model is a probability p which we wish to estimate. The number of successes on the 1000 test patients is Binomial(1000, p). Based on the data, one can construct a confidence interval: say, we are 99% confident that the accuracy is above 83%. What does 99% confident mean? I won’t try to explain, but simply say that in this particular situation, “I’m pretty sure” that the accuracy of the model is above 83%.

A Bayesian approach

The Bayesian interjects, “Hah! You can’t explain what your confidence interval actually means!” He puts a uniform prior on the probability p. The posterior distribution of p, conditional on the data, is Beta(991, 11). This gives a 99% credible interval that p is in [0.978, 0.995]. You can actually interpret the interval in probabilistic terms, and it gives a much tighter interval as well. Seems like a Bayesian victory...?
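For reference, here is a minimal sketch of that posterior computation using scipy; the equal-tailed interval it prints should roughly match the figures quoted above.

```python
from scipy.stats import beta

# Uniform prior Beta(1, 1) plus 990 successes and 10 failures out of 1000
# gives the posterior Beta(991, 11).
posterior = beta(991, 11)

lower, upper = posterior.ppf(0.005), posterior.ppf(0.995)  # equal-tailed 99% interval
print(f"99% credible interval for p: [{lower:.3f}, {upper:.3f}]")
```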

A subjective Bayesian approach

As I have argued before, a Bayesian approach which comes up with a model after hearing about the problem is bound to suffer from the same inconsistency and arbitrariness as any non-Bayesian approach. You might assume a uniform distribution for p in this problem… but what if yet another paper comes along with a similar prediction model? You would need a joint distribution for the current model and the new model. What if a theory comes along that could help explain the success of the current method? The parameter p might take on a new meaning in that context.

So as a subjective Bayesian, I argue that slapping a uniform prior on the accuracy is the wrong approach. But I’ll stop short of actually constructing a Bayesian model of the entire world: let’s say we restrict our attention to this particular issue of cancer prediction. We want to model the dynamics behind cancer and cancer treatment in humans. Needless to say, such a model is still ridiculously complicated. However, I don’t think it’s out of reach for a well-funded, large-scale collaboration of scientists.

Roughly speaking, the model can be divided into a distribution over theories of human biology and, conditional on the theory of biology, a coarse-grained model of an individual patient. The model would not include every cell, every molecule, etc., but it would contain many latent variables in addition to the variables measured in any particular cancer study. Let’s call the variables actually measured in the study X, and the survival outcome Y.

Now here is the epistemologically correct way to answer the thought experiment. Take a look at the X’s and Y’s of the patients in the training and test sets. Update your probabilistic model of human biology based on the data. Then take a look at the actual form of the classifier: it’s a function f() mapping X’s to Y’s. The accuracy of the classifier is no longer a parameter: it’s a quantity Pr[f(X) = Y] which has a distribution under your posterior. That is, for any given “theory of human biology”, Pr[f(X) = Y] has a fixed value; over the distribution of possible theories of human biology (based on the data of the current study as well as all previous studies and your own beliefs), Pr[f(X) = Y] has a distribution, and therefore an average. But what will this posterior give you? Will you get something similar to the interval [0.978, 0.995] you got from the “pragmatic Bayes” approach?

Who knows? But I would guess not. My guess is that you would get a very different interval from [0.978, 0.995], because in this complex model there is no direct link between the empirical success rate of the predictions and the quantity Pr[f(X) = Y]. But my intuition for this comes from the following simpler framework.

A non-parametric Bayesian approach

Instead of reasoning about a grand Bayesian model of biology, I now take a middle ground and suggest that while we don’t need to capture the entire latent dynamics of cancer, we should at the very least try to include the X’s and the Y’s in the model, instead of merely abstracting the whole experiment as a Binomial trial (as the frequentist and pragmatic Bayesian did). Hence we need a prior over joint distributions of (X, Y). And yes, I do mean a prior distribution over probability distributions: we are saying that (X, Y) has some unknown joint distribution, which we treat as being drawn at random from a large collection of distributions. This is therefore a non-parametric Bayes approach: the term non-parametric means that the number of parameters in the model is not finite.

Since in this case Y is a binary outcome, a joint distribution can be decomposed into a marginal distribution over X and a function g(x) giving the conditional probability that Y=1 given X=x. The marginal distribution is not so interesting or important for us, since it simply reflects the composition of the population of patients. For the purpose of this example, let us say that the marginal is known (e.g., a finite distribution over the population of US cancer patients). What we want to know is the probability of patient survival, and this is given by g(X) for the particular patient’s X. Hence, we will mainly deal with constructing a prior over the function g.

To construct a prior, we need to think of intuitive properties of the survival probability function g(x). If x is similar to x’, then we expect the survival probabilities g(x) and g(x’) to be similar. Hence the prior should be over random, smooth functions. But we need to choose the smoothness so that the prior does not consist of almost-constant functions. Suppose for now that we choose a particular class of smooth functions (e.g. functions with a certain Lipschitz norm) and choose our prior to be uniform over functions of that smoothness. We could go further and put a prior on the smoothness hyperparameter, but for now we won’t.

Now, although I assert my faithfulness to the Bayesian ideal, I still want to think about how whatever prior we choose would allow us to answer some simple thought experiments. Why is that? I hold that ideal Bayesian inference should capture and refine what I take to be “rational behavior.” Hence, if a prior produces irrational outcomes, I reject that prior as not reflecting my beliefs.

Take the following thought experiment: we simply want to estimate the expected value of Y, E[Y]. Hence, we draw 100 patients independently with replacement from the population and record their outcomes: suppose the sum is 80 out of 100. The Frequentist (and pragmatic Bayesian) would end up concluding that with high probability/confidence/whatever, the expected value of Y is around 0.8, and I would hold that an ideal rationalist would come to a similar belief. But what would our non-parametric model say? We draw random functions g(x) from the posterior, conditional on our particular observations; each instantiation of g(x) gives us a quantity E[g(X)], and the distribution of these E[g(X)]’s over the posterior allows us to make credible intervals for E[Y].

But what do we end up getting? One of two things happens. Either you choose too little smoothness, and E[g(X)] ends up concentrating around 0.5, no matter what data you put into the model. This is the phenomenon of Bayesian non-consistency, and a detailed explanation can be found in several of the listed references; but to put it briefly, sampling at a few isolated points gives you too little information about the rest of the function. This example is not as pathological as the ones used in the literature: if you sample infinitely many points, you will eventually get the posterior to concentrate around the true value of E[Y], but all the same, the convergence is ridiculously slow. Alternatively, you choose a super-high smoothness, and the posterior of E[g(X)] gives a nice interval around the sample value, just like in the Binomial example. But now if you look at your posterior draws of g(x), you’ll notice the functions are basically constants. Putting a prior on smoothness doesn’t change things: the posterior on smoothness doesn’t change, since you don’t actually have enough data to determine the smoothness of the function. The posterior average of E[g(X)] is no longer always 0.5: it gets affected by the data a little bit, since within the (say) 10% of posterior mass corresponding to the smooth part of the prior, the average of E[g(X)] responds to the data. But you are still almost as slow as before in converging to the truth.
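A uniform prior over a Lipschitz ball is awkward to simulate directly, but the “too little smoothness” failure mode already shows up in a cruder stand-in that I’ll use for illustration: take a large finite population of x values and put an independent uniform prior on each g(x), so that observations at sampled points carry no information about unsampled points. The population size and the 80-out-of-100 outcome below are just the numbers from the thought experiment.

```python
import numpy as np

# Crude stand-in for the "too rough" prior: g(x) independent Uniform(0,1) at each
# of N population points, so data at sampled points say nothing about the rest.
rng = np.random.default_rng(0)
N = 10_000                            # size of the finite population of x values
n = 100                               # sampled patients
y = np.array([1] * 80 + [0] * 20)     # observed outcomes: 80 survivals out of 100

sampled = rng.choice(N, size=n, replace=False)   # sampling without replacement, for simplicity

# Posterior draws of E[g(X)], the average of g over the whole population.
draws = []
for _ in range(2000):
    g = rng.uniform(size=N)                  # unobserved points: still the prior
    g[sampled] = rng.beta(1 + y, 2 - y)      # observed points: exact posterior Beta(1+y, 2-y)
    draws.append(g.mean())
draws = np.array(draws)

print("sample mean of Y:", y.mean())                          # 0.8
print("posterior mean of E[g(X)]:", round(draws.mean(), 3))   # ~0.5: barely moved by the data
```

A prior that ignores smoothness entirely is of course the extreme case, but it makes the mechanism plain: the 9,900 unsampled points keep dragging the population average back to the prior mean of 0.5.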

At the time that I started thinking about the above “uniform sampling” example, I was still convinced of a Bayesian resolution. Obviously, using a uniform prior over smooth functions is too naive: you can tell by seeing that the prior distribution of E[g(X)] is already highly concentrated around 0.5. How about a hierarchical model, where we first draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posterior for E[g(X)] converges to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.
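Continuing the same crude stand-in from the sketch above (with a Beta distribution around p as my own substitute for “smooth functions with mean value p”): given p, each observed Y is marginally Bernoulli(p), so the posterior on p, and hence on E[g(X)] for a large population, is essentially the familiar Beta(81, 21) and does track the data.

```python
from scipy.stats import beta

# Hierarchical stand-in: draw p ~ Uniform(0,1), then each g(x) from a Beta with mean p
# (my choice of distribution).  Each observed Y is then marginally Bernoulli(p), so with
# 80 successes in 100 draws the posterior on p is Beta(1 + 80, 1 + 20), and for a large
# population E[g(X)] is essentially p.
posterior = beta(81, 21)
print("posterior mean of E[g(X)]:", round(posterior.mean(), 3))   # ~0.79
print("99% credible interval:", [round(posterior.ppf(q), 3) for q in (0.005, 0.995)])
```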

But then I thought, what about more complicated problems than computing E[Y]? What if you have to compute the expectation of Y conditional on some complicated function of X taking on a certain value, i.e. E[Y|f(X) = 1]? In the frequentist world, you can easily estimate E[Y|f(X)=1] by rejection sampling: get a sample of individuals, and average the Y’s of the individuals whose X’s satisfy f(X) = 1 (see the sketch below). But how could you formulate a prior that has the same property? For a finite collection of functions, say {f1,...,f100}, you might be able to construct a prior for g(x) so that the posterior for E[g(X)|fi(X) = 1] converges to the truth for every i in {1,...,100}. I don’t know how to do so, but perhaps you know. But the frequentist intervals work for every function f! Can you construct a prior which can do the same?
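The frequentist recipe is just subsetting; here is a toy sketch with a made-up data-generating process and a made-up f, purely for illustration.

```python
import numpy as np

# Toy version of the frequentist recipe for E[Y | f(X) = 1]: keep the individuals
# with f(X) = 1 and average their Y's.  The data-generating process and the
# function f below are made up for illustration.
rng = np.random.default_rng(0)

X = rng.normal(size=(100_000, 3))
true_g = 1 / (1 + np.exp(-X[:, 0]))            # true P(Y = 1 | X), unknown to the analyst
Y = rng.binomial(1, true_g)

def f(x):
    return (x[:, 0] + x[:, 1] ** 2 > 1).astype(int)   # some "complicated" function of X

mask = f(X) == 1
print("estimate of E[Y | f(X) = 1]:", round(Y[mask].mean(), 3))
```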

I am happy to argue that a true Bayesian would not need consistency for every possible f in the mathematical universe. It is cool that frequentist inference works for such a general collection, but it may well be unnecessary for the world we live in. In other words, there may be functions f which are so ridiculous that even if you showed me, based on data from 1 million patients, that empirically E[Y|f(X)=1] = 0.9, I would not believe that E[Y|f(X)=1] is close to 0.9. It is a counterintuitive conclusion, but one that I am prepared to accept.

Yet the set of f’s which are not so ridiculous, which in fact I might accept as reasonable based on conventional science, may be so large as to render impossible the construction of a prior which could accommodate them all. And the Bayesian dream makes the far stronger demand that our prior capture not just our current understanding of science but also the flexibility of rational thought. I hold that given the appropriate evidence, rationalists can be persuaded to accept truths which they could not even imagine beforehand. Thinking about how we could possibly construct a prior to mimic this behavior, the Bayesian dream seems distant indeed.

Discussion

To be updated later… perhaps responding to some of your comments!

[1] Diaconis and Freedman, “On the Consistency of Bayes Estimates,” The Annals of Statistics, 1986.

[2] E. T. Jaynes, Probability Theory: The Logic of Science.

[3] https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

[4] Shipp et al., “Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning,” Nature Medicine, 2002.