The trouble with Bayes (draft)


This post requires some knowledge of Bayesian and Frequentist statistics, as well as probability. It is intended to explain one of the more advanced concepts in statistical theory—Bayesian non-consistency—to non-statisticians, and although the level required is much lower than would be needed to read some of the original papers on the topic[1], considerable background is still required.

The Bayesian dream

Bayesian methods are enjoying a well-deserved growth of popularity in the sciences. However, most practitioners of Bayesian inference, including most statisticians, see it as a practical tool. Bayesian inference has many desirable properties for a data analysis procedure: it allows for intuitive treatment of complex statistical models, including models with non-iid data, random effects, high-dimensional regularization, covariance estimation, outliers, and missing data. Problems which have been the subject of Ph.D. theses and entire careers in the Frequentist school, such as mixture models and the multi-armed bandit problem, can be satisfactorily handled by introductory-level Bayesian statistics.

A more extreme point of view, the flavor of subjective Bayes best exemplified by Jaynes’ famous book [2], and also by a sizable contingent of philosophers of science, elevates Bayesian reasoning to *the* methodology for probabilistic reasoning, in every domain, for every problem. One merely needs to encode one’s beliefs as a prior distribution, and Bayesian inference will yield the optimal decision or inference.

To a philosophical Bayesian, the epistemological grounding of most statistics (including “pragmatic Bayes”) is abysmal. The practice of data analysis is either dictated by arbitrary tradition and protocol, or consists of users creatively employing a diverse “toolbox” of methods justified by a mixture of incompatible theoretical principles: the minimax principle, invariance, asymptotics, maximum likelihood, or *gasp* “Bayesian optimality.” The result: a million possible methods exist for any given problem, and a million interpretations exist for any data set, all depending on how one frames the problem. Given one million different interpretations of the data, which one should *you* believe?

Why the ambiguity? Take the textbook problem of determining whether a coin is fair or weighted, based on the data obtained from, say, flipping it 10 times. Keep in mind, a principled approach to statistics decides the rule for decision-making before you see the data. So, what rule would you use for your decision? One rule is, “declare it’s weighted if either 10/10 flips are heads or 0/10 flips are heads.” Another rule is, “always declare it to be weighted.” Or, “always declare it to be fair.” All in all, there are 11 possible outcomes (supposing we only care about the total number of heads) and therefore 2^11 possible decision rules. We can probably rule out most of them as nonsensical, like “declare it to be weighted if 5/10 are heads, and fair otherwise,” since 5/10 seems like the fairest outcome possible. But among the remaining possibilities, there is no obvious way to choose the “best” rule. After all, the performance of the rule, defined as the probability that you will draw the correct conclusion from the data, depends on the unknown state of the world, i.e. the true probability of flipping heads for that particular coin.
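To make this concrete, here is a small sketch of how the performance of the first rule above depends on the unknown truth (the weighted-coin probabilities 0.7 and 0.95 are my own arbitrary choices for illustration):

```python
# Probability that the rule "declare weighted iff 0/10 or 10/10 heads"
# actually declares the coin weighted, as a function of the unknown
# heads-probability p.
def p_declare_weighted(p, n=10):
    return p**n + (1 - p)**n

# If the coin is fair (p = 0.5), the rule is correct when it declares "fair";
# if the coin is weighted, it is correct when it declares "weighted".
for p in (0.5, 0.7, 0.95):
    if p == 0.5:
        correct = 1 - p_declare_weighted(p)
    else:
        correct = p_declare_weighted(p)
    print(p, round(correct, 4))
```

The same rule is excellent when the truth is p = 0.5 (correct with probability about 0.998) or p = 0.95, but terrible when p = 0.7, so no single rule dominates across all possible states of the world.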

The Bayesian approach “cuts” the Gordian knot of choosing the best rule by assuming a prior distribution over the unknown state of the world. Under this prior distribution, one can compute the average performance of any decision rule, and choose the best one. For example, suppose your prior is that with probability 99.9999%, the coin is fair. Then the best decision rule would be to “always declare it to be fair!”
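Under such a prior, the Bayes-optimal rule can be found by brute force over all 2^11 rules. A minimal sketch (the heads-probability of 0.8 for a weighted coin is my own added assumption, not part of the example above):

```python
from itertools import product
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

n = 10
prior_fair = 0.999999   # prior probability that the coin is fair
p_weighted = 0.8        # assumed heads-probability if weighted (arbitrary)

# A rule assigns "weighted" (1) or "fair" (0) to each head-count 0..10;
# score each of the 2^11 rules by its average probability of being correct.
best_rule, best_score = None, -1.0
for rule in product([0, 1], repeat=n + 1):
    score = prior_fair * sum(binom_pmf(k, n, 0.5) * (1 - rule[k]) for k in range(n + 1)) \
          + (1 - prior_fair) * sum(binom_pmf(k, n, p_weighted) * rule[k] for k in range(n + 1))
    if score > best_score:
        best_rule, best_score = rule, score

print(best_rule)  # all zeros: "always declare it to be fair"
```

With a prior this lopsided, even 10/10 heads is more likely to have come from a fair coin than from the rare weighted one, so the optimal rule ignores the data entirely.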

The Bayesian approach gives you the optimal decision rule for the problem, as soon as you come up with a model for the data and a prior for your model. But when you are looking at data analysis problems in the real world (as opposed to a probability textbook), the choice of model is rarely unambiguous. Hence, for me, the standard Bayesian approach does not go far enough—if there are a million models you could choose from, you still get a million different conclusions as a Bayesian.

Hence, one could argue that a “pragmatic” Bayesian who thinks up a new model for every problem is just as epistemologically suspect as any Frequentist. Only in the strongest form of subjective Bayesianism can one escape this ambiguity. The subjective Bayesian’s dream is to start out in life with a single model. A single prior. For the entire world. This “world prior” would contain the entirety of one’s own life experience, and the grand total of human knowledge. Surely, writing out this prior is impossible. But the point is that a true Bayesian must behave (at least approximately) as if they were driven by such a universal prior. In principle, having such a universal prior (at least conceptually) solves the problem of choosing models and priors for particular problems: the priors and models you choose are determined by the posterior of your universal prior. For example, why did you decide on a linear model for your economics data? It’s because according to your universal posterior, your particular economic data is well-described by such a model with high probability.

The main practical consequence of the universal prior is that your inferences in one problem should be consistent with your inferences in another, related problem. Even if the subjective Bayesian never writes out a “grand model,” their integrated approach to data analysis for related problems still distinguishes their approach from the piecemeal approach of frequentists, who tend to treat each data analysis problem as if it occurs in an isolated universe. (So I claim, though I cannot point to any real example of such a subjective Bayesian.)

Yet, even if the subjective Bayesian ideal could be realized, many philosophers of science (e.g. Deborah Mayo) would consider it just as ambiguous as non-Bayesian approaches, since even if you have an unambiguous procedure for forming personal priors, your priors are still going to differ from mine. I don’t consider this a defect, since my worldview necessarily does differ from yours. My ultimate goal is to make the best decision for myself. That said, such egocentrism, even if rationally motivated, may indeed be poorly suited for a collaborative enterprise like science.

For me, the far more troublesome objection to the “Bayesian dream” is the question, “How would you actually go about constructing this prior that represents all of your beliefs?” Looking in the Bayesian literature, one does not find any convincing examples of a user of Bayesian inference managing to actually encode all (or even a tiny portion) of their beliefs in the form of a prior—in fact, for the most part, we see alarmingly little thought or justification being put into the construction of priors.

Nevertheless, I myself have been one of these “hardcore Bayesians,” at least from a philosophical point of view, ever since I started learning about statistics. My faith in the “Bayesian dream” persisted even after spending three years in the Ph.D. program at Stanford (a department with a heavy bias towards Frequentism) and even after I personally started doing research in frequentist methods. (I see frequentist inference as a poor man’s approximation to ideal Bayesian inference.) Though I was aware of the Bayesian non-consistency results, I largely dismissed them as mathematical pathologies. And while we were still a long way from achieving universal inference, I held the optimistic view that improved technology and theory might one day make the “Bayesian dream” achievable. However, I could not find a way to ignore one particular example on Wasserman’s blog[3], due to its relevance to very practical problems in causal inference. Eventually I thought of an even simpler counterexample, which devastated my faith in the possibility of constructing a universal prior. Perhaps a fellow Bayesian can find a solution to this quagmire, but I am not holding my breath.

The root of the problem is the extreme degree of ignorance we have about our world, the surprisingness of many true scientific discoveries, and the relative ease with which we accept these surprises. If we consider this behavior rational (which I do), then the subjective Bayesian is obligated to construct a prior which captures this behavior. Yet the diversity of possible surprises the model must be able to accommodate makes it practically impossible (if not mathematically impossible) to construct such a prior. The alternative is to reject all possibility of surprise, and refuse to update any faster than a universal prior would (extremely slowly), which strikes me as a rather poor epistemological policy.

In the rest of the post, I’ll motivate my example, sketch out a few mathematical details (explaining them as best I can to a general audience), then discuss the implications.

Introduction: Cancer classification

Biology and medicine are currently adapting to the wealth of information we can obtain by using high-throughput assays: technologies which can rapidly read the DNA of an individual and measure the concentrations of messenger RNA, metabolites, and proteins. In the early days of this “large-scale” approach to biology, which began with the Human Genome Project, some optimists had hoped that such an unprecedented torrent of raw data would allow scientists to quickly “crack the genetic code.” By now, any such optimism has been washed away by the overwhelming complexity and uncertainty of human biology—a complexity which has been made clearer than ever by the flood of data—and replaced with a sober appreciation that in the new “big data” paradigm, making a discovery is a much easier task than understanding any of those discoveries.

Enter the application of machine learning to this large-scale biological data. Scientists take these massive datasets containing patient outcomes, demographic characteristics, and high-dimensional genetic, neurological, and metabolic data, and analyze them using algorithms like support vector machines, logistic regression, and decision trees to learn predictive models relating key biological variables, “biomarkers,” to outcomes of interest.

To give a specific example, take a look at this abstract from the Shipp et al. paper on predicting survival outcomes for cancer patients [4]:

Diffuse large B-cell lymphoma (DLBCL), the most common lymphoid malignancy in adults, is curable in less than 50% of patients. Prognostic models based on pre-treatment characteristics, such as the International Prognostic Index (IPI), are currently used to predict outcome in DLBCL. However, clinical outcome models identify neither the molecular basis of clinical heterogeneity, nor specific therapeutic targets. We analyzed the expression of 6,817 genes in diagnostic tumor specimens from DLBCL patients who received cyclophosphamide, adriamycin, vincristine and prednisone (CHOP)-based chemotherapy, and applied a supervised learning prediction method to identify cured versus fatal or refractory disease. The algorithm classified two categories of patients with very different five-year overall survival rates (70% versus 12%). The model also effectively delineated patients within specific IPI risk categories who were likely to be cured or to die of their disease. Genes implicated in DLBCL outcome included some that regulate responses to B-cell receptor signaling, critical serine/threonine phosphorylation pathways and apoptosis. Our data indicate that supervised learning classification techniques can predict outcome in DLBCL and identify rational targets for intervention.

The term “supervised learning” refers to any algorithm for learning a model that predicts some outcome Y (which could be either categorical or numeric) from covariates or features X. In this particular paper, the authors used a relatively simple linear model (which they called “weighted voting”) for prediction.

A linear model is fairly easy to interpret: it produces a single “score variable” via a weighted average of a number of predictor variables. Then it predicts the outcome (say “survival” or “no survival”) based on a rule like, “Predict survival if the score is larger than 0.” Far more advanced machine learning models have been developed, including the “deep neural networks” which are winning all of the image recognition and machine translation competitions at the moment. These deep neural networks are especially notorious for being difficult to interpret. Along with similarly complicated models, neural networks are often called “black box models”: although you can get miraculously accurate answers out of the “box,” peering inside won’t give you much of a clue as to how it actually works.
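As an illustration, the scoring step of such a linear classifier takes only a few lines. The weights and expression values below are invented for the example, not taken from the paper:

```python
import numpy as np

# Hypothetical per-biomarker weights, as if learned from training data
weights = np.array([0.8, -1.2, 0.5])
# Hypothetical expression levels for one patient
patient = np.array([1.1, 0.3, 2.0])

score = weights @ patient  # single "score variable": a weighted sum
prediction = "survival" if score > 0 else "no survival"
print(round(score, 2), prediction)  # 1.52 survival
```

The interpretability comes from the weights themselves: each weight says how strongly (and in which direction) one biomarker pushes the score.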

Now it is time for the first thought experiment. Suppose a follow-up paper to the Shipp paper reports dramatically improved prediction of survival outcomes for lymphoma patients. The authors of this follow-up paper trained their model on a “training sample” of 500 patients, then used it to predict the five-year outcome of chemotherapy patients in a “test sample” of 1000 patients. It correctly predicts the outcome (“survival” vs. “no survival”) for 990 of the 1000 patients.

Question 1: what is your opinion of the predictive accuracy of this model on the population of chemotherapy patients? Suppose that publication bias is not an issue (the authors of this paper designed the study in advance and committed to publishing) and suppose that the test sample of 1000 patients is “representative” of the entire population of chemotherapy patients.

Question 2: does your judgment depend on the complexity of the model they used? What if the authors used an extremely complex and counterintuitive model, and cannot even offer any justification or explanation for why it works? (Nevertheless, their peers have independently confirmed the predictive accuracy of the model.)

A Frequentist approach

The Frequentist answer to the thought experiment is as follows. The accuracy of the model is a probability p which we wish to estimate. The number of successes on the 1000 test patients is Binomial(1000, p). Based on the data, one can construct a confidence interval: say, we are 99% confident that the accuracy is above 83%. What does 99% confident mean? I won’t try to explain, but simply say that in this particular situation, “I’m pretty sure” that the accuracy of the model is above 83%.
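(The “83%” above is a deliberately loose illustration. For the actual counts in the thought experiment, an exact Clopper-Pearson lower confidence bound can be computed directly; a sketch, assuming SciPy is available:)

```python
from scipy.stats import beta

successes, n = 990, 1000
alpha = 0.01  # one-sided 99% confidence

# Exact (Clopper-Pearson) lower confidence bound for the accuracy p:
# the p under which observing >= 990 successes has probability alpha,
# obtained via the beta-quantile representation of the binomial tail.
lower = beta.ppf(alpha, successes, n - successes + 1)
print(round(lower, 3))
```

For 990/1000 the exact bound is far higher than 83%, so any valid 99% interval here is in fact quite tight.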

A Bayesian approach

The Bayesian interjects, “Hah! You can’t explain what your confidence interval actually means!” He puts a uniform prior on the probability p. The posterior distribution of p, conditional on the data, is Beta(991, 11). This gives a 99% credible interval for p of [0.978, 0.995]. You can actually interpret this interval in probabilistic terms, and it is a tight interval as well. Seems like a Bayesian victory...?
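The arithmetic here is easy to check with SciPy’s Beta quantile function:

```python
from scipy.stats import beta

# Uniform prior + 990 successes and 10 failures => posterior Beta(991, 11)
a, b = 990 + 1, 10 + 1

# Equal-tailed 99% credible interval for p
lo, hi = beta.ppf([0.005, 0.995], a, b)
print(round(lo, 3), round(hi, 3))
```

This reproduces an interval close to the [0.978, 0.995] quoted above; the exact endpoints depend on how the 1% of tail probability is split.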

A subjective Bayesian approach

As I have argued before, a Bayesian approach which comes up with a model after hearing about the problem is bound to suffer from the same inconsistency and arbitrariness as any non-Bayesian approach. You might assume a uniform distribution for p in this problem… but what if yet another paper comes along with a similar prediction model? You would need a joint distribution for the current model and the new model. What if a theory comes along that could help explain the success of the current method? The parameter p might take on a new meaning in this context.

So as a subjective Bayesian, I argue that slapping a uniform prior on the accuracy is the wrong approach. But I’ll stop short of actually constructing a Bayesian model of the entire world: let’s say we want to restrict our attention to this particular issue of cancer prediction. We want to model the dynamics behind cancer and cancer treatment in humans. Needless to say, the model is still ridiculously complicated. However, I don’t think it’s out of reach for a well-funded, large collaborative effort of scientists.

Roughly speaking, the model can be divided into a distribution over theories of human biology and, conditional on the theory of biology, a coarse-grained model of an individual patient. The model would not include every cell, every molecule, etc., but it would contain many latent variables in addition to the variables measured in any particular cancer study. Let’s call the variables actually measured in the study X, and the survival outcome Y.

Now here is the epistemologically correct way to answer the thought experiment. Take a look at the X’s and Y’s of the patients in the training and test sets. Update your probabilistic model of human biology based on the data. Then take a look at the actual form of the classifier: it’s a function f() mapping X’s to Y’s. The accuracy of the classifier is no longer a parameter: it’s a quantity Pr[f(X) = Y] which has a distribution under your posterior. That is, for any given “theory of human biology,” Pr[f(X) = Y] has a fixed value; over the distribution of possible theories of human biology (based on the data of the current study as well as all previous studies and your own beliefs), Pr[f(X) = Y] has a distribution, and therefore an average. But what will this posterior give you? Will you get something similar to the interval [0.978, 0.995] you got from the “pragmatic Bayes” approach?

Who knows? But I would guess in all likelihood not. My guess is that you would get a very different interval from [0.978, 0.995], because in this complex model there is no direct link between the empirical success rate of prediction and the quantity Pr[f(X) = Y]. But my intuition for this comes from the following simpler framework.

A non-parametric Bayesian approach

Instead of reasoning about a grand Bayesian model of biology, I now take a middle ground, and suggest that while we don’t need to capture the entire latent dynamics of cancer, we should at the very least try to include the X’s and the Y’s in the model, instead of merely abstracting the whole experiment as a Binomial trial (as the frequentist and pragmatic Bayesian did). Hence we need a prior over joint distributions of (X, Y). And yes, I do mean a prior distribution over probability distributions: we are saying that (X, Y) has some unknown joint distribution, which we treat as being drawn at random from a large collection of distributions. This is therefore a non-parametric Bayes approach: the term “non-parametric” means that the number of parameters in the model is not finite.

Since in this case Y is a binary outcome, a joint distribution can be decomposed into a marginal distribution over X and a function g(x) giving the conditional probability that Y = 1 given X = x. The marginal distribution is not so interesting or important for us, since it simply reflects the composition of the population of patients. For the purposes of this example, let us say that the marginal is known (e.g., a finite distribution over the population of US cancer patients). What we want to know is the probability of patient survival, and this is given by the function g(X) for the particular patient’s X. Hence, we will mainly deal with constructing a prior over g(x).

To construct a prior, we need to think of intuitive properties of the survival probability function g(x). If x is similar to x’, then we expect the survival probabilities to be similar. Hence the prior on g(x) should be over random, smooth functions. But we need to choose the smoothness so that the prior does not consist of almost-constant functions. Suppose for now that we choose a particular class of smooth functions (e.g., functions with a certain Lipschitz norm) and choose our prior to be uniform over functions of that smoothness. We could go further and put a prior on the smoothness hyperparameter, but for now we won’t.

Now, although I assert my faithfulness to the Bayesian ideal, I still want to think about how whatever prior we choose would allow us to answer some simple thought experiments. Why is that? I hold that ideal Bayesian inference should capture and refine what I take to be “rational behavior.” Hence, if a prior produces irrational outcomes, I reject that prior as not reflecting my beliefs.

Take the following thought experiment: we simply want to estimate the expected value of Y, E[Y]. Hence, we draw 100 patients independently with replacement from the population and record their outcomes: suppose the sum is 80 out of 100. The Frequentist (and pragmatic Bayesian) would end up concluding that with high probability/confidence/whatever, the expected value of Y is around 0.8, and I would hold that an ideal rationalist would come to a similar belief. But what would our non-parametric model say? We draw a random function g(x) conditional on our particular observations; we get a quantity E[g(X)] for each instantiation of g(x), and the distribution of E[g(X)] over the posterior allows us to make credible intervals for E[Y].

But what do we end up getting? One of two things happens. Either you choose too little smoothness, and E[g(X)] ends up concentrating around 0.5, no matter what data you put into the model. This is the phenomenon of Bayesian non-consistency, and a detailed explanation can be found in several of the listed references; to put it briefly, sampling at a few isolated points gives you too little information about the rest of the function. This example is not as pathological as the ones used in the literature: if you sample infinitely many points, you will eventually get the posterior to concentrate around the true value of E[Y], but all the same, the convergence is ridiculously slow. Alternatively, you choose a super-high smoothness, and the posterior of E[g(X)] concentrates in a nice interval around the sample value, just as in the Binomial example. But now if you look at your posterior draws of g(x), you’ll notice the functions are basically constants. Putting a prior on smoothness doesn’t change things: the posterior on smoothness doesn’t change, since you don’t actually have enough data to determine the smoothness of the function. The posterior average of E[g(X)] is no longer always 0.5: it is affected a little by the data, since within the 10% of posterior mass corresponding to the smooth prior, the average of E[g(X)] responds to the data. But you are still almost as slow as before in converging to the truth.

At the time I started thinking about the above “uniform sampling” example, I was still convinced of a Bayesian resolution. Obviously, using a uniform prior over smooth functions is too naive: you can tell by seeing that the prior distribution of E[g(X)] is already highly concentrated around 0.5. How about a hierarchical model, where we first draw a parameter p from the uniform distribution, and then draw g(x) from the uniform distribution over smooth functions with mean value equal to p? This gets you non-constant g(x) in the posterior, while your posterior for E[g(X)] converges to the truth as quickly as in the Binomial example. Arguing backwards, I would say that such a prior comes closer to capturing my beliefs.
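A crude simulation illustrates both points. Below, random Fourier series serve as my own stand-in for draws from a prior over smooth [0, 1]-valued functions (not a faithful construction of the priors discussed above): under the flat prior, the prior spread of E[g(X)] is tiny, while the hierarchical version spreads E[g(X)] out over [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0, 1, 200)  # grid standing in for the X population

def random_smooth(center, terms=30, scale=0.3):
    """A random wiggly function around `center`, clipped to [0, 1]."""
    amps = rng.normal(0, scale / np.sqrt(terms), terms)
    phases = rng.uniform(0, 2 * np.pi, terms)
    g = center + sum(a * np.sin(2 * np.pi * (k + 1) * xs + ph)
                     for k, (a, ph) in enumerate(zip(amps, phases)))
    return np.clip(g, 0, 1)

# Flat prior centered at 0.5: E[g(X)] is already concentrated near 0.5
flat = [random_smooth(0.5).mean() for _ in range(2000)]
# Hierarchical prior: first draw the mean level p ~ Uniform(0, 1)
hier = [random_smooth(rng.uniform()).mean() for _ in range(2000)]

print(round(np.std(flat), 3), round(np.std(hier), 3))
```

Since the flat prior barely lets E[g(X)] vary a priori, no reasonable amount of data can move its posterior far from 0.5, whereas the hierarchical prior leaves E[g(X)] free to track the observed frequency.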

But then I thought: what about problems more complicated than computing E[Y]? What if you have to compute the expectation of Y conditional on some complicated function of X taking on a certain value, i.e. E[Y|f(X) = 1]? In the frequentist world, you can easily estimate E[Y|f(X) = 1] by rejection sampling: get a sample of individuals, then average the Y’s of the individuals whose X’s satisfy f(X) = 1. But how could you formulate a prior with the same property? For a finite collection of functions f, say {f1, ..., f100}, you might be able to construct a prior for g(x) so that the posterior for E[g(X)|fi(X) = 1] converges to the truth for every i in {1, ..., 100}. I don’t know how to do so, but perhaps you do. But the frequentist intervals work for every function f! Can you construct a prior which can do the same?
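The frequentist recipe in this paragraph is essentially one line of code. A sketch with an invented ground truth g and an arbitrary event f(X) = 1 (both hypothetical, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

true_g = lambda x: 0.2 + 0.6 * x        # invented survival probability g(x)
f = lambda x: np.sin(20 * x) > 0        # an arbitrary "complicated" event

x = rng.uniform(size=100_000)
y = rng.uniform(size=x.size) < true_g(x)  # Y ~ Bernoulli(g(X))

# Rejection sampling: keep individuals with f(X) = 1, average their Y's
estimate = y[f(x)].mean()
print(round(estimate, 3))
```

The same two lines work no matter which f you plug in, which is exactly the generality the text says a single prior would struggle to match.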

I am happy to argue that a true Bayesian would not need consistency for every possible f in the mathematical universe. It is cool that frequentist inference works for such a general collection, but it may well be unnecessary for the world we live in. In other words, there may be functions f which are so ridiculous that even if you showed me that empirically E[Y|f(X) = 1] = 0.9, based on data from 1 million patients, I would not believe that E[Y|f(X) = 1] was close to 0.9. It is a counterintuitive conclusion, but one that I am prepared to accept.

Yet the set of f’s which are not so ridiculous, which in fact I might accept as reasonable based on conventional science, may be so large as to render impossible the construction of a prior which could accommodate them all. And the Bayesian dream makes the far stronger demand that our prior capture not just our current understanding of science but also the flexibility of rational thought. I hold that given the appropriate evidence, rationalists can be persuaded to accept truths which they could not even imagine beforehand. Thinking about how we could possibly construct a prior to mimic this behavior, the Bayesian dream seems distant indeed.


To be updated later… perhaps responding to some of your comments!

[1] Diaconis and Freedman, “On the Consistency of Bayes Estimates”

[2] E. T. Jaynes, Probability Theory: The Logic of Science

[3] https://normaldeviate.wordpress.com/2012/08/28/robins-and-wasserman-respond-to-a-nobel-prize-winner/

[4] Shipp et al., “Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning.” Nature Medicine