The Objective Bayesian Programme

Followup to: Bayesian Flame.

This post is a chronicle of my attempts to understand Cyan's #2. (Bayesian Flame was an approximate parse of #1.) Warning: long, some math, lots of links, probably lots of errors. At the very least I want this to serve as a good reference for further reading.

Introduction

To the mathematical eye, many statistical problems share the following minimal structure (a concrete example appears just after the list):

  1. A space of parameters. (Imagine a freeform blob without assuming any metric or measure.)

  2. A space of possible outcomes. (Imagine another, similarly unstructured blob.)

  3. Each point in the parameter space determines a probability measure on the outcome space.
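For concreteness, here is one standard instance of that triple (my example, not from the original post): a coin of unknown bias, flipped n times.

```latex
\Theta = [0,1], \qquad
\mathcal{X} = \{0,1\}^n, \qquad
P_\theta(x_1,\dots,x_n) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i}.
```

Each θ fixes the distribution of the flips; nothing fixes a distribution over θ itself.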

By itself, this kind of input is too sparse to yield solutions to statistical problems. What additional structure on the spaces should we introduce?

The answer that we all know and love

Assuming some “prior” probability measure on the parameter space yields a solution that's unique, consistent and wonderful in all sorts of ways. This has led some people to adopt the “subjectivist” position saying priors are so basic that they ought not be questioned. One of its most prominent defenders was Leonard Jimmie Savage, who put forward the following argument:

Suppose, for example, that the person is offered an even-money bet for five dollars—or, to be ultra-rigorous, for five utiles—that internal combustion engines in American automobiles will be obsolete by 1970. If there is any event to which an objectivist would refuse to attach probability, that corresponding to the obsolescence in question is one… Yet, I think I may say without presumption that you would regard the bet against obsolescence as a very sound investment.

This is a fine argument for using priors when you're betting money, but there's a snag: however much you are willing to bet, that doesn't give you grounds to publish papers about the future you inferred from your intuitive prior! Any a priori information used in science has to be justified in a way that preserves scientific objectivity.

(At this point Eliezer raises the suggestion that scientists ought to communicate with likelihood ratios only. That might be a brave new world to live in; too bad we'll have to stop teaching kids that g approximately equals 9.8 m/s² and give them likelihood profiles instead.)
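For reference (standard definitions, not in the original post): the likelihood ratio two scientists would exchange for hypotheses H₁ and H₂ given data D is

```latex
\Lambda \;=\; \frac{P(D \mid H_1)}{P(D \mid H_2)},
```

and a “likelihood profile” for a continuous quantity like g is the whole function g ↦ P(D | g), reported up to a constant factor.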

Rather than dive deeper into the fascinating topic of “uninformative priors”, let's go back to the surface. Take a closer look at the basic formulation above to see what other structures we can introduce instead of priors to get interesting results.

The minimax approach

In the mid-20th century a statistician named Abraham Wald made a valiant effort to step outside the problem. His decision theory idea encompasses both frequentist and Bayesian inference. Roughly, it goes like this: we no longer know our prior probabilities, but we do know our utilities. More concretely, we compute a decision from the observed dataset, and later suffer a loss that depends on our decision and the actual true parameter value. Substituting different “spaces of decisions” and “loss functions”, we get a wide range of situations to study.
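In standard decision-theoretic notation (mine, not Wald's original): a decision rule δ maps each observed sample to a decision, and its quality under a given parameter value is summarized by the risk function

```latex
R(\theta, \delta) \;=\; \mathbb{E}_{X \sim P_\theta}\big[\,L(\theta, \delta(X))\,\big],
```

the expected loss when θ is the true parameter. A Bayesian with a prior π would simply average this over θ and minimize the result; the question is what to do without π.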

But wait! Doesn't the “optimal” decision depend on the prior distribution of parameters as well?

Wald's crucial insight was that… no, not necessarily.

If we don't know the prior and are trying to be “scientifically objective”, it makes sense to treat the problem of statistical inference as a game. The statistician chooses a decision rule, Nature chooses a true parameter value, randomness determines the payoff. Since the game is zero-sum, we can reasonably expect it to have a minimax value: there's a decision rule that minimizes the maximum loss the statistician can suffer, whatever Nature may choose.
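In the notation above, the minimax rule is the one that minimizes worst-case risk:

```latex
\delta^{*} \;=\; \arg\min_{\delta} \; \sup_{\theta \in \Theta} \; R(\theta, \delta).
```

(Randomized rules are allowed here, which is what lets the minimax value exist in reasonable generality.)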

Now, as Ken Binmore accurately noted, in real life you don't minimax unless “your relationship with the universe has reached such a low ebb that you keep your pants up with both belt and suspenders”, so the minimax principle gives off a whiff of the paranoia that we've come to associate with frequentism. Haha, gotcha! Wald's results apply to Bayesianism just as well. His “complete class theorem” proves that Bayesian-rational strategies with well-defined priors constitute precisely the class of non-dominated strategies in the game described. (If you squint the right way, this last sentence compresses the whole philosophical justification of Bayesianism.)
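One clean special case (standard, though the exact regularity conditions vary by textbook): when the parameter space is finite, every admissible rule, i.e. every rule not dominated by another, is a Bayes rule

```latex
\delta_\pi \;=\; \arg\min_{\delta}\; \int_\Theta R(\theta, \delta)\, d\pi(\theta)
```

for some prior π, and conversely every Bayes rule for a prior giving positive weight to each θ is admissible. In infinite parameter spaces the statement needs limits of Bayes rules and some regularity, but the moral survives.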

The game-theoretic approach gives our Bayesian friends even more than that. The statistical game's minimax decision rules often correspond to Bayes strategies with a certain uninformative prior, called the “least favorable prior” for that risk function. This gives you a frequentist-valid procedure that also happens to be Bayesian, which means immunity to Dutch books, negative masses and similar criticisms. In a particularly fascinating convergence, the well-known “reference prior” (the Jeffreys prior properly generalized to N dimensions) turns out to be asymptotically least favorable when optimizing the Shannon mutual information between the parameter and the sample.
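This duality is easy to watch happen in a toy problem. A minimal sketch (my own construction, not from the post): a coin has bias either 0.2 or 0.8, we see a single flip and must guess the bias, with 0-1 loss. Solving the resulting zero-sum game as two linear programs recovers both the statistician's minimax rule and Nature's least favorable prior.

```python
# Toy statistician-vs-Nature game: bias is 0.2 or 0.8, one flip observed,
# guess the bias, 0-1 loss.  Requires numpy and scipy.
import numpy as np
from scipy.optimize import linprog

thetas = [0.2, 0.8]                          # Nature's possible parameters
rules = [lambda x: 0.2,                      # always guess 0.2
         lambda x: 0.8,                      # always guess 0.8
         lambda x: 0.2 if x == 0 else 0.8,   # maximum-likelihood rule
         lambda x: 0.8 if x == 0 else 0.2]   # anti-ML rule
n = len(rules)

# Risk matrix: R[i, j] = P_theta(rule j guesses wrong) when theta = thetas[i].
R = np.array([[sum((rule(x) != th) * (th if x == 1 else 1.0 - th)
                   for x in (0, 1))
               for rule in rules]
              for th in thetas])

# Statistician's LP: mix the rules with weights p to minimize worst-case risk v.
# Variables are (p_1, ..., p_n, v); constraints say (R p)_i <= v for each theta.
res = linprog(c=[0.0] * n + [1.0],
              A_ub=np.hstack([R, -np.ones((len(thetas), 1))]),
              b_ub=np.zeros(len(thetas)),
              A_eq=[[1.0] * n + [0.0]], b_eq=[1.0],
              bounds=[(0, 1)] * n + [(None, None)])
print("minimax risk:", res.x[-1])            # 0.2
print("rule weights:", res.x[:n])            # all mass on the ML rule

# Nature's LP: pick a prior q over thetas so that even the best rule's
# Bayes risk (q' R)_j is as large as possible -- a least favorable prior.
res2 = linprog(c=[0.0] * len(thetas) + [-1.0],           # maximize v
               A_ub=np.hstack([-R.T, np.ones((n, 1))]),  # (q' R)_j >= v
               b_ub=np.zeros(n),
               A_eq=[[1.0] * len(thetas) + [0.0]], b_eq=[1.0],
               bounds=[(0, 1)] * len(thetas) + [(None, None)])
print("game value:", -res2.fun)              # 0.2 again, by minimax duality
print("least favorable prior:", res2.x[:len(thetas)])
```

The statistician's optimum puts all its weight on the maximum-likelihood rule, with worst-case risk 0.2, and that rule is exactly the Bayes rule under the prior the second LP returns: the least-favorable-prior story in miniature. (The least favorable prior isn't unique in this toy problem; any weight between 0.2 and 0.8 on the first bias works, so different solvers may print different priors with the same game value.)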

At this point the Bayesians in the audience should be rubbing their hands. I told ya it would be fun! Our frequentist friends, on the other hand, have dozed off, so let's pull another stunt to wake them up.

Confidence coverage demystified

Informally, we want to say things about the world like “I'm 90% sure that this physical constant lies within those bounds” and be actually right 90% of the time when we say such things.

...Semi-formally, we want a procedure that calculates from each sample a “confidence subset” of the parameter space, such that these subsets include the true parameter value with probability greater than or equal to 90%, while the sets themselves are as small as possible.
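In the usual notation (standard, stated here for reference): we want a set-valued function C of the sample X such that

```latex
P_\theta\big(\theta \in C(X)\big) \;\ge\; 0.9 \qquad \text{for every } \theta \in \Theta,
```

where the crucial point is that the probability is over X for each fixed θ, not over θ.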

(NB: this is not equivalent to deriving a “correct” posterior distribution on the parameter space. Not every method of choosing small subsets with given posterior masses will give you uniformly correct confidence coverage, and each such method corresponds to many different posterior distributions in the N-dimensional case.)

...Formally, we introduce a new structure on the parameter space—a “not-quite-measure” to determine the size of confidence sets—and then, upon receiving a sample, determine from it a 90% confidence set with the smallest possible “not-quite-measure”.

(NB: I'm calling it “not-quite-measure” because of a subtlety in the N-dimensional case. If we're estimating just one parameter out of several, the “measure” corresponds to span in that coordinate and thus is not additive under set union, hence “not-quite”. For example, two disjoint sets in the parameter space can project onto overlapping intervals of the coordinate we care about, so the “size” of their union is less than the sum of their sizes.)

Except this doesn't work. There might be two procedures to compute confidence sets, the first of which is sometimes better and sometimes worse than the second. We have no comparison function to determine the winner, and in reality the “uniformly most accurate” procedure doesn't always exist.

But if we replace the “size” of the confidence set with its expected size under each single parameter value, this gives us just enough information to apply the game-theoretic minimax approach. Solving the game thus gives us “minimax expected size” confidence sets, or MES, that people are actually using. Which isn't saying much, but still.
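Schematically (my notation, compressing the above): writing μ for the not-quite-measure, the MES procedure solves

```latex
\min_{C} \; \max_{\theta \in \Theta} \; \mathbb{E}_{X \sim P_\theta}\big[\mu(C(X))\big]
\qquad \text{subject to} \qquad
\inf_{\theta \in \Theta} P_\theta\big(\theta \in C(X)\big) \ge 0.9,
```

which is exactly a zero-sum game: the statistician picks C, Nature picks θ, and the payoff is the expected size of the reported set.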

More on subjectivity

The minimax principle sounds nice, but the construction of the least favorable prior distribution for any given experiment and risk function has a problem: it typically depends on the whole sample space and thus on the experiment's stopping rule. When do we stop gathering data? What subsets of observed samples do we thus rule out? In the general case the least favorable prior depends on the number of samples we intend to draw! This blatantly violates the likelihood principle that Eliezer so eloquently defended.
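The textbook illustration of what's at stake (standard, not from the post): flip a coin a fixed n times and observe k heads, versus flip until the k-th head and happen to need n flips. The two designs can produce literally the same data, and their likelihoods differ only by a constant factor,

```latex
L_{\text{binomial}}(\theta) = \binom{n}{k}\,\theta^{k}(1-\theta)^{n-k},
\qquad
L_{\text{neg-binomial}}(\theta) = \binom{n-1}{k-1}\,\theta^{k}(1-\theta)^{n-k},
```

so the likelihood principle demands identical inferences; but the two experiments have different sample spaces, so the minimax machinery can treat them differently.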

But, ordinary probability theory tells us unambiguously that 90% of your conclusions will be true whatever stopping rules you choose for each of them, as long as you choose before observing any data from the experiments. (Otherwise all bets are off, like if you'd decided to pick your Bayesian prior based on the data.) But, the conclusions themselves will be different from rule to rule. But, you cannot deliberately engineer a situation where the minimax of one stopping rule reliably makes you more wrong than another one...

Does this look more like an eternal mathematical law or an ad hoc tool? To me it looks like a mystery. Like frequentists are trying to solve a problem that Bayesians don't even attempt to solve. The answer is somewhere out there; we can guess that something like today's Bayesianism will be a big part of it, but not the only part.

Conclusion

When some field is afflicted with deep and persistent philosophical conflicts, this isn't necessarily a sign that one of the sides is right and the other is just being silly. It might be a sign that some crucial unifying insight is waiting several steps ahead. Minimaxing doesn't look to me like the beginning and end of “objective” statistics, but the right answer that we don't know yet has got to be at least this normal.

Further reading: James Berger, The Case for Objective Bayesian Analysis.