Beautiful Probability

Should we expect rationality to be, on some level, simple? Should we search and hope for underlying beauty in the arts of belief and choice?

Let me introduce this issue by borrowing a complaint of the late great Bayesian Master, E. T. Jaynes (1990):

“Two medical researchers use the same treatment independently, in different hospitals. Neither would stoop to falsifying the data, but one had decided beforehand that because of finite resources he would stop after treating N=100 patients, however many cures were observed by then. The other had staked his reputation on the efficacy of the treatment, and decided he would not stop until he had data indicating a rate of cures definitely greater than 60%, however many patients that might require. But in fact, both stopped with exactly the same data: n = 100 [patients], r = 70 [cures]. Should we then draw different conclusions from their experiments?” (Presumably the two control groups also had equal results.)

According to old-fashioned statistical procedure—which I believe is still being taught today—the two researchers have performed different experiments with different stopping conditions. The two experiments could have terminated with different data, and therefore represent different tests of the hypothesis, requiring different statistical analyses. It’s quite possible that the first experiment will be “statistically significant”, the second not.

Whether or not you are disturbed by this says a good deal about your attitude toward probability theory, and indeed, rationality itself.

Non-Bayesian statisticians might shrug, saying, “Well, not all statistical tools have the same strengths and weaknesses, y’know—a hammer isn’t like a screwdriver—and if you apply different statistical tools you may get different results, just like using the same data to compute a linear regression or train a regularized neural network. You’ve got to use the right tool for the occasion. Life is messy—”

And then there’s the Bayesian reply: “Excuse you? The evidential impact of a fixed experimental method, producing the same data, depends on the researcher’s private thoughts? And you have the nerve to accuse us of being ‘too subjective’?”

If Nature is one way, the likelihood of the data coming out the way we have seen will be one thing. If Nature is another way, the likelihood of the data coming out that way will be something else. But the likelihood of a given state of Nature producing the data we have seen has nothing to do with the researcher’s private intentions. So whatever our hypotheses about Nature, the likelihood ratio is the same, and the evidential impact is the same, and the posterior belief should be the same, between the two experiments. At least one of the two Old Style methods must discard relevant information—or simply do the wrong calculation—for the two methods to arrive at different answers.
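The point about stopping rules can be checked numerically. In the sketch below (my own illustration, not from the essay; the function name is made up), the stopping rule contributes only a combinatorial factor that does not depend on the cure rate θ, so it cancels out of any likelihood ratio computed on the data n = 100, r = 70:

```python
from math import comb

def likelihood_ratio(theta_a, theta_b, cures=70, patients=100):
    # Fixed-N experiment:       P(data | theta) = comb(100, 70) * kernel(theta)
    # Stop-when-confident rule: P(data | theta) = (other coefficient) * kernel(theta)
    # The coefficient is theta-independent either way, so it cancels in the ratio.
    def kernel(theta):
        return theta**cures * (1 - theta)**(patients - cures)
    return kernel(theta_a) / kernel(theta_b)

# The ratio between two hypotheses about the cure rate is the same
# no matter which researcher collected the data:
print(likelihood_ratio(0.7, 0.6))  # > 1: the data favor theta = 0.7
```

Whatever θ-independent coefficient a stopping rule attaches, the evidential impact, as measured by the likelihood ratio, is identical for both experiments.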

The ancient war between the Bayesians and the accursèd frequentists stretches back through decades, and I’m not going to try to recount that elder history in this blog post.

But one of the central conflicts is that Bayesians expect probability theory to be… what’s the word I’m looking for? “Neat?” “Clean?” “Self-consistent?”

As Jaynes says, the theorems of Bayesian probability are just that, theorems in a coherent proof system. No matter what derivations you use, in what order, the results of Bayesian probability theory should always be consistent—every theorem compatible with every other theorem.

If you want to know the sum of 10 + 10, you can redefine it as (2 * 5) + (7 + 3) or as (2 * (4 + 6)) or use whatever other legal tricks you like, but the result always has to come out to be the same, in this case, 20. If it comes out as 20 one way and 19 the other way, then you may conclude you did something illegal on at least one of the two occasions. (In arithmetic, the illegal operation is usually division by zero; in probability theory, it is usually an infinity that was not taken as the limit of a finite process.)

If you get the result 19 = 20, look hard for that error you just made, because it’s unlikely that you’ve sent arithmetic itself up in smoke. If anyone should ever succeed in deriving a real contradiction from Bayesian probability theory—like, say, two different evidential impacts from the same experimental method yielding the same results—then the whole edifice goes up in smoke. Along with set theory, ’cause I’m pretty sure ZF provides a model for probability theory.

Math! That’s the word I was looking for. Bayesians expect probability theory to be math. That’s why we’re interested in Cox’s Theorem and its many extensions, showing that any representation of uncertainty which obeys certain constraints has to map onto probability theory. Coherent math is great, but unique math is even better.

And yet… should rationality be math? It is by no means a foregone conclusion that probability should be pretty. The real world is messy—so shouldn’t you need messy reasoning to handle it? Maybe the non-Bayesian statisticians, with their vast collection of ad-hoc methods and ad-hoc justifications, are strictly more competent because they have a strictly larger toolbox. It’s nice when problems are clean, but they usually aren’t, and you have to live with that.

After all, it’s a well-known fact that you can’t use Bayesian methods on many problems because the Bayesian calculation is computationally intractable. So why not let many flowers bloom? Why not have more than one tool in your toolbox?

That’s the fundamental difference in mindset. Old School statisticians thought in terms of tools, tricks to throw at particular problems. Bayesians—at least this Bayesian, though I don’t think I’m speaking only for myself—we think in terms of laws.

Looking for laws isn’t the same as looking for especially neat and pretty tools. The second law of thermodynamics isn’t an especially neat and pretty refrigerator.

The Carnot cycle is an ideal engine—in fact, the ideal engine. No engine powered by two heat reservoirs can be more efficient than a Carnot engine. As a corollary, all thermodynamically reversible engines operating between the same heat reservoirs are equally efficient.

But, of course, you can’t use a Carnot engine to power a real car. A real car’s engine bears the same resemblance to a Carnot engine that the car’s tires bear to perfect rolling cylinders.

Clearly, then, a Carnot engine is a useless tool for building a real-world car. The second law of thermodynamics, obviously, is not applicable here. It’s too hard to make an engine that obeys it, in the real world. Just ignore thermodynamics—use whatever works.

This is the sort of confusion that I think reigns over they who still cling to the Old Ways.

No, you can’t always do the exact Bayesian calculation for a problem. Sometimes you must seek an approximation; often, indeed. This doesn’t mean that probability theory has ceased to apply, any more than your inability to calculate the aerodynamics of a 747 on an atom-by-atom basis implies that the 747 is not made out of atoms. Whatever approximation you use, it works to the extent that it approximates the ideal Bayesian calculation—and fails to the extent that it departs.

Bayesianism’s coherence and uniqueness proofs cut both ways. Just as any calculation that obeys Cox’s coherency axioms (or any of the many reformulations and generalizations) must map onto probabilities, so too, anything that is not Bayesian must fail one of the coherency tests. This, in turn, opens you to punishments like Dutch-booking (accepting combinations of bets that are sure losses, or rejecting combinations of bets that are sure gains).
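For a concrete instance of the Dutch-book punishment, consider an agent who posts incoherent prices on a binary event. The toy calculation below is my own illustration, assuming the agent will buy a one-unit ticket on each outcome at its posted price:

```python
def dutch_book_profit(price_A, price_not_A, stake=1.0):
    """Agent buys a ticket paying `stake` if A occurs, at price_A,
    and a ticket paying `stake` if not-A occurs, at price_not_A.
    Exactly one ticket pays off, so the agent's guaranteed profit
    is stake minus total cost; negative means a sure loss."""
    return stake - (price_A + price_not_A)

# Coherent prices (summing to 1) leave no guaranteed profit or loss:
print(dutch_book_profit(0.6, 0.4))  # 0.0
# Incoherent prices (summing to 1.2) guarantee a loss of about 0.2,
# no matter which outcome occurs:
print(dutch_book_profit(0.6, 0.6))
```

An agent whose prices violate the probability axioms accepts this combination of bets as fair, and loses on every possible outcome.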

You may not be able to compute the optimal answer. But whatever approximation you use, both its failures and successes will be explainable in terms of Bayesian probability theory. You may not know the explanation; that does not mean no explanation exists.

So you want to use a linear regression, instead of doing Bayesian updates? But look to the underlying structure of the linear regression, and you see that it corresponds to picking the best point estimate given a Gaussian likelihood function and a uniform prior over the parameters.
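As a sanity check on this correspondence, the sketch below (my own illustration, on synthetic data) fits a one-parameter regression two ways: by least squares, and by maximizing the Gaussian log-likelihood over a grid of candidate slopes. Since the log-likelihood is, up to constants, just the negative sum of squared residuals, the two estimates coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 * x + rng.normal(0, 0.1, size=50)  # true slope 2.0, Gaussian noise

# Least-squares slope (no intercept, for brevity):
w_ols = (x @ y) / (x @ x)

# Maximizer of the Gaussian log-likelihood over a grid of slopes.
# log p(y | w) = const - sum((y - w*x)**2) / (2 * sigma**2), so the
# maximizer is the least-squares minimizer, whatever sigma is.
grid = np.linspace(1.5, 2.5, 20001)
sse = ((y[None, :] - grid[:, None] * x[None, :]) ** 2).sum(axis=1)
w_mle = grid[sse.argmin()]

print(w_ols, w_mle)  # agree to within the grid spacing
```

The grid search stands in for "doing the likelihood maximization directly"; nothing about the least-squares formula needs to know that, which is exactly the point.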

You want to use a regularized linear regression, because that works better in practice? Well, that corresponds (says the Bayesian) to having a Gaussian prior over the weights.
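This correspondence, too, can be checked directly. In the sketch below (my own illustration, with an arbitrary penalty λ and synthetic data), the ridge solution matches the Bayesian posterior mode under a Gaussian likelihood with noise variance σ² and a zero-mean Gaussian prior with variance τ², once λ = σ²/τ²:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=30)

lam = 0.5  # ridge penalty, chosen arbitrarily

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Bayesian MAP estimate: Gaussian likelihood with noise variance sigma2,
# zero-mean Gaussian prior with variance tau2 on each weight. The
# posterior mode solves (X'X + (sigma2/tau2) I) w = X'y.
sigma2, tau2 = 0.01, 0.02  # any pair with sigma2 / tau2 == lam
w_map = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(3), X.T @ y)

print(np.allclose(w_ridge, w_map))  # identical estimates
```

The penalty strength is not an ad-hoc knob after all; it encodes how strongly the prior shrinks the weights relative to the noise level.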

Sometimes you can’t use Bayesian methods literally; often, indeed. But when you can use the exact Bayesian calculation that uses every scrap of available knowledge, you are done. You will never find a statistical method that yields a better answer. You may find a cheap approximation that works excellently nearly all the time, and it will be cheaper, but it will not be more accurate. Not unless the other method uses knowledge, perhaps in the form of disguised prior information, that you are not allowing into the Bayesian calculation; and then when you feed the prior information into the Bayesian calculation, the Bayesian calculation will again be equal or superior.

When you use an Old Style ad-hoc statistical tool with an ad-hoc (but often quite interesting) justification, you never know if someone else will come up with an even more clever tool tomorrow. But when you can directly use a calculation that mirrors the Bayesian law, you’re done—like managing to put a Carnot heat engine into your car. It is, as the saying goes, “Bayes-optimal”.

It seems to me that the toolboxers are looking at the sequence of cubes {1, 8, 27, 64, 125, …} and pointing to the first differences {7, 19, 37, 61, …} and saying “Look, life isn’t always so neat—you’ve got to adapt to circumstances.” And the Bayesians are pointing to the third differences, the underlying stable level {6, 6, 6, 6, 6, …}. And the critics are saying, “What the heck are you talking about? It’s 7, 19, 37, not 6, 6, 6. You are oversimplifying this messy problem; you are too attached to simplicity.”
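The cube example takes only a few lines to reproduce; the surface-level sequence is irregular, and the stable level sits two differences down:

```python
def differences(seq):
    # Successive first differences of a sequence.
    return [b - a for a, b in zip(seq, seq[1:])]

cubes = [n**3 for n in range(1, 9)]  # 1, 8, 27, 64, 125, ...
first = differences(cubes)           # 7, 19, 37, 61, ...
second = differences(first)          # 12, 18, 24, ...
third = differences(second)          # 6, 6, 6, ... the stable level
print(third)  # [6, 6, 6, 6, 6]
```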

It’s not necessarily simple on a surface level. You have to dive deeper than that to find stability.

Think laws, not tools. Needing to calculate approximations to a law doesn’t change the law. Planes are still atoms; they aren’t governed by special exceptions in Nature for aerodynamic calculations. The approximation exists in the map, not in the territory. You can know the second law of thermodynamics, and yet apply yourself as an engineer to build an imperfect car engine. The second law does not cease to be applicable; your knowledge of that law, and of Carnot cycles, helps you get as close to the ideal efficiency as you can.

We aren’t enchanted by Bayesian methods merely because they’re beautiful. The beauty is a side effect. Bayesian theorems are elegant, coherent, optimal, and provably unique because they are laws.

Addendum: Cyan directs us to chapter 37 of MacKay’s excellent statistics book, free online, for a more thorough explanation of the opening problem.

Jaynes, E. T. (1990). Probability Theory as Logic. In: P. F. Fougere (Ed.), Maximum Entropy and Bayesian Methods. Kluwer Academic Publishers.

MacKay, D. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge: Cambridge University Press.