# Bayesian Flame

There once lived a great man named E.T. Jaynes. He knew that Bayesian inference is the only way to do statistics logically and consistently, standing on the shoulders of misunderstood giants Laplace and Gibbs. On numerous occasions he vanquished traditional “frequentist” statisticians with his superior math, demonstrating to anyone with half a brain how the Bayesian way gives faster and more correct results in each example. The weight of evidence falls so heavily on one side that it makes no sense to argue anymore. The fight is over. Bayes wins. The universe runs on Bayes-structure.

Or at least that’s what you believe if you learned this stuff from Overcoming Bias.

As I did, until two days ago, when Cyan hit me over the head with something utterly incomprehensible. I suddenly had to go out and understand this stuff, not just believe it. (The original intention, if I remember it correctly, was to impress you all by pulling a Jaynes.) Now I’ve come back and intend to provoke a full-on flame war on the topic. Because if we can have thoughtful flame wars about gender but not math, we’re a bad community. Bad, bad community.

If you’re like me two days ago, you kinda “understand” what Bayesians do: assume a prior probability distribution over hypotheses, use evidence to morph it into a posterior distribution over the same hypotheses, and bless the resulting numbers as your “degrees of belief”. But chances are that you have a very vague idea of what frequentists do, apart from deriving half-assed results with their ad hoc tools.

Well, here’s the ultra-short version: frequentist statistics is the art of drawing true conclusions about the real world, instead of assuming prior degrees of belief and coherently adjusting them to avoid Dutch books.

And here’s an ultra-short example of what frequentists can do: estimate 100 independent unknown parameters from 100 different sample data sets and have 90 of the estimates turn out to be true to fact afterward. Like, fo’real. Always 90% in the long run, truly, irrevocably and forever. No Bayesian method known today can reliably do the same: the outcome will depend on the priors you assume for each parameter. I don’t believe you’re going to get lucky with all 100. And even if I believed you a priori (ahem), that don’t make it true.

(That’s what Jaynes did to achieve his awesome victories: use trained intuition to pick good priors by hand on a per-sample basis. Maybe you can learn this skill somewhere, but not from the Intuitive Explanation.)

How in the world do you do inference without a prior? Well, the characterization of frequentist statistics as “trickery” is totally justified: it has no single coherent approach, and the tricks often give conflicting results. Most everybody agrees that you can’t do better than Bayes if you have a clear-cut prior; but if you don’t, no one is going to kick you out. We sympathize with your predicament and will gladly sell you some twisted technology!

Confidence intervals: imagine you somehow process some sample data to get an interval. Further imagine that, hypothetically, for any given hidden parameter value, this calculation algorithm applied to data sampled under that parameter value yields an interval that covers it with probability 90%. Believe it or not, this perverse trick works 90% of the time without requiring any prior distribution on parameter values.
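The coverage claim is easy to check by simulation. Here is a minimal sketch (not from the original post; the sample size, confidence level, and parameter values are my own arbitrary choices) using the standard z-interval for the mean of normal data with known variance:

```python
import numpy as np

def coverage(mu, n=10, trials=20000, seed=0):
    """Fraction of trials in which the two-sided 90% z-interval for the
    mean of N(mu, 1) data covers the true mu."""
    rng = np.random.default_rng(seed)
    z = 1.6449  # 95th percentile of N(0,1), giving a two-sided 90% interval
    hits = 0
    for _ in range(trials):
        x = rng.normal(mu, 1.0, size=n)
        half = z / np.sqrt(n)          # known sigma = 1
        hits += abs(x.mean() - mu) <= half
    return hits / trials

# Coverage is ~0.9 no matter which hidden mu generated the data:
for mu in (-3.0, 0.0, 42.0):
    print(mu, coverage(mu))
```

No prior over mu appears anywhere; the 90% guarantee holds separately for every fixed parameter value.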

Unbiased estimators: you process the sample data to get a number whose expectation magically coincides with the true parameter value.
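A quick illustrative sketch of unbiasedness (the numbers are my own choices, not from the post): averaged over many repeated samples, the sample mean and the n-1 sample variance match the true values.

```python
import numpy as np

# The sample mean, and the sample variance with the n-1 divisor, are
# unbiased: averaged over repeated samples, their expectations equal
# the true parameter values.
rng = np.random.default_rng(1)
true_mean, true_var, n, trials = 7.0, 4.0, 5, 100000

samples = rng.normal(true_mean, np.sqrt(true_var), size=(trials, n))
mean_of_means = samples.mean(axis=1).mean()
mean_of_vars = samples.var(axis=1, ddof=1).mean()   # ddof=1 -> divide by n-1

print(mean_of_means)  # close to 7.0
print(mean_of_vars)   # close to 4.0
```

Dividing by n instead of n-1 would give a systematically low variance estimate; the n-1 divisor is exactly the correction that makes the expectation come out right.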

Hypothesis testing: I give you a black-box random distribution and claim it obeys a specified formula. You sample some data from the box and inspect it. Frequentism allows you to ~~call me a liar and be wrong no more than 10% of the time~~ reject truthful claims no more than 10% of the time, guaranteed, no prior in sight. (Thanks Eliezer for calling out the mistake, and conchis for the correction!)
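A sketch of that guarantee (my own toy setup, not from the post): the claimed formula is N(0,1), and we cry “liar” when the sample mean exceeds the two-sided 10%-level z threshold. When the claim is actually true, we reject it about 10% of the time.

```python
import numpy as np

def reject(data, n):
    """Call the claim 'data ~ N(0,1)' a lie if the sample mean is
    further from 0 than the two-sided 10%-level z threshold."""
    return abs(np.mean(data)) > 1.6449 / np.sqrt(n)

rng = np.random.default_rng(2)
n, trials = 25, 20000

# The box really does obey the claimed formula; we falsely cry 'liar'
# about 10% of the time, as promised:
false_alarms = sum(reject(rng.normal(0, 1, n), n) for _ in range(trials))
print(false_alarms / trials)  # ~0.10
```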

But this is getting too academic. I ought to throw you dry wood, good flame material. This hilarious PDF from Andrew Gelman should do the trick. Choice quote:

Well, let me tell you something. The 50 states aren’t exchangeable. I’ve lived in a few of them and visited nearly all the others, and calling them exchangeable is just silly. Calling it a hierarchical or multilevel model doesn’t change things—it’s an additional level of modeling that I’d rather not do. Call me old-fashioned, but I’d rather let the data speak without applying a probability distribution to something like the 50 states which are neither random nor a sample.

As a bonus, the bibliography to that article contains such marvelous titles as “Why Isn’t Everyone a Bayesian?” And Larry Wasserman’s followup is also quite disturbing.

Another stick for the fire is provided by Shalizi, who (among other things) makes the correct point that a good Bayesian must never be uncertain about the probability of any future event. That’s why he calls Bayesians “Often Wrong, Never In Doubt”:

The Bayesian, by definition, believes in a joint distribution of the random sequence X and of the hypothesis M. (Otherwise, Bayes’s rule makes no sense.) This means that by integrating over M, we get an unconditional, marginal probability for f.

For my final quote, it seems only fair to add one more polemical summary of Cyan’s point, the one that made me sit up and look around in a bewildered manner. Credit to Wasserman again:

Pennypacker: You see, physics has really advanced. All those quantities I estimated have now been measured to great precision. Of those thousands of 95 percent intervals, only 3 percent contained the true values! They concluded I was a fraud.

van Nostrand: Pennypacker you fool. I never said those intervals would contain the truth 95 percent of the time. I guaranteed coherence, not coverage!

Pennypacker: A lot of good that did me. I should have gone to that objective Bayesian statistician. At least he cares about the frequentist properties of his procedures.

van Nostrand: Well I’m sorry you feel that way, Pennypacker. But I can’t be responsible for your incoherent colleagues. I’ve had enough now. Be on your way.

There’s often good reason to advocate a correct theory over a wrong one. But all this evidence (ahem) shows that switching to Guardian of Truth mode was, at the very least, premature for me. Bayes isn’t the correct theory for drawing conclusions about the world. As of today, we have no coherent theory for drawing conclusions about the world. Both perspectives have serious problems. So do yourself a favor and switch to truth-seeker mode.

• Hypothesis testing: I give you a black-box random distribution and claim it obeys a specified formula. You sample some data from the box and inspect it. Frequentism often allows you to call me a liar and be wrong no more than 10% of the time, guaranteed, no priors in sight.

Wrong. If all black boxes do obey their specified formulas, then every single time you call the other person a liar, you will be wrong. P(wrong|”false”) ~ 1.

I’m thinking you still haven’t quite understood here what frequentist statistics do.

It’s not perfectly reliable. They assume they have perfect information about experimental setups and likelihood ratios. (Where does this perfect knowledge come from? Can Bayesians get their priors from the same source?)

A Bayesian who wants to report something at least as reliable as a frequentist statistic simply reports a likelihood ratio between two or more hypotheses from the evidence; and in that moment has told another Bayesian just what frequentists think they have perfect knowledge of, but simply, with far less confusion and error and mathematical chicanery and opportunity for distortion, and greater ability to combine the results of multiple experiments.

And more importantly, we understand what likelihood ratios are, and that they do not become posteriors without adding a prior somewhere.

• Thanks for the catch; struck out that part.

Yes, you can get your priors from the same source they get experimental setups: the world. Except this source doesn’t provide priors.

ETA: likelihood ratios don’t seem to communicate the same info about the world as confidence intervals to me. Can you clarify?

• Wrong. If all black boxes do obey their specified formulas, then every single time you call the other person a liar, you will be wrong. P(wrong|”false”) ~ 1.

Ok, bear with me. cousin_it’s claim was that P(wrong|boxes-obey-formulas) <= .1, am I right? I get that P(wrong|”false” & boxes-obey-formulas) ~ 1, so the denial of cousin_it’s claim seems to require P(”false”|boxes-obey-formulas) > .1? I assumed that the point was precisely that the frequentist procedure will give you P(”false”|boxes-obey-formulas) <= .1. Is that wrong?

• My claim was what Eliezer said, and it was incorrect. Other than that, your comment is correct.

• Ah, I parsed it wrongly. Whoops. Would it be worth replacing it with a corrected claim rather than just striking it?

• Done. Thanks for the help!

• a good Bayesian must never be uncertain about the probability of any future event

• Also, didn’t we already cover meta-uncertainty here?

• Shalizi says “Bayesian agents never have the kind of uncertainty that Rebonato (sensibly) thinks people in finance should have”. My guess is that this means (something that could be described as) uncertainty as to how well-calibrated one is, which AFAIK hasn’t been explicitly covered here.

• Yup. Shalizi’s point is that once you’ve taken meta-uncertainty into account (by marginalizing over it), you have a precise and specific probability distribution over outcomes.
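A tiny sketch of that marginalization (my own illustrative numbers): suppose the meta-uncertainty about a coin’s bias p is encoded as p ~ Beta(3, 7). Integrating it out still leaves one exact predictive number for the next flip.

```python
import numpy as np

# Meta-uncertainty about a coin's bias p, encoded as p ~ Beta(3, 7)
# (an arbitrary illustrative choice). Marginalizing it out leaves a
# single sharp number for the next flip: E[p] = 3 / (3 + 7) = 0.3.
rng = np.random.default_rng(3)
p_draws = rng.beta(3, 7, size=200000)   # samples of the unknown bias
predictive = p_draws.mean()             # P(heads) after marginalizing

print(predictive)  # ~0.3: one precise probability, despite uncertainty about p
```

The uncertainty about p shows up only in how fast that 0.3 would move under new evidence, not in the predictive probability itself.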

• Well, yes. You have to bet at some odds. You’re in some particular state of uncertainty and not a different one. I suppose the game is to make people think that being in some particular state of uncertainty corresponds to claiming to know too much about the problem? The ignorance is shown in the instability of the estimate—the way it reacts strongly to new evidence.

• I’m with you on this one. What Shalizi is criticizing is essentially a consequence of the desideratum that a single real number shall represent the plausibility of an event. I don’t think the methods he’s advocating dispense with the desideratum, so I view this as a delicious bullet-shaped candy that he’s convinced is a real bullet and is attempting to dodge.

• I think what Shalizi means is that a Bayesian model is never “wrong”, in the sense that it is a true description of the current state of the ideal Bayesian agent’s knowledge. I.e., if A says an event X has probability p, and B says X has probability q, then they aren’t lying even if p != q. And the ideal Bayesian agent updates that knowledge perfectly by Bayes’ rule (where knowledge is defined as probability distributions over states of the world). In this case, if A and B talk with each other then they should probably update, of course.

In frequentist statistics the paradigm is that one searches for the ‘true’ model by looking through a space of ‘false’ models. In this case, if A says X has probability p and B says X has probability q != p, then at least one of them is wrong.

• Can you give a detailed numerical example of some problem where the Bayesian and the frequentist give different answers, and you feel strongly that the frequentist’s answer is better somehow?

I think you’ve tried to do that, but I don’t fully understand most of your examples. Perhaps if you used numbers and equations, that would help a lot of people understand your point. Maybe expand on your “And here’s an ultra-short example of what frequentists can do” idea?

• Short answer: Bayesian answers don’t give coverage guarantees.

Long answer: see the comments to Cyan’s post.

• “Coverage guarantees” is a frequentist concept. Can you explain where Bayesians fail by Bayesian lights? In the real world, somewhere?

• How about this: a Bayesian will always predict that she is perfectly calibrated, even though she knows the theorems proving she isn’t.

• A Bayesian will have a probability distribution over possible outcomes, some of which give her lower scores than her probabilistic expectation of average score, and some of which give her higher scores than this expectation.

I am unable to parse your above claim, and ask for specific math on a specific example. If you know your score will be lower than you expect, you should lower your expectation. If you know something will happen less often than the probability you assign, you should assign a lower probability. This sounds like an inconsistent epistemic state for a Bayesian to be in.

• I spent some time looking up papers, trying to find accessible ones. The main paper that kicked off the matching-prior program is Welch and Peers, 1963, but you need access to JSTOR.

The best I can offer is the following example. I am estimating a large number of positive estimands. I have one noisy observation for each one; the noise is Gaussian with standard deviation equal to one. I have no information relating the estimands; per Jaynes, I give them independent priors, resulting in independent posteriors*. I do not have information justifying a proper prior. Let’s say I use a flat prior over the positive real line. No matter the true value of each estimand, the sampling probability of the event “my posterior 90% quantile is greater than the estimand” is less than 0.9 (see Figure 6 of this working paper by D.A.S. Fraser). So the more estimands I analyze, the more sure I am that the intervals from 0 to my posterior 90% quantiles will contain less than 90% of the estimands.

I don’t know if there’s an exact matching prior in this problem, but I suspect it lacks the correct structure.

* This is a place I think Jaynes goes wrong: the quantities are best modeled as exchangeable, not independent. Equivalently, I put them in a hierarchical model. But this only kicks the problem of priors guaranteeing calibration up a level.

• I’m sorry, but the level of frequentist gibberish in this paper is larger than I would really like to work through.

If you could be so kind, please state:

What the Bayesian is using as a prior and likelihood function;

and what distribution the paper assumes the actual parameters are being drawn from, and what the real causal process is governing the appearance of evidence.

If the two don’t match, then of course the Bayesian posterior distributions, relative to the experimenter’s higher knowledge, can appear poorly calibrated.

If the two do match, then the Bayesian should be well-calibrated. Sure looks QED-ish to me.

• The example doesn’t come from the paper; I made it myself. You only need to believe the figure I cited—don’t bother with the rest of the paper.

Call the estimands mu_1 to mu_n; the data are x_1 to x_n. The prior over the mu parameters is flat in the positive subset of R^n, zero elsewhere. The sampling distribution for x_i is Normal(mu_i, 1). I don’t know the distribution the parameters actually follow. The causal process is irrelevant—I’ll stipulate that the sampling distribution is known exactly.

Call the 90% quantiles of my posterior distributions q_i. From the sampling perspective, these are random quantities, being monotonic functions of the data. Their sampling distributions satisfy the inequality Pr(q_i > mu_i | mu_i) < 0.9. (This is what the figure I cited shows.) As n goes to infinity, I become more and more sure that my posterior intervals of the form (0, q_i] are undercalibrated.

You might cite the improper prior as the source of the problem. However, if the parameter space were unrestricted and the prior flat over all of R^n, the posterior intervals would be correctly calibrated.

But it really is fair to demand a proper prior. How could we determine that prior? Only by Bayesian updating from some pre-prior state of information to the prior state of information (or equivalently, by logical deduction, provided that the knowledge we update on is certain). Right away we run into the problem that Bayesian updating does not have calibration guarantees in general (and for this, you really ought to read the literature), so it’s likely that any proper prior we might justify does not have a calibration guarantee.

• How about this: a Bayesian will always predict that she is perfectly calibrated, even though she knows the theorems proving she isn’t.

Wanna bet? Literally. Have a Bayesian make a whole bunch of predictions and then offer her bets with payoffs based on what apparent calibration the results will reflect. See which bets she accepts and which she refuses.

• Are you volunteering?

• Sure. :)

But let me warn you… I actually predict my calibration to be pretty darn awful.

• We need a trusted third party.

• Find a candidate.

I was about to suggest we could just bet raw ego points by publicly posting here… but then I realised I’d prove my point just by playing.

It should be obvious, by the way, that if the predictions you have me make pertain to black boxes that you construct, then I would only bet if the odds gave a money pump. There are few cases in which I would expect my calibration to be superior to what you could predict with complete knowledge of the distribution.

• It should be obvious, by the way, that if the predictions you have me make pertain to black boxes that you construct, then I would only bet if the odds gave a money pump.

Phooey. There goes plan A.

• ;)

• Plan B involves trying to use some nasty posterior inconsistency results, so don’t think you’re out of the woods yet.

• I am convinced in full generality that being offered the option of a bet can only provide utility >= 0. So if the punch line is ‘insufficiently constrained rationality’ then yes, the joke is on me!

And yes, I suspect trying to get my head around that paper would (will) be rather costly! I’m a goddam programmer. :P

• I volunteer, if y’all tell me what to do.

• I volunteer.

• I think this is incorrect. A Bayesian doesn’t predict a variance of zero on their calibration calculated ten samples later.

• Of course not. If you choose to care only about the things Bayes can give you, it’s a mathematical fact that you can’t do better.

• I didn’t like the “by Bayesian lights” phrase either. What I take as the relevant part of the question is this:

Can you provide an example of a frequentist concept that can be used to make predictions in the real world for which a Bayesian prediction will fail?

“Bayesian answers don’t give coverage guarantees” doesn’t demonstrate anything by itself. The question is: could the application of Bayes give a prediction equal to or superior to the prediction about the real world implicit in a coverage guarantee?

If you can provide such an example, then you will have proved many people to be wrong in a significant, fundamental way. But I haven’t seen anything in this thread or in either of Cyan’s which fits that category.

• Once again: the real-world performance (as opposed to internal coherence) of the Bayesian method on any given problem depends on the prior you choose for that problem. If you have a well-calibrated prior, Bayes gives well-calibrated results equal or superior to any frequentist methods. If you don’t, science knows no general way to invent a prior that will reliably yield results superior to anything at all, not just frequentist methods. For example, Jaynes spent a large part of his life searching for a method to create uninformative priors with maxent, but maxent still doesn’t guarantee you anything beyond “cross your fingers”.

• If your prior is screwed up enough, you’ll also misunderstand the experimental setup and the likelihood ratios. Frequentism depends on prior knowledge just as much as Bayesianism; it just doesn’t have a good formal way of treating it.

• I give you some numbers taken from a normal distribution with unknown mean and variance. If you’re a frequentist, your honest estimate of the mean will be the sample mean. If you’re a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above—and you don’t have the option of skipping that step, and don’t have the option of devising a prior that will always exactly match the frequentist conclusion, because math doesn’t allow it in the general case. (I kinda equivocate on “honest estimate”, but refusing to ever give point estimates doesn’t speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not “just as much”.

If tomorrow Bayesians find a good formalization of “uninformative prior” and a general formula to devise them, you’ll happily discard your old bullshit prior and go with the flow, thus admitting that your careful analysis of my words about “unknown normal distribution” today wasn’t relevant at all. This is the most fishy part IMO.

(Disclaimer: I am not a crazy-convinced frequentist. I’m a newbie trying to get good answers out of Bayesians, and some of the answers already given in these threads satisfy me perfectly well.)

• The normal distribution with unknown mean and variance was a bad choice for this example. It’s the one case where everyone agrees what the uninformative prior is. (It’s flat with respect to the mean and the log-variance.) This uninformative prior is also a matching prior—posterior intervals are confidence intervals.
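The matching claim can be checked numerically. Under the prior flat in (mean, log-variance), the posterior draws can be simulated as sigma^2 ~ (n-1)s^2/chi2_{n-1} and then mu ~ N(x-bar, sigma^2/n), and the central posterior interval for mu lands on the frequentist t-interval. A sketch with arbitrary example data (the dataset and interval level are my own choices):

```python
import numpy as np
from scipy import stats

data = np.array([1.2, 3.4, 2.2, 5.0, 0.7])   # arbitrary example data
n, xbar, s = len(data), data.mean(), data.std(ddof=1)

# Posterior sampling under the prior flat in (mean, log-variance):
# sigma^2 | data ~ (n-1) s^2 / chi2_{n-1}, then mu | sigma^2 ~ N(xbar, sigma^2/n)
rng = np.random.default_rng(4)
sigma2 = (n - 1) * s**2 / rng.chisquare(n - 1, size=200000)
mu = rng.normal(xbar, np.sqrt(sigma2 / n))

bayes_interval = np.quantile(mu, [0.05, 0.95])   # central 90% posterior interval
freq_interval = stats.t.interval(0.90, n - 1, loc=xbar,
                                 scale=s / np.sqrt(n))  # frequentist 90% t-interval

print(bayes_interval)  # the two agree up to Monte Carlo noise
print(freq_interval)
```

The agreement is no accident: the marginal posterior of mu under this prior is exactly the Student-t distribution that the frequentist interval is built from.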

• I didn’t know that was possible, thanks. (Wow, a prior with integral = infinity! One that can’t be reached as a posterior after any observation! How’d a Bayesian come by that? But it seems to work regardless.) What would be a better example?

ETA: I believe the point raised in that comment still deserves an answer from Bayesians.

• ETA: I believe the point raised in that comment still deserves an answer from Bayesians.

Done, but I think a more useful reply could be given if you provided an actual worked example where a frequentist tool leads you to make a different prediction than the application of Bayes would (and where you prefer the frequentist prediction). Something with numbers in it, and with the frequentist prediction provided.

• Here’s one. There is one data point, distributed according to 0.5*N(0,1) + 0.5*N(mu,1).

Bayes: any improper prior for mu yields an improper posterior (because there’s a 50% chance that the data are not informative about mu). Any proper prior has no calibration guarantee.

Frequentist: Neyman’s confidence belt construction guarantees valid confidence coverage of the resulting interval. If the datum is close to 0, the interval may be the whole real line. This is just what we want [claims the frequentist, not me!]; after all, when the datum is close to 0, mu really could be anything.

• Can you explain the terms “calibration guarantee”, and what “the resulting interval” is? Also, I don’t understand why you say there is a 50% chance the data is not informative about mu. This is not a multimodal distribution; it is blended from N(0,1) and N(mu,1). If mu can be any positive or negative number, then the one data point will tell you whether mu is positive or negative with probability 1.

• Can you explain the terms “calibration guarantee”...

By “calibration guarantee” I mean valid confidence coverage: if I give a number of intervals at a stated confidence, then the relative frequency with which the estimated quantities fall within their intervals is guaranteed to approach the stated confidence as the number of estimated quantities grows. Here we might imagine a large number of mu parameters and one datum per parameter.

… and what “the resulting interval” is?

Not easily. The second cousin of this post (a reply to wedrifid) contains a link to a paper on arXiv that gives a bare-bones overview of how confidence intervals can be constructed on page 3. When you’ve got that far I can tell you what interval I have in mind.

Also, I don’t understand why you say there is a 50% chance the data is not informative about mu. This is not a multimodal distribution; it is blended from N(0,1) and N(mu,1).

I think there’s been a misunderstanding somewhere. Let Z be a fair coin toss. If it comes up heads the datum is generated from N(0,1); if it comes up tails, the datum is generated from N(mu,1). Z is unobserved and mu is unknown. The probability distribution of the datum is as stated above. It will be multimodal if the absolute value of mu is greater than 2 (according to some quick plots I made; I did not do a mathematical proof).

If mu can be any positive or negative number, then the one data point will tell you whether mu is positive or negative with probability 1.

If I observe the datum 0.1, is mu greater than or less than 0?

• Thanks Cyan.

I’ll get back to you when (and if) I’ve had time to get my head around Neyman’s confidence belt construction, with which I’ve never had cause to acquaint myself.

• This paper has a good explanation. Note that I’ve left one of the steps (the “ordering” that determines inclusion into the confidence belt) undetermined. I’ll tell you the ordering I have in mind if you get to the point of wanting to ask me.

• That’s a lot of integration to get my head around.

• All you need is page 3 (especially the figure). If you understand that in depth, then I can tell you what the confidence belt for my problem above looks like. Then I can give you a simulation algorithm and you can play around and see exactly how confidence intervals work and what they can give you.

• It’s called an improper prior. There’s been some argument about their use, but they seldom lead to problems. The posteriors usually have much better behavior at infinity, and when they don’t, that’s the theory telling us that the information doesn’t determine the solution to the problem.

The observation that an improper prior cannot be obtained as a posterior distribution is kind of trivial. It is meant to represent a total lack of information w.r.t. some parameter. As soon as you have made an observation, you have more information than that.

• Maybe the difference lies in the format of answers?

• We know: a set of n outputs of a random number generator with normal distribution. Say {3.2, 4.5, 8.1}.

• We don’t know: mean m and variance v.

• Your proposed answer: m = 5.26, v = 6.44.

• A Bayesian’s answer: a probability distribution P(m) of the mean and another distribution Q(v) of the variance.

How does a frequentist get them? If he hasn’t got them, what’s his confidence in m = 5.26 and v = 6.44? What if the set contains only one number—what is the frequentist’s estimate for v? Note that a Bayesian has no problem even if the data set is empty; he simply rests with his priors. If the data set is large, the Bayesian’s answer will inevitably converge to a delta function around the frequentist’s estimate, no matter what the priors are.

• http://www.xuru.org/st/DS.asp

50% confidence interval for mean: 4.07 to 6.46, stddev: 2.15 to 4.74

90% confidence interval for mean: 0.98 to 9.55, stddev: 1.46 to 11.20

If there’s only one sample, the calculation fails due to division by n-1, so the frequentist says “no answer”. The Bayesian says the same if he used the improper prior Cyan mentioned.
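The quoted intervals can be reproduced with the standard t-interval for the mean and the chi-square interval for the standard deviation. A sketch (the scipy usage is mine, not from the thread):

```python
import numpy as np
from scipy import stats

data = np.array([3.2, 4.5, 8.1])   # the sample from the thread
n, xbar, s2 = len(data), data.mean(), data.var(ddof=1)

print(xbar, s2)   # point estimates: ~5.27 and ~6.44, as quoted above

# 90% t-interval for the mean:
half = stats.t.ppf(0.95, n - 1) * np.sqrt(s2 / n)
mean_lo, mean_hi = xbar - half, xbar + half     # ~0.98 to ~9.55

# 90% chi-square interval for the standard deviation:
sd_lo = np.sqrt((n - 1) * s2 / stats.chi2.ppf(0.95, n - 1))  # ~1.46
sd_hi = np.sqrt((n - 1) * s2 / stats.chi2.ppf(0.05, n - 1))  # ~11.20

print(mean_lo, mean_hi, sd_lo, sd_hi)
```

With only three points and two degrees of freedom, the 90% intervals are enormous, which is exactly what the thread’s numbers show.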

• Hm, should I understand it that the frequentist assumes a normal distribution of the mean value with peak at the estimated 5.26?

If so, then frequentism = Bayes + flat prior.

Improper priors are however quite tricky; they may lead to paradoxes such as the two-envelope paradox.

• The prior for variance that matches the frequentist conclusion isn’t flat. And even if it were, a flat prior for variance implies a non-flat prior for standard deviation and vice versa. :-)

• In this problem, yes. In the general case no one knows exactly what the flat prior is, e.g. if there are constraints on model parameters.

• Using the flat improper prior I was talking about before, when there’s only one data point the posterior distribution is improper, so the Bayesian answer is the same as the frequentist’s.

• Yep, I know that. Woohoo, an improper prior!

• I give you some numbers taken from a normal distribution with unknown mean and variance. If you’re a frequentist, your honest estimate of the mean will be the sample mean. If you’re a Bayesian, it will be some number off to the side, depending on whatever bullshit prior you managed to glean from my words above—and you don’t have the option of skipping that step, and don’t have the option of devising a prior that will always exactly match the frequentist conclusion, because math doesn’t allow it in the general case. (I kinda equivocate on “honest estimate”, but refusing to ever give point estimates doesn’t speak well of a mathematician anyway.) So nah, Bayesianism depends on priors more, not “just as much”.

A Bayesian does not have the option of ‘just skipping that step’ and choosing to accept whichever prior was mandated by Fisher (or whichever other statistician created or insisted upon the use of the particular tool in question). It does not follow that the Bayesian is relying on ‘bullshit’ more than the frequentist. In fact, when I use the label ‘bullshit’ I usually mean ‘the use of authority or social power mechanisms in lieu of or in direct defiance of reason’. I obviously apply ‘bullshit prior’ to the frequentist option in this case.

• A Bayesian does not have the option of ‘just skipping that step’ and choosing to accept whichever prior was mandated by Fisher

Why in the world doesn’t a Bayesian have that option? I thought you were a free people. :-) How’d you decide to reject those priors in favor of other ones, anyway? As far as I currently understand, there’s no universally accepted mathematical way to pick the best prior for every given problem, and no psychologically coherent way to pick it out of your head either, because it ain’t there. In addition to that, here’s some anecdotal evidence: I never ever heard of a Bayesian agent accepting or rejecting a prior.

• That was a par­tial quote and par­tial para­phrase of the claim made by cousin_it (hang on, that’s you! huh?). I thought that the “we are a free peo­ple and can use the fre­quen­tist im­plicit pri­ors when­ever they hap­pen to be the best available” claim had been made more than enough times so I left off that nit­pick and fo­cussed on my core gripe with the post in ques­tion. That is, the sug­ges­tion that us­ing pri­ors be­cause tra­di­tion tells you to makes them less ‘bul­lshit’.

I think your inclusion of ‘just’ allows for the possibility that of all possible configurations of prior probabilities the frequentist one so happens to be the one worth choosing.

I never ever heard of a Bayesian agent ac­cept­ing or re­ject­ing a prior.

I’m con­fused. What do you mean by ac­cept­ing or re­ject­ing a prior?

• Funny as it is, I don’t con­tra­dict my­self. A Bayesian doesn’t have the op­tion of skip­ping the prior al­to­gether, but does have the op­tion of pick­ing pri­ors with fre­quen­tist jus­tifi­ca­tions, which op­tion you call “bul­lshit”, though for the life of me I can’t tell how you can tell.

Fre­quen­tists have valid rea­sons for their pro­ce­dures be­sides tra­di­tion: the pro­ce­dures can be shown to always work, in a cer­tain sense. On the other hand, I know of no Bayesian-prior-gen­er­at­ing pro­ce­dure that can be shown to work in this sense or any other sense.

I’m con­fused. What do you mean by ac­cept­ing or re­ject­ing a prior?

Some pri­ors are very bad. If a Bayesian some­how ends up with such a prior, they’re SOL be­cause they have no no­tion of re­ject­ing pri­ors.

• Some pri­ors are very bad. If a Bayesian some­how ends up with such a prior, they’re SOL be­cause they have no no­tion of re­ject­ing pri­ors.

There are two priors for A that a Bayesian is unable to update from: p(A) = 0 and p(A) = 1. If a Bayesian ever assigns p(A) = 0 or p(A) = 1 and is mistaken, then they fail at life. No second chances. Shalizi’s hypothetical agent started with the absolute (and insane) belief that the distribution was not a mix of the two Gaussians in question. That did not change through the application of Bayes’ rule.

Bayesians cannot reject a prior of 0. They can ‘reject’ a prior of “That’s definitely not going to happen. But if I am faced with overwhelming evidence then I may change my mind a bit.” They just wouldn’t write that state as p = 0, or imply it through excluding it from a simplified model without being willing to review the model for sanity afterward.
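A small numerical sketch of that trap (the hypothesis labels and likelihood numbers are made up for illustration): under Bayes’ rule a prior of exactly zero is unrecoverable, while even an absurdly small nonzero prior is not.

```python
def bayes_update(prior, likelihood):
    """Bayes' rule over a discrete hypothesis space:
    posterior is proportional to prior times likelihood, renormalized."""
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

# Hypothesis A starts at probability exactly 0.
posterior = bayes_update({"A": 0.0, "B": 1.0}, {"A": 1e12, "B": 1.0})
assert posterior["A"] == 0.0   # no evidence can ever revive it

# A merely tiny prior, by contrast, recovers under strong evidence.
posterior = bayes_update({"A": 1e-9, "B": 1.0 - 1e-9}, {"A": 1e12, "B": 1.0})
assert posterior["A"] > 0.99
```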

• I am try­ing to un­der­stand the ex­am­ples on that page, but they seem strange; shouldn’t there be a model with pa­ram­e­ters, and a prior dis­tri­bu­tion for those pa­ram­e­ters? I don’t un­der­stand the in­fer­ences. Can some­one ex­plain?

• Well, the first ex­am­ple is a model with a sin­gle pa­ram­e­ter. Roughly speak­ing, the Bayesian ini­tially be­lieves that the true model is ei­ther a Gaus­sian around 1, or a Gaus­sian around −1. The ac­tual dis­tri­bu­tion is a mix of those two, so the Bayesian has no chance of ever ar­riv­ing at the truth (the prior for the truth is zero), in­stead be­com­ing over time more and more com­i­cally over­con­fi­dent in one of the ini­tial pre­pos­ter­ous be­liefs.
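Shalizi’s first example is easy to reproduce numerically. This sketch assumes exactly the setup just described (hypothesis space {N(+1,1), N(-1,1)}, even prior odds, data from an even mixture of the two); the per-point log-likelihood ratio works out to 2x, so the posterior log-odds perform a mean-zero random walk and drift arbitrarily far from even odds:

```python
import math
import random

def posterior_p_plus(data):
    """P(mu = +1 | data) for an agent whose entire hypothesis space is
    {N(+1, 1), N(-1, 1)} with even prior odds. The per-point
    log-likelihood ratio log[N(x; +1, 1) / N(x; -1, 1)] equals 2x."""
    log_odds = sum(2.0 * x for x in data)
    if log_odds > 700:
        return 1.0
    if log_odds < -700:
        return 0.0
    return 1.0 / (1.0 + math.exp(-log_odds))

extreme_runs = 0
for seed in range(10):
    rng = random.Random(seed)
    # True process: an even mixture of the two Gaussians -- a hypothesis
    # the agent assigned prior probability zero.
    data = [rng.gauss(rng.choice((1.0, -1.0)), 1.0) for _ in range(100_000)]
    p = posterior_p_plus(data)
    if max(p, 1.0 - p) > 0.99:
        extreme_runs += 1

# In nearly every run the agent ends up more than 99% sure of one of
# the two preposterous point hypotheses.
assert extreme_runs >= 8
```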

• Vo­cab­u­lary nit­pick: I be­lieve you wrote “in luew of” in lieu of “in lieu of”.

Sorry, couldn’t help it. IAWYC, any­how.

• Damn that word and its ex­ces­sive vow­els!

• Can some­one do some­thing I’ve never seen any­one do—lay out a sim­ple ex­am­ple in which the Bayesian and fre­quen­tist ap­proaches give differ­ent an­swers?

• I’ve had some training in Bayesian and frequentist statistics and I think I know enough to say that it would be difficult to give a “simple” and satisfying example. The reason is that if one is dealing with finite-dimensional statistical models (this is where the parameter space of the model is finite-dimensional) and one has chosen a prior for those parameters such that there is non-zero weight on the true values then the Bernstein-von Mises theorem guarantees that the Bayesian posterior distribution and the maximum likelihood estimate converge to the same probability distribution (although you may need to use improper priors). This covers cases where we consider finite outcomes such as a toss of a coin or rolling a die.
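A toy illustration of that agreement (only for point estimates, not the full distributional statement of the theorem; the Beta(2, 5) prior is an arbitrary choice): in a coin-flip model, the posterior mean and the maximum-likelihood estimate close in on each other as the sample grows.

```python
from fractions import Fraction

def estimates(heads, n, a=2, b=5):
    """Coin-flip model: MLE heads/n vs. the posterior mean under a
    Beta(a, b) prior, which is (a + heads) / (a + b + n)."""
    return Fraction(heads, n), Fraction(a + heads, a + b + n)

# Hold the empirical frequency at exactly 0.3 while n grows,
# to keep the demonstration deterministic.
gaps = []
for n in (10, 100, 10_000):
    mle, post = estimates(3 * n // 10, n)
    gaps.append(abs(mle - post))

assert gaps[0] > gaps[1] > gaps[2]   # the disagreement shrinks...
assert gaps[2] < Fraction(1, 1000)   # ...and becomes negligible
```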

I apologize if that’s too much jargon, but for really simple models that are easy to specify you tend to get the same answer. Bayesian stats starts to behave differently from frequentist statistics in noticeable ways when you consider infinite outcome spaces. An example here might be where you are considering probability distributions over curves (this arises in my research on speech recognition). In this case, even if you have a seemingly sensible prior, you can end up in the situation where, in the limit of infinite data, the posterior distribution differs from the true distribution.

In practice, if I am learning a Gaussian mixture model for speech curves and I don’t have much data, then Bayesian procedures tend to be a bit more robust and frequentist procedures end up over-fitting (or being somewhat random). When I start getting more data, frequentist methods tend to be algorithmically more tractable and get better results. So I’ll end up with faster computation time, and on a task like phoneme recognition I’ll make fewer errors.

I’m sorry if I haven’t explained it well; the difference in performance wasn’t really evident to me until I spent some time actually using both in machine learning. Unfortunately, most of the disadvantages of Bayesian approaches aren’t evident for simple statistical problems, but they become all too evident in the case of complex statistical models.

• Thanks much!

and one has cho­sen a prior for those pa­ram­e­ters such that there is non-zero weight on the true val­ues then the Bern­stein-von Mises the­o­rem guaran­tees that the Bayesian pos­te­rior dis­tri­bu­tion and the max­i­mum like­li­hood es­ti­mate con­verge to the same prob­a­bil­ity dis­tri­bu­tion (al­though you may need to use im­proper pri­ors)

What do “non-zero weight” and “im­proper pri­ors” mean?

EDIT: Im­proper pri­ors mean pri­ors that don’t sum to one. I would guess “non-zero weight” means “non-zero prob­a­bil­ity”. But then I would won­der why any­one would in­tro­duce the term “weight”. Per­haps “weight” is the term you use to ex­press a value from a prob­a­bil­ity den­sity func­tion that is not it­self a prob­a­bil­ity.

• No prob­lem.

Im­proper pri­ors are gen­er­ally only con­sid­ered in the case of con­tin­u­ous dis­tri­bu­tions so ‘sum’ is prob­a­bly not the right term, in­te­grate is usu­ally used.

I used the term ‘weight’ to signify an integral because of how I usually intuit probability measures. Say you have a random variable X that takes values in the real line; the probability that it takes a value in some subset S of the real line would be the integral over S with respect to the given probability measure.

There’s a good dis­cus­sion of this way of view­ing prob­a­bil­ity dis­tri­bu­tions in the wikipe­dia ar­ti­cle. There’s also a fan­tas­tic text­book on the sub­ject that re­ally has made a world of differ­ence for me math­e­mat­i­cally.

• I didn’t mean to re­ha­bil­i­tate fre­quen­tism! I only meant to point out that cal­ibra­tion is a fre­quen­tist op­ti­mal­ity crite­rion, and one that Bayesian pos­te­rior in­ter­vals can be proved not to have in gen­eral. I view this as a bul­let to be bit­ten, not dodged.

• It’s out of your hands now. Over­com­ing Bayes!

• I had an­other thought on the sub­ject. Con­sider flip­ping a coin; a Bayesian says that the 50% es­ti­mate of get­ting tails is just your own in­abil­ity to pre­dict with suffi­cient ac­cu­racy; a fre­quen­tist says that the 50% is a prop­erty of the coin—or to be less straw-mak­ing about it, a prop­erty of large sets of in­dis­t­in­guish­able coin-flips. So, ok, in prin­ci­ple you could build a coin-pre­dic­tor and re­move the un­cer­tainty. But now con­sider an elec­tron pass­ing through a beam split­ter. Here there is no method even in prin­ci­ple of pre­dict­ing which Everett branch you find your­self in. (Given some rea­son­able as­sump­tions about lo­cal­ity and such.) The coin has hid­den vari­ables like the pre­cise lo­ca­tion of your thumb and the ex­act force your mus­cles ap­ply to it; if you were smart enough, you could tease a pre­dic­tion out of them. But an elec­tron has no such hid­den prop­er­ties. Is it not rea­son­able, then, to say that the 50% chance re­ally is a prop­erty of the elec­tron, and not the pre­dic­tor?

• The rele­vant prop­erty of the elec­tron+beam­split­ter(+ev­ery­thing else) sys­tem is that its wave­func­tion will be evenly split be­tween the two Everett branches. No chance in­volved. 50% is how much I care about each branch.

And af­ter perform­ing the ex­per­i­ment but be­fore look­ing at the re­sult, I can con­tinue us­ing the same rea­son­ing: “I have already de­co­hered, but what­ever de­ter­minis­tic de­ci­sion al­gorithm I ap­ply now will re­turn the same an­swer in both branches, so I can and should op­ti­mize both out­comes at once.” Or I can switch to in­dex­i­cal un­cer­tainty: “I am un­cer­tain about which in­stance I am, even though I know the state of the uni­verse with cer­tainty.” Th­ese two meth­ods should be equiv­a­lent.

If we ever do find some non­de­ter­minis­tic phys­i­cal law, then you can have your prob­a­bil­ity as a fun­da­men­tal prop­erty of par­ti­cles. Maybe. I’m not sure how one would ex­per­i­men­tally dis­t­in­guish “one stochas­tic world” from “branch both ways” or from “se­cure pseudo-ran­dom num­ber gen­er­a­tor” in the ab­sence of any in­terfer­ence pat­tern to have a pre­cise the­ory of; but I’m not go­ing to spec­u­late here about what physi­cists can or can’t learn.

• I be­lieve the an­swer to this ques­tion is cur­rently “we don’t know”. But no­tice that “the elec­tron” doesn’t ex­ist, it’s a pat­tern (“just” a pat­tern? :)) in the wave­func­tion. A pat­tern which hap­pens to oc­cur in lots of places, so we call it an elec­tron.

My in­tu­ition, IANAP, is that if any­thing it is more nat­u­ral to say the 50% be­longs some­how to which branch you find your­self in, not the pat­tern in the wave­func­tion we call an elec­tron.

• Ok, but I don’t think that mat­ters for the ques­tion of fre­quen­tist ver­sus Bayesian. You’re still say­ing that the 50% is a prop­erty of some­thing other than your own un­cer­tainty.

Mov­ing the prob­lem to lex­i­cal un­cer­tainty seems to me to rely on mov­ing the ques­tion in time; you can only do this af­ter you’ve done the ex­per­i­ment but be­fore you’ve looked at the mea­sure­ment. This feels to me like ask­ing a differ­ent ques­tion.

• Finally, the electron is found at some certain polarisation. You just don’t know which before actually doing the experiment (same as for the coin), and you can’t, even in principle (at least according to the present model of physics—don’t forget that non-local hidden variables are not ruled out), make any observation which tells you the result with more certainty in advance (for the coin you can). So the difference is that the future of a classical system can be predicted with unlimited certainty from its present state, while for a quantum system it cannot. This doesn’t necessarily mean that the future is not determined. One can adopt the viewpoint (I think it was even suggested on OB/​LW in Eliezer’s posts about timeless physics) that the future is symmetric to the past—it exists in the whole history of the universe, and if we don’t know it now, that is our ignorance. I suppose you would agree that not knowing about the electron’s past is a matter of our ignorance rather than a property of the electron itself, without regard to whether we are able to calculate it from presently available information, even in principle (i.e. using present theories).

I also think that it has lit­tle merit to en­gage in dis­cus­sions about ter­minol­ogy and this one tends in that di­rec­tion. Prac­ti­cally there’s no differ­ence be­tween say­ing that quan­tum prob­a­bil­ities are “prop­er­ties of the sys­tem” or “of the pre­dic­tor”. Either we can pre­dict, or not, and that’s all what mat­ters. Be­ware of the clause “in prin­ci­ple”, as it of­ten only ob­scures the de­bate.

Edit: to formulate it a little bit differently, predictability is an instance of regularity in the universe, i.e. our ability to compress the data of the whole history of the universe into some brief set of laws and a possibly not so brief set of initial conditions, nevertheless a much smaller amount of information than the history of the universe recorded at each point and time instant. As we do not have this huge pack of information and thus can’t say to what extent it is compressible, we use theories that are based largely on induction, which is itself a particular bias. We don’t know even whether the theories we use apply at any time and place, or for any system universally. Frequentists seem to distinguish this uncertainty—which they largely ignore in practice—from uncertainty as a property of the system. So, as I understand the state of affairs, a frequentist is satisfied with a theory (which is a compression algorithm applicable to the information about the universe) which includes calling the random number generator on some occasions (e.g. when dealing with dice or electrons), and such induced uncertainty he calls a “property of the system”. On the other hand, the uncertainty about the theory itself is a different kind of “meta-uncertainty”.

The Bayesian approach seems to me more elegant (and Occam-razor friendly) as it doesn’t introduce different sorts of uncertainties. It also fits better with the view of physical laws as compression algorithms, as it doesn’t distinguish between data and theories with regard to their uncertainty. One may just accept that the history of the universe needn’t be compressible to data available at the moment, and use induction to estimate future states of the world in the same way as one estimates the limits of validity of presently formulated physical laws.

• That’s what Jaynes did to achieve his awe­some vic­to­ries: use trained in­tu­ition to pick good pri­ors by hand on a per-sam­ple ba­sis.

… as if ap­ply­ing the clas­si­cal method doesn’t re­quire us­ing trained in­tu­ition to use the “right” method for a par­tic­u­lar kind of prob­lem, which amounts to choos­ing a prior but do­ing it im­plic­itly rather than ex­plic­itly …

Our in­fer­ence is con­di­tional on our as­sump­tions [for ex­am­ple, the prior P(Lambda)]. Crit­ics view such pri­ors as a difficulty be­cause they are `sub­jec­tive’, but I don’t see how it could be oth­er­wise. How can one perform in­fer­ence with­out mak­ing as­sump­tions? I be­lieve that it is of great value that Bayesian meth­ods force one to make these tacit as­sump­tions ex­plicit.

MacKay, Information Theory, Inference, and Learning Algorithms

• Fre­quen­tist meth­ods of­ten have math­e­mat­i­cal jus­tifi­ca­tions, so Bayesian pri­ors should have them too.

• Since we’re dis­cussing (among other things) non­in­for­ma­tive pri­ors, I’d like to ask: does any­one know of a de­cent (non­in­for­ma­tive) prior for the space of sta­tion­ary, bidi­rec­tion­ally in­finite se­quences of 0s and 1s?

Of course in any prac­ti­cal in­fer­ence prob­lem it would be pointless to con­sider the in­finite joint dis­tri­bu­tion, and you’d only need to con­sider what hap­pens for a finite chunk of bits, i.e. a higher-or­der Markov pro­cess, de­scribed by a bunch of pa­ram­e­ters (prob­a­bil­ities) which would need to satisfy some lin­ear in­equal­ities. So it’s easy to find a prior for the space of mth-or­der Markov pro­cesses on {0,1}; but these ob­vi­ous (uniform) pri­ors aren’t co­her­ent with each other.

I sup­pose it’s pos­si­ble to nor­mal­ize these pri­ors so that they’re co­her­ent, but that seems to re­sult in much ugli­ness. I just won­der if there’s a more el­e­gant solu­tion.

• I sup­pose it de­pends what you want to do, first I would point out that the set is in a bi­jec­tion with the real num­bers (think of two sim­ple in­jec­tions and then use Can­tor–Bern­stein–Schroeder), so you can use any prior over the real num­bers. The fact that you want to look at in­finite se­quences of 0s and 1s seems to im­ply that you are con­sid­er­ing a spe­cific type of prob­lem that would de­mand a very par­tic­u­lar mean­ing of ‘non-in­for­ma­tive prior’. What I mean by that is that any ‘non­in­for­ma­tive prior’ usu­ally in­cor­po­rates some kind of in­var­i­ance: e.g. a uniform prior on [0,1] for a Bernoulli dis­tri­bu­tion is in­var­i­ant with re­spect to the true value be­ing any­where in the in­ter­val.

• The pur­pose would be to pre­dict reg­u­lar­i­ties in a “lan­guage”, e.g. to try to achieve de­cent data com­pres­sion in a way similar to other Markov-chain-based ap­proaches. In terms of prop­er­ties, I can’t think of any non­triv­ial ones, ex­cept the usual im­por­tant one that the prior as­sign nonzero prob­a­bil­ity to ev­ery open set; mainly I’m just try­ing to find some­thing that I can imag­ine com­put­ing with.

It’s true that there ex­ists a bi­jec­tion be­tween this space and the real num­bers, but it doesn’t seem like a very nat­u­ral one, though it does work (it’s mea­surable, etc). I’ll have to think about that one.

• What topol­ogy are you putting on this set?

I made the point about the real num­bers be­cause it shows that putting a non-in­for­ma­tive prior on the in­finite bidi­rec­tional se­quences should be at least as hard as for the real num­bers (which is non-triv­ial).

Usu­ally a reg­u­lar­ity is defined in terms of a par­tic­u­lar com­pu­ta­tional model, so if you picked Tur­ing ma­chines (or the var­i­ant that works with bidi­rec­tional in­finite tape, which is ba­si­cally the same class as in­finite tape in one di­rec­tion), then you could in­stead be­gin con­struct­ing your prior in terms of Tur­ing ma­chines. I don’t know if that helps any.

• Each el­e­ment of the set is char­ac­ter­ized by a bunch of prob­a­bil­ities; for ex­am­ple there is p_01101, which is the prob­a­bil­ity that el­e­ments x_{i+1} through x_{i+5} are 01101, for any i. I was think­ing of us­ing the topol­ogy in­duced by these maps (i.e. gen­er­ated by preimages of open sets un­der them).

How is putting a non­in­for­ma­tive prior on the re­als hard? With the usual re­quired in­var­i­ance, the uniform (im­proper) prior does the job. I don’t mind hav­ing the prior be im­proper here ei­ther, and as I said I don’t know what in­var­i­ance I should want; I can’t think of many in­ter­est­ing group ac­tions that ap­ply. Though of course 0 and 1 should be treated sym­met­ri­cally; but that’s triv­ial to ar­range.

I guess you’re right that reg­u­lar­i­ties can be de­scribed more gen­er­ally with com­pu­ta­tional mod­els; but I ex­pect them to be harder to deal with than this (rel­a­tively) sim­ple, non­com­pu­ta­tional (though stochas­tic) model. I’m not look­ing for reg­u­lar­i­ties among the mod­els, so I’m not sure how a com­pu­ta­tional model would help me.

Now having no reason to do otherwise, I decided to assign each of the 64 sequences a prior probability of 1/64 of occurring. Now, of course, You may think otherwise but that is Your business and not My concern. (I, as a Bayesian, have a tendency to capitalise pronouns but I don’t care what You think. Strictly speaking, as a new convert to subjectivist philosophy, I don’t even care whether you are a Bayesian. In fact it is a bit of a mystery as to why we Bayesians want to convert anybody. But then “We” is in any case a meaningless concept. There is only I and I don’t care whether this digression has confused You.) I then set about acquiring some experience with the coin. Now as De Finetti (vol 1, p. 141) points out, “experience, since experience is nothing more than the acquisition of further information—acts always and only in the way we have just described: suppressing the alternatives that turn out to be no longer possible...” (His italics)

The moral of this story seems to be: assume priors over generators, not over sequences. A noninformative prior over the reals will never learn that the digit after 0100 is more likely to be 1, no matter how much data you feed it.
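That moral can be made concrete with the coin example above. A uniform prior over the 64 length-6 sequences is equivalent to independent fair flips, so the predictive probability of the next bit never moves; a uniform prior over the coin’s unknown bias (a generator) yields Laplace’s rule of succession instead:

```python
from fractions import Fraction

def predictive_uniform_over_sequences(history):
    """Prior: all binary sequences equally likely (the '1/64' prior).
    Then bits are independent fair coin flips, so no amount of data
    moves the prediction for the next bit."""
    return Fraction(1, 2)

def predictive_uniform_over_generators(history):
    """Prior: uniform over the coin's unknown bias p (a generator).
    Integrating out p gives Laplace's rule of succession."""
    n, h = len(history), sum(history)
    return Fraction(h + 1, n + 2)

history = [1, 1, 1, 1, 1, 1]  # six heads in a row
assert predictive_uniform_over_sequences(history) == Fraction(1, 2)
assert predictive_uniform_over_generators(history) == Fraction(7, 8)
```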

• Right, that is a good piece. But I’m afraid I was unclear. (Sorry if I was.) I’m looking for a prior over stationary sequences of digits, not just sequences. I guess the adjective “stationary” can be interpreted in two compatible ways: either I’m talking about sequences such that for every possible string w the proportion of substrings of length |w| that are equal to w, among all substrings of length |w|, tends to a limit as you consider more and more substrings (either extending forward or backward in the sequence); this would not quite be a prior over generators, and isn’t what I meant.

The cleaner thing I could have meant (and did) is the col­lec­tion of sta­tion­ary se­quence-val­ued ran­dom vari­ables, each of which (up to iso­mor­phism) is com­pletely de­scribed by the prob­a­bil­ities p_w of a string of length |w| com­ing up as w. Th­ese, then, are gen­er­a­tors.

• Janos, I spent some days pars­ing your re­quest and it’s quite com­plex. Cosma Shal­izi’s the­sis and al­gorithm seem to ad­dress your prob­lem in a fre­quen­tist man­ner, but I can’t yet work out any good Bayesian solu­tion.

• One issue with, say, taking a normal distribution and letting the variance go to infinity (which is the improper prior I normally use) is that the posterior distribution is going to have a finite mean, which may not be a desired property of the resulting distribution.

You’re right that there’s no es­sen­tial rea­son to re­late things back to the re­als, I was just us­ing that to illus­trate the difficulty.

I was think­ing about this a lit­tle over the last few days and it oc­curred to me that one model for what you are dis­cussing might ac­tu­ally be an in­finite graph­i­cal model. The in­finite bi-di­rec­tional se­quence here are the val­ues of bernoulli-dis­tributed ran­dom vari­ables. Prob­a­bly the most in­ter­est­ing case for you would be a Markov-ran­dom field, as the stochas­tic ‘pat­terns’ you were dis­cussing may be de­scribed in terms of de­pen­den­cies be­tween ran­dom vari­ables.

Here’s three papers I read a little while back on the topic of (and related to) something called an Indian Buffet process: (http://www.cs.utah.edu/~hal/docs/daume08ihfrm.pdf) (http://cocosci.berkeley.edu/tom/papers/ibptr.pdf) (http://www.cs.man.ac.uk/~mtitsias/papers/nips07.pdf)

Th­ese may not quite be what you are look­ing for since they deal with a bound on the ex­tent of the in­ter­ac­tions, you prob­a­bly want to think about prob­a­bil­ity dis­tri­bu­tions of bi­nary ma­tri­ces with an in­finite num­ber of rows and columns (which would cor­re­spond to an ad­ja­cency ma­trix over an in­finite graph).

• Per­haps we can try an ex­per­i­ment? We have here, ap­par­ently, both Bayesi­ans and fre­quen­tists; or at a min­i­mum, peo­ple knowl­edge­able enough to be able to ap­ply both meth­ods. Sup­pose I gen­er­ate 25 data points from some dis­tri­bu­tion whose na­ture I do not dis­close, and ask for es­ti­mates of the true mean and stan­dard de­vi­a­tion, from a Bayesian and a fre­quen­tist? The un­der­ly­ing anal­y­sis would also be wel­come. If nec­es­sary we could ex­tend this to 100 sets of data points, ask for 95% con­fi­dence in­ter­vals, and see if the meth­ods are well cal­ibrated. (This does prob­a­bly re­quire some bet­ter method of trans­fer­ring data than blog com­ments, though.)

As a start, here is one data set:

617.91 16.8539 83.4021 141.504 545.112 215.863 553.168 414.435 4.71129 609.623 117.189 −102.648 647.449 283.57 286.838 710.811 505.826 79.3366 171.816 105.332 540.313 429.298 −314.32 255.93 382.471

It is pos­si­ble that this task does not have suffi­cient difficulty to dis­t­in­guish be­tween the ap­proaches. If so, how can we add con­straints to get differ­ent an­swers?
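For what it’s worth, here is the standard frequentist answer to the challenge as posed, assuming a roughly normal model: sample mean, sample standard deviation, and a 95% t-interval. (If I recall correctly, under the usual noninformative prior on the mean and variance the Bayesian posterior interval for the mean coincides with this t-interval, so on this particular task the two camps may not even disagree.)

```python
import statistics

# The 25 data points posted above.
data = [617.91, 16.8539, 83.4021, 141.504, 545.112, 215.863, 553.168,
        414.435, 4.71129, 609.623, 117.189, -102.648, 647.449, 283.57,
        286.838, 710.811, 505.826, 79.3366, 171.816, 105.332, 540.313,
        429.298, -314.32, 255.93, 382.471]

mean = statistics.mean(data)
sd = statistics.stdev(data)      # sample standard deviation (n - 1)
t_crit = 2.064                   # two-sided 95% t critical value, 24 df
half_width = t_crit * sd / len(data) ** 0.5
print(f"mean = {mean:.1f}, sd = {sd:.1f}, "
      f"95% CI = ({mean - half_width:.1f}, {mean + half_width:.1f})")
```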

• There’s a difficulty with your experimental setup in that you are implicitly invoking a probability distribution over probability distributions (since your choice of a distribution is itself random). The results are going to be highly dependent upon how you construct your distribution over distributions. If your outcome space of probability distributions is infinite (which is what I would expect), and you sampled from a broad enough class of distributions, then a sampling of 25 data points is not enough data to say anything substantive.

A friend of yours who knows what dis­tri­bu­tions you’re go­ing to se­lect from, though, could in­cor­po­rate that knowl­edge into a prior and then use that to win.

So, I pre­dict that for your setup there ex­ists a Bayesian who would be able to con­sis­tently win.

But, if you gave much more data and you sampled from a rich enough set of probability distributions that priors would become hard to specify, a frequentist procedure would probably win out.

• Hmm. I don’t know if I’m a very ran­dom source of dis­tri­bu­tions; hu­mans are no­to­ri­ously bad at ran­dom­ness, and there are only so many dis­tri­bu­tions read­ily available in stan­dard libraries. But in any case, I don’t see this as a difficulty; a real-world prob­lem is un­der no obli­ga­tion to give you an eas­ily recog­nised dis­tri­bu­tion. If Bayesi­ans do bet­ter when the dis­tri­bu­tion is un­known, good for them. And if not, tough beans. That is pre­cisely the sort of thing we’re try­ing to mea­sure!

I don’t think, though, that the ex­is­tence of a Bayesian who can win, based on know­ing what dis­tri­bu­tions I’m likely to use, is a very strong state­ment. Similarly there ex­ists a fre­quen­tist who can win based on watch­ing over my shoulder when I wrote the pro­gram! You can always win by in­vok­ing spe­cial knowl­edge. This does not say any­thing about what would hap­pen in a real-world prob­lem, where spe­cial knowl­edge is not available.

• You can actually simulate a tremendous number of distributions (and theoretically any, to an arbitrary degree of accuracy) by applying an approximate inverse CDF to a standard uniform random variable (see here for example). So the space of distributions from which you could select to do your test is potentially infinite. We can then think of your selection of a probability distribution as being a random experiment and model your selection process using a probability distribution.
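The inverse-CDF recipe is short enough to show in full. The exponential distribution is used here only because its inverse CDF has a closed form; nothing else about the choice matters.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-CDF sampling: if U ~ Uniform(0, 1) and F is the target
    CDF, then F^{-1}(U) has distribution F. For Exponential(rate),
    F^{-1}(u) = -ln(1 - u) / rate."""
    u = rng.random()
    return -math.log(1.0 - u) / rate

rng = random.Random(42)
xs = [sample_exponential(2.0, rng) for _ in range(200_000)]

# The mean of Exponential(2) is 0.5; the sample mean should be close.
assert abs(sum(xs) / len(xs) - 0.5) < 0.01
```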

The issue is that since the outcome space is the space of all computable probability distributions, Bayesians will have consistency problems (another good paper on the topic is here), i.e. the posterior distribution won’t converge to the true distribution. So in this particular setup I think Bayesian methods are inferior unless one could devise a good prior over distributions. I suppose if I knew that you didn’t know how to sample from arbitrary probability distributions, then by putting that in my prior I might be able to use Bayesian methods to successfully estimate the probability distribution (the discussion of the Bayesian who knew you personally was meant to be tongue-in-cheek).

In the frequentist case there is a known procedure due to Parzen from the 1960s.
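The Parzen procedure referred to here is, I believe, kernel (Parzen-window) density estimation; a minimal Gaussian-kernel version, with made-up data, looks like this:

```python
import math

def parzen_density(x, data, h):
    """Parzen-window (kernel) density estimate with a Gaussian kernel:
    an average of bumps of bandwidth h centred on the data points."""
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(gauss((x - xi) / h) for xi in data) / (len(data) * h)

data = [-1.2, -0.9, -1.1, 0.9, 1.0, 1.3]   # two visible clumps
# The estimate is higher near the clumps than in the gap between them.
assert parzen_density(-1.0, data, 0.3) > parzen_density(0.0, data, 0.3)
assert parzen_density(1.0, data, 0.3) > parzen_density(0.0, data, 0.3)
```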

All of these are asymptotic results, however, and your experiment seems to be focused on very small samples. To the best of my knowledge there aren’t many results in this case except under special conditions. I would say that without more constraints on the experimental design I don’t think you’ll get very interesting results, although I am actually really in favor of such evaluations, because people in statistics and machine learning for a variety of reasons don’t do them, or don’t do them on a broad enough scale. Anyway, if you actually are interested in such things you may want to start looking here, since statistics and machine learning both have the tools to properly design such experiments.

• The small sam­ples are a con­straint im­posed by the limits of blog com­ments; there’s a limit to how many num­bers I would feel com­fortable spam­ming this place with. If we got some vol­un­teers, we might do a more se­ri­ous sam­ple size us­ing hosted ROOT ntu­ples or zip­ping up some plain ASCII.

I do know how to sam­ple from ar­bi­trary dis­tri­bu­tions; I should have speci­fied that the space of dis­tri­bu­tions is those for which I don’t have to think for more than a minute or so, or in other words, some­one has already coded the CDF in a library I’ve already got in­stalled. It’s not knowl­edge but work that’s the limit­ing fac­tor. :) Pre­sum­ably this limits your prior quite a lot already, there be­ing only so many com­monly used math libraries.

• Ha ha—this is a Bayesian prob­lem drawn from a Bayesian per­spec­tive!

Surely a fre­quen­tist would have a differ­ent per­spec­tive and pro­pose a differ­ent kind of solu­tion. In­stead of de­sign­ing an ex­per­i­ment to de­ter­mine which is bet­ter, how about ex­trap­o­lat­ing from the ev­i­dence we already have. Hu­mans have made a cer­tain amount of progress in math­e­mat­ics—has this math­e­mat­ics been mainly de­vel­oped by fre­quen­tists or Bayesi­ans?

(Case closed, I think.)

I roughly con­sider Bayesi­ans the ex­per­i­men­tal sci­en­tists and fre­quen­tists the the­o­ret­i­cal sci­en­tists. Math­e­mat­ics is the­o­ret­i­cal, which is why the fre­quen­tists cluster there. Do you dis­agree with this?

(Nev­er­the­less, the challenge sounds fun.)

• You could use the same ar­gu­ment against the use of com­put­ers in sci­ence—af­ter all, New­ton didn’t have a com­puter, and nei­ther did Ein­stein. Case closed, I think.

• This is the com­ment Nomin­ull was refer­ring to:

Ha ha—this is a Bayesian prob­lem drawn from a Bayesian per­spec­tive!

Surely a fre­quen­tist would have a differ­ent per­spec­tive and pro­pose a differ­ent kind of solu­tion. In­stead of de­sign­ing an ex­per­i­ment to de­ter­mine which is bet­ter, how about ex­trap­o­lat­ing from the ev­i­dence we already have. Hu­mans have made a cer­tain amount of progress in math­e­mat­ics—has this math­e­mat­ics been mainly de­vel­oped by fre­quen­tists or Bayesi­ans?

(Case closed, I think.)

I roughly con­sider Bayesi­ans the ex­per­i­men­tal sci­en­tists and fre­quen­tists the the­o­ret­i­cal sci­en­tists. Math­e­mat­ics is the­o­ret­i­cal, which is why the fre­quen­tists cluster there. Do you dis­agree with this?

(Nev­er­the­less, the challenge sounds fun.)

My re­sponse to Nomin­ull: the cases aren’t re­ally par­allel, but I do need to em­pha­size that I don’t think the Bayesian per­spec­tive is wrong; it just hasn’t been the per­spec­tive, his­tor­i­cally, of most math­e­mat­i­ci­ans.

… but, finally, when I think of Bayesian mathematics being a new or under-utilised thing, I see an analogy with computers. Perhaps Bayesian theory could be a powerhouse for new mathematics. I guess my perspective was that mathematicians will use whichever tools are available to them, and they used frequentist theory instead. But perhaps they didn’t understand Bayesian tools, or it wasn’t the time for them yet.

• This is the com­ment Nomin­ull was refer­ring to:

Voted the cour­tesy re­post back up to zero. I most likely down­voted the origi­nal post for blatant silli­ness but re­ally, why pe­nal­ise po­lite­ness? In fact, I’d up­vote the deleted great grand­par­ent for demon­strat­ing chang­ing one’s mind (on the ap­pli­ca­bil­ity of a par­tic­u­lar point), in defi­ance of rather strong bi­ases against do­ing that.

I roughly con­sider Bayesi­ans the ex­per­i­men­tal sci­en­tists and fre­quen­tists the the­o­ret­i­cal sci­en­tists. Math­e­mat­ics is the­o­ret­i­cal, which is why the fre­quen­tists cluster there. Do you dis­agree with this?

I consider frequentist experimental scientists to be potentially competent in what they do. After all, available frequentist techniques are good enough that the significant problems with the application of statistics are in the misuse of frequentist tools, more so than in their being used at all. As for theoretical frequentists… I suggest that anyone who makes a serious investigation into developments in probability theory and statistics will not remain a frequentist. I claim that what ‘theoretical frequentists’ do is orthogonal to theory (but often precisely in line with what academia is really about).

• What does one read to be­come well versed in this stuff in two days; and how much skill with maths does it re­quire?

• Ouch! Now I see the two days stuff looks like boast­ing. Don’t worry, all my LW posts up to now have con­tained stupid math­e­mat­i­cal mis­takes, and chances are peo­ple will find er­rors in this one too :-)

(ETA: sure enough, Eliezer has found one. Luck­ily it wasn’t crit­i­cal.)

I have a de­gree in math and com­peted at the na­tional level in my teens (both in Rus­sia), but haven’t done any se­ri­ous math since I grad­u­ated six years ago. The sources for this post were mostly Wikipe­dia and Google searches on key­words from Wikipe­dia.

• My com­ment was an hon­est ques­tion and was not in­tended as deroga­tory...

• I’m sur­prised that no­body has men­tioned the Univer­sal Prior yet. Eliezer also wrote a post on it.

• I think this was a great post for hav­ing both con­text and links and speci­fi­cally (rather than gen­er­ally) ques­tion­ing as­sump­tions the group hasn’t vis­ited in a while (if ever).

• … What is it that fre­quen­tists do, again? I’m a lit­tle out of touch.

• Strong ev­i­dence can always defeat strong pri­ors, and vice versa.

Is there any­thing more to the is­sue than this?

• This isn’t always the case if the prior puts zero prob­a­bil­ity weight on the true model. This can be avoided on finite out­come spaces, but for in­finite out­come spaces no mat­ter how much ev­i­dence you have you may not over­come the prior.

• I thought that 0 and 1 were Bayesian sins, unattain­able +/​- in­finity on the log-odds scale, and how­ever strong your pri­ors, you never make them that strong.

In finite-dimensional parameter spaces, sure, this makes perfect sense. But suppose that we are considering a stochastic process X1, X2, X3, … where Xn follows a distribution Pn over the integers. Now put a prior on the distribution, and suppose that unbeknownst to you, Pn is the distribution that puts 1/2 probability weight on −n and 1/2 probability weight on n. If the prior on the stochastic process does not put increasing weight on integers with large absolute value, then in the limit the prior puts zero probability weight on the true distribution (and may start behaving strangely quite early on in the process).

Another case is that the true probability model may be too complicated to write down, or computationally infeasible to work with (say a Gaussian mixture with 10^10 mixture components, which is certainly reasonable in a modern high-dimensional database). One may then only consider probability distributions that approximate the true distribution and put zero weight on the true model; i.e., it would be sensible in that case to have a prior that puts zero weight on the true model and to search only for an approximation.
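To make the zero-weight point concrete: under Bayes' rule, a hypothesis that starts at probability zero stays at zero no matter how lopsided the likelihoods are. A minimal sketch with invented numbers:

```python
# Minimal sketch (invented numbers): a hypothesis with prior probability 0
# stays at 0 no matter how strongly the data favor it.
priors = {"H_true": 0.0, "H_other": 1.0}
likelihoods = {"H_true": 0.9, "H_other": 0.1}  # each observation favors H_true 9:1

for _ in range(100):  # 100 independent observations
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnorm.values())
    priors = {h: v / z for h, v in unnorm.items()}

print(priors["H_true"])  # still 0.0
```

A hundred observations each favoring the true model 9:1 accomplish nothing, because zero times anything is zero.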

• I didn’t mean to re­ha­bil­i­tate fre­quen­tism! I only meant to point out that cal­ibra­tion is a fre­quen­tist op­ti­mal­ity crite­rion, and that it’s one that Bayesian pos­te­rior in­ter­vals can be proved not to have in gen­eral.
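For anyone wondering what the calibration criterion looks like in practice, here is a quick simulation sketch (all numbers invented): each trial draws a fresh unknown parameter, however it likes, and the standard 90% confidence interval still covers the truth about 90% of the time.

```python
import math
import random
import statistics

random.seed(1)
trials, n, sigma = 1000, 25, 1.0
covered = 0
for _ in range(trials):
    mu = random.uniform(-100, 100)        # a fresh unknown parameter each time
    data = [random.gauss(mu, sigma) for _ in range(n)]
    m = statistics.fmean(data)
    half = 1.6449 * sigma / math.sqrt(n)  # 90% two-sided normal interval, known sd
    covered += (m - half) <= mu <= (m + half)
print(covered / trials)  # close to 0.90, regardless of how mu was drawn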

• Too late. I have already up­dated to be­lieve that a the­ory that de­mands pri­ors can’t be com­plete. Cor­rect, maybe, but not com­plete. We should work out an ap­proach that works well on more crite­ria in­stead of guard­ing the truth of what we already know.

If Bayes were the com­plete an­swer, Jaynes wouldn’t have felt the need to in­vent max­ent or gen­er­al­ize the in­differ­ence prin­ci­ple. That may be the cor­rect di­rec­tion of in­quiry.

ETA: this was a re­sponse to Cyan say­ing he didn’t mean to re­ha­bil­i­tate fre­quen­tism. :-)

• I’d like to take advantage of frequentism’s return to respectability to ask if anyone knows where I can get a copy of “An Introduction to the Bootstrap” by Efron and Tibshirani.

It’s on Google books, but I don’t like read­ing things through Google books. It’s for sale on-line, but it costs a lot and ship­ping takes a while. My uni­ver­sity’s library is sup­posed to have it, but the librar­i­ans can’t find it. My lo­cal library hasn’t heard of it.

I hardly know any statistics or probability; I’ve just been borrowing bits and pieces as I need them without worrying about Bayesianism vs. frequentism.

There is a lit­tle some­thing that’s been both­er­ing me in the back of my mind when I see Eliezer wax­ing po­etic about bayesi­anism. Maybe this is an ig­no­rant ques­tion, but here it is:

If bayesi­ans don’t be­lieve in a true prob­a­bil­ity wait­ing to be ap­prox­i­mated, only in prob­a­bil­ities as­signed by a mind, how do they jus­tify seek­ing ad­di­tional data? The rules re­quire you to re­act to new data by mov­ing your as­signed prob­a­bil­ity in a cer­tain way, but, with­out some­thing de­sir­able that you’re mov­ing to­wards, why is it good to have that new data?

• If bayesi­ans don’t be­lieve in a true prob­a­bil­ity wait­ing to be ap­prox­i­mated, only in prob­a­bil­ities as­signed by a mind, how do they jus­tify seek­ing ad­di­tional data? The rules re­quire you to re­act to new data by mov­ing your as­signed prob­a­bil­ity in a cer­tain way, but, with­out some­thing de­sir­able that you’re mov­ing to­wards, why is it good to have that new data?

Collecting new data is not justifiable in general—the cost of the new data may outweigh the benefit to be gained from it. But let’s assume that collecting new data has a negligible cost. As a Bayesian, what you desire is the smallest loss possible. For reasonable loss functions, the smaller the region over which your distribution spreads its uncertainty (that is to say, the smaller its variance), the smaller you expect your loss to be. The law of total variance can be interpreted to say that you expect the variance of the posterior distribution to be smaller than the variance of the prior distribution.* So collect more data!

* law of to­tal var­i­ance: prior var­i­ance = prior ex­pec­ta­tion of pos­te­rior var­i­ance + prior var­i­ance of pos­te­rior mean. This im­plies that the prior var­i­ance is larger than the prior ex­pec­ta­tion of pos­te­rior var­i­ance.
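The footnote can be checked exactly in a conjugate toy model. A sketch, assuming a Beta(2, 3) prior on a coin's heads-frequency and a planned run of 10 flips (numbers invented):

```python
import math

def beta_fn(x, y):
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

a, b, n = 2.0, 3.0, 10  # Beta(2, 3) prior, 10 planned flips
prior_mean = a / (a + b)
prior_var = a * b / ((a + b) ** 2 * (a + b + 1))

def marginal(k):
    # beta-binomial: prior predictive probability of k heads in n flips
    return math.comb(n, k) * beta_fn(a + k, b + n - k) / beta_fn(a, b)

exp_post_var = 0.0   # prior expectation of posterior variance
var_post_mean = 0.0  # prior variance of posterior mean
for k in range(n + 1):
    p = marginal(k)
    a2, b2 = a + k, b + n - k  # conjugate update
    post_mean = a2 / (a2 + b2)
    post_var = a2 * b2 / ((a2 + b2) ** 2 * (a2 + b2 + 1))
    exp_post_var += p * post_var
    var_post_mean += p * (post_mean - prior_mean) ** 2

# law of total variance: the two pieces recover the prior variance exactly
print(abs(prior_var - (exp_post_var + var_post_mean)) < 1e-12)  # True
print(exp_post_var < prior_var)                                 # True
```

Since the variance of the posterior mean is strictly positive, the expected posterior variance must come out smaller than the prior variance, which is the comment's argument for collecting data.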

• So, more data is good be­cause it makes you more con­fi­dent? I guess that makes sense, but it still seems strange not to care what you’re con­fi­dent in.

• In any real prob­lem there is a con­text and some prior in­for­ma­tion. Bayes doesn’t give this to you—you give it to Bayes along with the data and turn the crank on the ma­chin­ery to get the pos­te­rior. The things you’re con­fi­dent about are in the con­text.

• In the­ory, if you can change your mind about some­thing, you have un­cer­tainty about it, and your prior dis­tri­bu­tion should re­flect that. In prac­tice, you ab­stract the un­cer­tainty away by mak­ing some sim­plify­ing as­sump­tions, do the anal­y­sis con­di­tional on your as­sump­tions, and re­serve the right to re­visit the as­sump­tions if they don’t seem ad­e­quate.

• I didn’t mean to ask how a bayesian changes his or her mind. I meant to ask how the thing you be­lieve in can be in the con­text in situ­a­tions where you change your mind based on new ev­i­dence.

• Let’s say I’m weigh­ing some acry­lamide pow­der on an elec­tronic bal­ance. (Gonna make me some poly­acry­lamide gel!) The bal­ance is so sen­si­tive that small changes in air pres­sure reg­ister in the last two digits. From what I know about air pres­sure vari­a­tions from hav­ing done this be­fore, I cre­ate a model for the data. Also be­cause I’ve done this be­fore, I can eye­ball roughly how much pow­der I’ve got on the bal­ance; this de­ter­mines my prior dis­tri­bu­tion be­fore read­ing the bal­ance. Then I ob­serve some data from the bal­ance read­out and up­date my dis­tri­bu­tion.
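A sketch of that update as a conjugate normal model (the specific numbers are invented for illustration):

```python
# Invented numbers: an eyeballed prior on the powder's mass (grams) and a
# noisy balance reading, both modeled as Gaussian.
prior_mean, prior_var = 50.0, 4.0   # eyeball estimate: about 50 g, sd 2 g
noise_var = 0.01                    # balance noise (air pressure), sd 0.1 g
reading = 51.3                      # observed readout

# Conjugate normal update: precisions add, the mean is precision-weighted.
post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
post_mean = post_var * (prior_mean / prior_var + reading / noise_var)
print(round(post_mean, 3), round(post_var, 5))
```

The posterior lands almost exactly on the balance reading, because the balance is far more precise than the eyeball prior.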

• I can’t tell with­out more in­for­ma­tion whether that’s an ex­am­ple of what I mean by “chang­ing your mind.” Here’s one that I think definitely qual­ifies:

Let’s say you’re go­ing to bet on a coin toss. You only have a small amount of in­for­ma­tion on the coin, and you de­cide for what­ever rea­son that there’s a 51% chance of get­ting heads. So you’re go­ing to bet on heads. But then you re­al­ize that there’s a way to get more data.

At this point, I’m think­ing, “Gee, I hardly know any­thing about this coin. Maybe I’m bet­ter off bet­ting on tails and I just don’t know it. I should get that data.”

What I think you’re say­ing about bayesi­ans is that a bayesian would say, “Gee, 51% isn’t very high. I’d like to be at least 80% sure. Since I don’t know very much yet, it wouldn’t take much more to get to 80%. I should get that data so I can bet on heads with con­fi­dence.”

Which sort of makes sense but is also a lit­tle strange.

• Tech­ni­cal stuff: un­der the stan­dard as­sump­tion of in­finite ex­change­abil­ity of coin tosses, there ex­ists some limit­ing rel­a­tive fre­quency for coin toss re­sults. (This is de Finetti’s the­o­rem.)

Key point: I have a probability distribution for this relative frequency (call it f)—not a probability of a probability.

You only have a small amount of in­for­ma­tion on the coin, and you de­cide for what­ever rea­son that there’s a 51% chance of get­ting heads. So you’re go­ing to bet on heads. But then you re­al­ize that there’s a way to get more data.

Here you’ve said that my prob­a­bil­ity den­sity for f is dis­persed, but slightly asym­met­ric. I too can say, “Well, I have an awful lot of prob­a­bil­ity mass on val­ues of f less than 0.5. I should col­lect more in­for­ma­tion to tighten this up.”

“Gee, 51% isn’t very high. I’d like to be at least 80% sure. Since I don’t know very much yet, it wouldn’t take much more to get to 80%. I should get that data so I can bet on heads with con­fi­dence.”

This mixes up f on the one hand with my dis­tri­bu­tion for f on the other. I can cer­tainly col­lect data un­til I’m 80% sure that f is big­ger than 0.5 (pro­vided that f re­ally is big­ger than 0.5). This is dis­tinct from be­ing 80% sure of get­ting heads on the next toss.
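The distinction is easy to see numerically. A sketch assuming a uniform Beta(1, 1) prior on f and invented data of 6 heads in 10 tosses: the probability of heads on the next toss and the probability that f exceeds 0.5 come out as different numbers.

```python
import math

def beta_pdf(f, a, b):
    coef = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return coef * f ** (a - 1) * (1 - f) ** (b - 1)

def prob_f_above_half(a, b, steps=100_000):
    # crude midpoint-rule integral of the Beta(a, b) density over (0.5, 1]
    h = 0.5 / steps
    return sum(beta_pdf(0.5 + (i + 0.5) * h, a, b) for i in range(steps)) * h

# Uniform Beta(1, 1) prior updated on 6 heads, 4 tails (invented data)
a, b = 1 + 6, 1 + 4
p_next_heads = a / (a + b)             # predictive probability of heads next toss
p_f_gt_half = prob_f_above_half(a, b)  # posterior probability that f > 0.5
print(round(p_next_heads, 3), round(p_f_gt_half, 3))  # 0.583 0.726
```

Being about 73% sure that f exceeds 0.5 is not the same thing as a 58% chance of heads on the next toss, which is the point being made here.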

• I guess I just don’t un­der­stand the differ­ence be­tween bayesi­anism and fre­quen­tism. If I had seen your dis­cus­sion of limit­ing rel­a­tive fre­quency some­where else, I would have called it fre­quen­tist.

I think I’ll go back to bor­row­ing bits and pieces. (Thank you for some nice ones.)

• The key differ­ence is that a fre­quen­tist would not ad­mit the le­gi­t­i­macy of a dis­tri­bu­tion for f—the data are ran­dom, so they get a dis­tri­bu­tion, but f is fixed, al­though un­known. Bayesi­ans say that quan­tities that are fixed but un­known get prob­a­bil­ity dis­tri­bu­tions that en­code the in­for­ma­tion we have about them.

• Being a frequentist who hangs out on a Bayesian forum, I’ve thought about the difference between the two perspectives. I think the dichotomy is analogous to bottom-up versus top-down thinking; neither one is superior to the other, but the usefulness of each waxes and wanes depending upon the current state of a scientific field. I think we need both to develop any field fully.

Pos­si­bly my un­der­stand­ing of the differ­ence be­tween a fre­quen­tist and Bayesian per­spec­tive is differ­ent than yours (I am a fre­quen­tist af­ter all) so I will de­scribe what I think the differ­ence is here. I think the two POVs can definitely come to the same (true) con­clu­sions, but the al­gorithm/​thought-pro­cess feels differ­ent.

Consider tossing a fair coin. Everyone observes that on average, heads comes up 50% of the time. A frequentist sees the coin-tossing as a realization of the abstract Platonic truth that the coin has a 50% chance of coming up heads. A Bayesian, in contrast, believes that the realization is the primary thing … the flipping of the coin yields the property of having a 50% probability of coming up heads as you flip it. So both perspectives require the observation of many flips to ascertain that the coin is indeed fair; the only difference between the two views is that the frequentist sees the “50% probability of being heads” as something that exists independently of the flips. It’s something you discover rather than something you create.

Seen this way, it sounds like fre­quen­tists are Pla­ton­ists and Bayesi­ans are non-Pla­ton­ists. Ab­stract math­e­mat­i­ci­ans tend to be Pla­ton­ists (but not always) and they’ve lent their bias to the field. Smart Bayesi­ans, on the other hand, tend to be more prac­ti­cal and be­come ex­per­i­men­tal­ists.

There’s definitely a cer­tain ran­kle be­tween Pla­ton­ists and non-Pla­ton­ists. Non-pla­ton­ists think that Pla­ton­ists are nuts, and Pla­ton­ists think that the non-Pla­ton­ists are too literal.

May we con­sider the hy­poth­e­sis that this differ­ence is just a differ­ence in brain hard-wiring? When a Pla­ton­ist thinks about a coin flip­ping and the prob­a­bil­ity of get­ting heads, they re­ally do per­ceive this “prob­a­bil­ity” as ex­ist­ing in­de­pen­dently. How­ever, what do they mean by “ex­ist­ing in­de­pen­dently”? We learn what words mean from ex­pe­rience. A Pla­ton­ist has ex­pe­rience of this type of per­cep­tion and knows what they mean. A non-Pla­ton­ist doesn’t know what is meant and thinks the same thing is meant as what ev­ery­one means when they say “a table ex­ists”. Th­ese types of ex­is­tence are differ­ent, but how can a Bayesian un­der­stand the Pla­tonic mean­ing with­out the Pla­tonic ex­pe­rience?

A Bayesian should just ob­serve what does ex­ist, and what words the Pla­ton­ist uses, and re­define the words to match the ex­pe­rience. This trans­la­tion must be done similarly with all fre­quen­tist math­e­mat­ics, if you are a Bayesian.

• Seen this way, it sounds like fre­quen­tists are Pla­ton­ists and Bayesi­ans are non-Pla­ton­ists.

Coun­terex­am­ple: I have a Pla­tonic view of math­e­mat­i­cal truths, but a Bayesian view of prob­a­bil­ity.

A fre­quen­tist sees the coin-toss­ing as a re­al­iza­tion of the ab­stract Pla­tonic truth that the coin has a 50% chance of com­ing up heads.

This does not make sense. For any given coin flip, ei­ther the fun­da­men­tal truth is that the coin will come up heads, or the fun­da­men­tal truth is that the coin will come up tails. The 50% prob­a­bil­ity rep­re­sents my un­cer­tainty about the fun­da­men­tal truth, which is not a prop­erty of the coin.

• Coun­terex­am­ple: I have a Pla­tonic view of math­e­mat­i­cal truths, but a Bayesian view of prob­a­bil­ity.

That’s in­ter­est­ing. I had imag­ined that peo­ple would be one way or the other about ev­ery­thing. Can any­one else provide dat­a­points on whether they are Pla­tonic about only a sub­set of things?

… in or­der to tri­an­gu­late closer to whether Pla­ton­ism is “hard-wired”, do you find it pos­si­ble to be non-Pla­tonic about math­e­mat­i­cal truths? Can some­one who is non-Pla­tonic think about them Pla­ton­i­cally—is it a choice?

For any given coin flip, ei­ther the fun­da­men­tal truth is that the coin will come up heads, or the fun­da­men­tal truth is that the coin will come up tails. The 50% prob­a­bil­ity rep­re­sents my un­cer­tainty about the fun­da­men­tal truth, which is not a prop­erty of the coin.

See, that’s just not the way a frequentist sees it. First, I notice you are defining “fundamental truth” as what will actually happen in the next coin flip. In contrast, it is more natural to me to think of the “fundamental truth” as being what the probability of heads is, as a property of the coin and the flip, since the outcome isn’t determined yet. But that’s just asking different questions. So if the question is what the truth about the outcome of the next flip is, we are talking about empirical reality (an experiment) and my perspective will be more Bayesian.

• since the out­come isn’t de­ter­mined yet

The out­come is de­ter­mined time­lessly, by the prop­er­ties of the coin-toss­ing setup. It hasn’t hap­pened yet. What came be­fore the coin de­ter­mines the coin, but in turn is de­ter­mined by the stuff lo­cated fur­ther and fur­ther in the past from the ac­tual coin-toss. It is a type er­ror to speak of when the out­come is de­ter­mined.

• Whether or not the uni­verse is de­ter­minis­tic is not de­ter­mined yet. Even if you and I both think that a de­ter­minis­tic uni­verse is more log­i­cal, we should ac­cept that cer­tain figures of speech will per­sist. When I said the toss wasn’t de­ter­mined yet, I meant that the out­come of the toss was not known yet by me. I don’t see how your cor­rec­tion adds to the dis­cus­sion ex­cept pos­si­bly to make me seem naive, like I’ve never con­sid­ered the con­cept of de­ter­minism be­fore.

• what the prob­a­bil­ity of heads is, as a prop­erty of the coin and the flip

I meant that the out­come of the toss was not known yet by me

Map/​ter­ri­tory dis­tinc­tion. As a prop­erty of the ac­tual coin and flip, the prob­a­bil­ity of heads is 0 or 1 (mod­ulo some nonzero but ut­terly neg­ligible quan­tum un­cer­tainty); as a prop­erty of your state of knowl­edge, it can be 0.5.

• This com­ment helped things come into bet­ter fo­cus for me.

A fre­quen­tist be­lieves that there is a prob­a­bil­ity of flip­ping heads, as a prop­erty of the coin and (yes, cer­tainly) the con­di­tions of the flip­ping. To a fre­quen­tist, this prob­a­bil­ity is in­de­pen­dent of whether the out­come is de­ter­mined or not and is even in­de­pen­dent of what the out­come is. Con­sider the fol­low­ing se­quence of flips: H T T

A fre­quen­tist be­lieves that the prob­a­bil­ity of flip­ping heads was .5 all along right? The first ‘H’ and the sec­ond ‘T’ and the third ‘T’ were just dis­crete re­al­iza­tions of this prob­a­bil­ity.

The reason why I’ve been calling this a Platonic perspective is because I think the critical difference in philosophy is the frequentist idea of this non-empirical “probability” existing independent of realizations. The probability of flipping heads for a set of conditions is .5 whether you actually flip the coins or not. However, frequentists agree you must flip the coin to know that the probability was .5.

You might think this per­spec­tive is wrong-headed, and from a strict em­piri­cal view where you al­low no Pla­tonic en­tities/​con­cepts, it kind of is. But the ques­tion I am re­ally in­ter­ested in is the fol­low­ing: to what ex­tent is this point of view a choice we can be wrong or right about, or a per­spec­tive that some (or most?) peo­ple have hard-wired in their phys­i­cal brain? Fur­ther, how can you ar­gue that it isn’t use­ful when it demon­stra­bly has been so use­ful? Per­haps it fa­cil­i­tates or is nec­es­sary for some cat­e­gories of ab­stract thought.

• But the ques­tion I am re­ally in­ter­ested in is the fol­low­ing: to what ex­tent is this point of view a choice we can be wrong or right about, or a per­spec­tive that most peo­ple have hard-wired in their phys­i­cal brain al­gorithms?

It could be hard-wired and still be right or wrong.

• Cor­rect, gen­er­ally. But how could a per­spec­tive be wrong?

I can think of two ways a per­spec­tive can be wrong: ei­ther be­cause it (a) as­serts a fact about ex­ter­nal re­al­ity that is not true or (b) yields false con­clu­sions about the ex­ter­nal world.

(a) Frequentists don’t assert anything extra about the empirical world; they assert the use of (and ostensibly, the “existence” of) something symbolic. From the empiricist perspective, it’s not really there. Like a little icon floating above or around the actual thing that your cursor doesn’t interact with, so it can’t be false in the empirical sense.

(b) It would be fascinating if the frequentist perspective yielded false conclusions, and in such a case, is there any doubt that people would develop and embrace new mathematics that avoided such errors? In fact, we already see this happening where physics at extreme scales seems to defy intuition. If someone wanted to propose a new theory of everything, I don’t think anyone would ever criticize it on the grounds of not being frequentist. I guess the point here is just that it’s useful or not.

Later edit: Ok, I finally get it. Maybe the reason we don’t understand physics at the extreme scales is because the frequentist approach was evolved (hard-wired) for understanding intermediate physical scales and it’s (apparently) beginning to fail. You guys are using empirical philosophy to try to develop a new brand of mathematics that won’t have these inborn errors of intuition. So while I argue that frequentism has definitely been productive so far, you argue that it is intrinsically limited based on philosophical principles.

• A per­spec­tive can be wrong if it ar­bi­trar­ily as­signs a prob­a­bil­ity of 1 to an event that has a sym­met­ri­cal al­ter­na­tive. Read the in­tro to My Bayesian En­light­en­ment for Eliezer’s de­scrip­tion of a fre­quen­tist go­ing wrong in this way with re­spect to the prob­lem of the math­e­mat­i­cian with two chil­dren, at least one of which is a boy.

• No, Bayesian prob­a­bil­ity and or­tho­dox statis­tics give ex­actly the same an­swers if the con­text of the prob­lem is the same. The two schools may tend to have differ­ent ideas about what is a “nat­u­ral” con­text, but any good text­book will always define ex­actly what the con­text is so that there is no guess­ing and no dis­agree­ment.

Nevertheless, which event with a symmetrical alternative were you referring to? (You are given that the woman said she has at least 1 boy, so it would be correct to assign that probability 1 in the context of a given assumption, obviously when applying the orthodox method.) Both approaches work differently, but they both work.

• Nev­er­the­less, which event with a sym­met­ri­cal al­ter­na­tive were you refer­ring to?

Given that the woman does have a boy and a girl, what is the probability that she would state that at least one of them is a boy? By symmetry, you would expect a priori, not knowing anything about this person’s preferences, that in the same conditions she is equally likely to state that at least one of her children is a girl, so to assign the conditional probability higher than .5 does not make sense, and it is definitely not right for the frequentist Eliezer was talking with to act as though the conditional probability were 1. (The case could be made that the statement is also evidence that the woman has a tendency to say at least one child is a boy rather than that at least one child is a girl. But this is a small effect, and it still does not justify assigning a conditional probability of 1.)

I think the frequentist approach could handle this problem if applied correctly, but it seems that frequentists in practice get it wrong because they do not even consider the conditional probability that they would observe a piece of evidence if a theory they are considering is true.

any good text­book will always define ex­actly what the con­text is so that there is no guess­ing and no dis­agree­ment.

If you read the ar­ti­cle I cited, Eliezer did ex­plain that this was a man­gling of the origi­nal prob­lem, in which the math­e­mat­i­cian made the state­ment in re­sponse to a di­rect ques­tion, so one could rea­son­ably ap­prox­i­mate that she would make the state­ment ex­actly when it is true.

How­ever, life does not always pre­sent us with neat text­book prob­lems. Some­times, the con­di­tional prob­a­bil­ities are hard to figure out. I pre­fer the ap­proach that says figure them out any­ways to the one that glosses over their im­por­tance.

• so to as­sign the con­di­tional prob­a­bil­ity higher than .5 does not make sense, so it is definitely not right for the fre­quen­tist Eliezer was talk­ing with to act as though the con­di­tional prob­a­bil­ity were 1

In the “correct” formulation of the problem (the one in which the correct answer is 1/3), the frequentist tells us what the mother said as a given assumption; considering the prior (<1) probability of her saying this is rendered irrelevant because we are now working in the subset of probability space where she said that.
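Both formulations can be checked by enumeration and simulation. A sketch (the second variant is one hypothetical way the mangled version can be modeled):

```python
import itertools
import random

# Equally likely two-child families, ordered (elder, younger)
families = list(itertools.product("BG", repeat=2))

# Textbook version: "at least one is a boy" is handed to us as a given.
cond = [f for f in families if "B" in f]
p_two_boys = sum(1 for f in cond if f == ("B", "B")) / len(cond)
print(round(p_two_boys, 3))  # 0.333

# Mangled version: the mother mentions the sex of one child chosen at
# random, and she happens to mention a boy. Now the answer shifts to 1/2.
random.seed(0)
hits = both = 0
for _ in range(100_000):
    f = random.choice(families)
    if random.choice(f) == "B":  # she happened to mention a boy
        hits += 1
        both += f == ("B", "B")
print(round(both / hits, 2))  # close to 0.5
```

The difference between the two answers is entirely in the conditional probability that she would say "at least one is a boy", which is the quantity under dispute in this thread.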

it seems that frequentists in practice get it wrong because they do not even consider the conditional probability that they would observe a piece of evidence if a theory they are considering is true.

Con­sid­er­ing whether a the­ory is true is sci­ence—I com­pletely agree sci­ence has im­por­tant, nec­es­sary Bayesian el­e­ments.

• Con­sid­er­ing whether a the­ory is true is science

Considering whether a theory is true is not science, although the two are certainly useful to each other.

• Giving the “probability” of the actual outcome for the coin flip as ~1 looks like a type error, although it’s clear what you are saying. It’s more like P(coin is heads|coin is heads), tautologically 1, not really a probability.

• Edited to clar­ify.

• As a prop­erty of the ac­tual coin and flip, the prob­a­bil­ity of heads is 0 or 1 (mod­ulo some nonzero but ut­terly neg­ligible quan­tum un­cer­tainty)

This mixes to­gether two differ­ent kinds of prob­a­bil­ity, con­fus­ing the situ­a­tion. There is noth­ing fuzzy about the events defin­ing the pos­si­ble out­comes, the fact that there is also in­dex­i­cal un­cer­tainty im­posed on your mind while it ob­serves the out­come is from a differ­ent prob­lem.

• Yeah, it just felt like too much work to add ”...ran­domly sam­pling from fu­ture Everett branches ac­cord­ing to the Born prob­a­bil­ities” or the like.

• My point is that most of the time de­ci­sion-the­o­retic prob­lems are best han­dled in a de­ter­minis­tic world.

• When I said the toss wasn’t de­ter­mined yet, I meant that the out­come of the toss was not known yet by me.

Hence it’s your un­cer­tainty, which can as well be han­dled in de­ter­minis­tic world. And in de­ter­minis­tic world, I don’t know how to parse your sentence

it is more nat­u­ral to me to think of the “fun­da­men­tal truth” as be­ing what the prob­a­bil­ity of heads is, as a prop­erty of the coin and the flip

• Can any­one else provide dat­a­points [...]

I am a Platonist about mathematics by inclination, though I strongly suspect that this inclination is one that I should resist taking too seriously. I am a Bayesian about probability (at least in the following sense: it seems to me that the Bayesian approach subsumes the others, when they are applied correctly). I am mostly Bayesian about statistics, but don’t see any reason why you shouldn’t compute confidence intervals and unbiased estimators if you want to. I don’t think “Platonist” and “frequentist” are at all the same thing, so I don’t see any of the above as indicating that I’m (inclined to be) Platonist about some things but not about others.

[...] the fun­da­men­tal truth [...]

This seems to have prompted a de­bate about whether The Fun­da­men­tal Truth is one about the gen­eral propen­si­ties of the coin, or one about what will hap­pen the next time it’s flipped. I don’t see why there should be ex­actly one Fun­da­men­tal Truth about the coin; I’d have thought there would be ei­ther none or many de­pend­ing on what sort of thing one wishes to count as a “fun­da­men­tal truth”.

Any­way: imag­ine a pre­ci­sion robot coin-flip­per. I hope it’s clear that with such a de­vice one could ar­range that the next mil­lion flips of the coin all come up heads, and then melt it down. So what­ever “fun­da­men­tal truth” there might be about What The Coin Will Do has to be rel­a­tive to some model of what’s go­ing to be done to it. The point of coin-flip­ping is that it’s a sort of ran­dom­ness mag­nifier: small vari­a­tions in what you do to it make big­ger differ­ences to what it does, so a small patch of pos­si­bil­ity-space gets turned into a some­what-uniform sam­pling of a larger patch (cau­tion: Liou­ville, vol­ume con­ser­va­tion, etc.). And the “fun­da­men­tal truth” about the coin that you’re ap­peal­ing to is that, plus what it im­plies about its abil­ity to turn kinda-sorta-slightly-ran­dom-ish coin flip­ping ac­tions into much more ran­dom-ish out­comes. To turn that into an ac­tual ex­pec­ta­tion of (more or less) in­de­pen­dent p=1/​2 Bernoulli tri­als, you need to add some as­sump­tion about how peo­ple ac­tu­ally flip coins, and then the magic of physics means that a wide range of such as­sump­tions all lead to very similar-look­ing con­clu­sions about what the out­comes are likely to look like.

In other words: an ac­cu­rate ver­sion of the fre­quen­tist way of look­ing at the coin’s be­havi­our starts with some as­sump­tion (wher­ever it hap­pens to come from) about how coins ac­tu­ally get flipped, mixes that with some (not re­ally prob­a­bil­is­tic) facts about the coin, and ends up with a con­clu­sion about what the coin is likely to do when flipped, which doesn’t de­pend too sen­si­tively on that as­sump­tion we made.

Whereas a Bayesian way of look­ing at it starts with some as­sump­tion (wher­ever it hap­pens to come from) about what hap­pens when coins get flipped, mixes that with some (not re­ally prob­a­bil­is­tic) facts about what the coin has been ob­served to do and per­haps a bit of physics, and ends up with a con­clu­sion about what the coin is likely to do when flipped in the fu­ture, which doesn’t de­pend too sen­si­tively on that as­sump­tion we made.

Clearly the philo­soph­i­cal differ­ences here are ir­rec­on­cilable...

• … in or­der to tri­an­gu­late closer to whether Pla­ton­ism is “hard-wired”, do you find it pos­si­ble to be non-Pla­tonic about math­e­mat­i­cal truths? Can some­one who is non-Pla­tonic think about them Pla­ton­i­cally—is it a choice?

Most of the time I think about math, I do not worry about whether it is Platonic or not. It was really only in the context of considering my epistemic uncertainty that 2+2=4 that I needed to consider the nature of the territory I was mapping, and in this context it did not make sense for the territory to be the physical universe.

In con­trast, it is more nat­u­ral to me to think of the “fun­da­men­tal truth” as be­ing what the prob­a­bil­ity of heads is, as a prop­erty of the coin and the flip, since the out­come isn’t de­ter­mined yet.

You mean, the out­come has not been de­ter­mined by you, since you have not ob­served all the phys­i­cal prop­er­ties of coin, the per­son flip­ping it, and the en­vi­ron­ment, and calcu­lated out all the physics that would tell you whether it would land heads or tails. At­tach­ing a prob­a­bil­ity to the coin is just our way of deal­ing with the ig­no­rance and lack of com­put­ing power that pre­vents us from find­ing the ex­act an­swer.

• What is your point? You iter­ate the Bayesian per­spec­tive, but do you agree that fre­quen­tists and Bayesi­ans have differ­ent per­spec­tives about this?

I think it boils down to this: you are a fre­quen­tist (and I’ve been us­ing the term Pla­ton­ist) if you see the 50% prob­a­bil­ity as a prop­erty of the coin and the flip, and you are a Bayesian if you see the 50% prob­a­bil­ity as just a way of mea­sur­ing the un­cer­tainty.

(Given your ra­tio­nale for be­ing Pla­tonic about math­e­mat­ics, I don’t know if you are re­ally a Pla­ton­ist (in the hard-wired sense).)

• My point is that the view that 50% probability is a fundamental property of the coin is wrong. It is an example of the Mind Projection Fallacy: thinking that because you don’t know the result, somehow the universe doesn’t either. It is certainly not the case that, when asked about the result of a single coin flip, giving a 50% probability for heads is the best possible answer. One could, in principle, do more investigation, and find that under the current conditions the coin will come up heads (or tails) with 99% probability, and actually be right 99 times out of a hundred.

I don’t like to call this view of the probability as a fundamental property of the coin the frequentist view. It makes more sense to describe their perspective as the probability being a combined property of the coin and a distribution of conditions in which it could be flipped. From this perspective, the mistake of attaching the probability to the coin is that it misses the fact that you are flipping the coin in one particular condition, which will have a definite outcome. The probability comes from uncertainty about which condition from the distribution applies in this case, and, of course, from limits on computational power.

• Are you say­ing that fre­quen­tists are wrong, or just me?

If the former, how can you say that and con­sider the case closed when fre­quen­tists ar­rive at cor­rect con­clu­sions? What I’m sug­gest­ing is that Bayesi­ans are com­mit­ting the mind pro­jec­tion fal­lacy when they as­sert that fre­quen­tists are “wrong”.

• I am say­ing that you are wrong, and I am not sure there isn’t more to the fre­quen­tist view than you are say­ing, so I am not pre­pared to figure out if it is right or wrong un­til I know more about what it is say­ing.

If the former, how can you say that and con­sider the case closed when fre­quen­tists ar­rive at cor­rect con­clu­sions?

Like in the Monty Hall prob­lem, where the fre­quen­tists will agree to the cor­rect an­swer af­ter you beat them over the head with a com­puter simu­la­tion?
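For reference, that kind of computer simulation is only a few lines. A minimal sketch (the function name, door numbering, and trial count are my own illustrative choices):

```python
import random

def monty_hall(switch, trials=100_000):
    """Play the Monty Hall game `trials` times and return the win rate."""
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)   # door hiding the car
        pick = random.randrange(3)  # contestant's initial pick
        # The host opens a door that is neither the pick nor the car.
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # Switch to the one remaining unopened door.
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(monty_hall(switch=False))  # stays near 1/3
print(monty_hall(switch=True))   # stays near 2/3
```

Whatever one’s school, the long-run win rates settle near 1/3 for staying and 2/3 for switching, which is the beating-over-the-head the comment alludes to.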

What I’m sug­gest­ing is that Bayesi­ans are com­mit­ting the mind pro­jec­tion fal­lacy when they as­sert that fre­quen­tists are “wrong”.

Huh? What prop­erty of our minds do you think we are pro­ject­ing onto the ter­ri­tory?

• In the Monty Hall problem, it is intuition that tends to insist on the wrong answer, not any valid application of frequentist theory.

Just curious—is the Monty Hall solution intuitively obvious to a “Bayesian”, or do they also need to work through the (Bayesian) math in order to be convinced?

Huh? What prop­erty of our minds do you think we are pro­ject­ing onto the ter­ri­tory?

Oops. I meant the typ­i­cal mind fal­lacy.

• Just curious—is the Monty Hall solution intuitively obvious to a “Bayesian”, or do they also need to work through the (Bayesian) math in order to be convinced?

For me at least, it is not so much that the solu­tion is in­tu­itively ob­vi­ous as that set­ting up the Bayesian math forces me to ask the im­por­tant ques­tions.

I meant the typ­i­cal mind fal­lacy.

Then how do you think we are as­sum­ing that oth­ers think like us? It seems to me that we no­tice that oth­ers are not think­ing like us, and that in this case, the differ­ent think­ing is an er­ror. I be­lieve that 2+2=4, and if I said that some­one was wrong for claiming that 2+2=3, that would not be a typ­i­cal mind fal­lacy.

• If the conclusions about reality were different, then the 2+2=4 versus 2+2=3 analogy would hold. Instead, you are objecting to the way frequentists approach the problem. (Sometimes, the difference seems to be as subtle as just the way they describe their approach.) Unless you can show that they do not arrive at the correct answer just as consistently, I think that objecting to their methods is the typical mind fallacy.

Asserting that frequentists are wrong is actually very non-Bayesian, because you have no evidence that the frequentist view is illogical. Only your intuition and logic guide you here. So finally, as two rationalists, we may observe a bona fide difference in what we consider intuitive, natural or logical.

I’m cu­ri­ous about the fre­quency of “nat­u­ral” Bayesi­ans and fre­quen­tists in the pop­u­la­tion, and won­der about their co-evolu­tion. I also won­der about their lack of mu­tual un­der­stand­ing.

• You have a coin. The coin is bi­ased. You don’t know which way it’s bi­ased or how much it’s bi­ased. Some­one just told you, “The coin is bi­ased” and that’s all they said. This is all the in­for­ma­tion you have, and the only in­for­ma­tion you have.

You draw the coin forth, flip it, and slap it down.

Now—be­fore you re­move your hand and look at the re­sult—are you will­ing to say that you as­sign a 0.5 prob­a­bil­ity to the coin hav­ing come up heads?

The fre­quen­tist says, “No. Say­ing ‘prob­a­bil­ity 0.5’ means that the coin has an in­her­ent propen­sity to come up heads as of­ten as tails, so that if we flipped the coin in­finitely many times, the ra­tio of heads to tails would ap­proach 1:1. But we know that the coin is bi­ased, so it can have any prob­a­bil­ity of com­ing up heads ex­cept 0.5.”

The frequentists get this exactly wrong, ruling out the only correct answer given their knowledge of the situation.

The ar­ti­cle goes on to de­scribe sce­nar­ios in which hav­ing differ­ent par­tial knowl­edge to the situ­a­tion leads to differ­ent prob­a­bil­ities. The fre­quen­tist per­spec­tive doesn’t merely lead to the wrong an­swer for these sce­nar­ios, it fails to even pro­duce a co­her­ent anal­y­sis. Be­cause there is no sin­gle prob­a­bil­ity at­tached to the event it­self. The prob­a­bil­ity re­ally is a prop­erty of the mind an­a­lyz­ing that event, to the ex­tent that it is sen­si­tive to the par­tial knowl­edge of that mind.

• I like the re­sponse of Con­stant2:

The com­pe­tent fre­quen­tist would pre­sum­ably not be be­fud­dled by these sup­posed para­doxes. Since he would not be be­fud­dled (or so I am fairly cer­tain), the “para­doxes” fail to prove the su­pe­ri­or­ity of the Bayesian ap­proach.

Eliezer re­sponded with:

Not the last two para­doxes, no. But the first case given, the bi­ased coin whose bias is not known, is in­deed a clas­sic ex­am­ple of the differ­ence be­tween Bayesi­ans and fre­quen­tists.

and in the post he wrote

The fre­quen­tist per­spec­tive doesn’t merely lead to the wrong an­swer for these sce­nar­ios, it fails to even pro­duce a co­her­ent anal­y­sis.

But the fre­quen­tist does have a co­her­ent anal­y­sis for solv­ing this prob­lem. Be­cause we’re not ac­tu­ally in­ter­ested in the long-term prob­a­bil­ity of flip­ping heads (of which all any­one can say is that it is not .5) but the ex­pected out­come of a sin­gle flip of a bi­ased coin. This is an ex­pected value calcu­la­tion, and I’ll even ap­ply your idea about events with sym­met­ric al­ter­na­tives. (So I do not have to make any as­sump­tions about the shape of the dis­tri­bu­tion of pos­si­ble bi­ases.)

I will calcu­late my ex­pected value us­ing that the coin is bi­ased to­wards heads or it is bi­ased to­wards tails with equal prob­a­bil­ity. Let p be the prob­a­bil­ity that the coin flips to the bi­ased ori­en­ta­tion (i.e., p>.5).

• With probability 0.5, the coin is biased towards heads: the probability of heads is p, and the joint probability of this case and tails is (1−p)·0.5.

• With probability 0.5, the coin is biased towards tails: the probability of heads is 1−p, and the joint probability of this case and tails is p·0.5.

Thus, the expected value of heads is p·0.5 + (1−p)·0.5 = 0.5.
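That symmetry argument can be checked numerically. A minimal sketch, assuming only that the coin favours one of its sides with some probability p > .5 and that which side it favours is a fair fifty-fifty (the function name and trial count are illustrative):

```python
import random

def flip_biased_coin(p, trials=100_000):
    """The coin favours one side with probability p (> 0.5). Which side it
    favours is unknown, modeled here as heads or tails with equal
    probability, freshly drawn each trial to simulate the marginal
    distribution. Returns the observed fraction of heads."""
    heads = 0
    for _ in range(trials):
        p_heads = p if random.random() < 0.5 else 1.0 - p
        heads += (random.random() < p_heads)
    return heads / trials

print(flip_biased_coin(0.9))  # hovers around 0.5, whatever p is
```

The observed fraction of heads hovers around 0.5 no matter how extreme p is, matching the p·0.5 + (1−p)·0.5 calculation.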

So there’s no befuddlement, only a change of random variable: from the long-run frequency over many flips to the outcome of a single flip of a coin whose preferred side is unknown. Which is what we should expect, since the random variable we are really being asked about has changed with the different contexts.

• You just pushed aside your no­tion of an ob­jec­tive prob­a­bil­ity and calcu­lated a sub­jec­tive prob­a­bil­ity re­flect­ing your par­tial in­for­ma­tion. Con­grat­u­la­tions, you are a Bayesian.

• I ap­plied com­pletely or­tho­dox fre­quen­tist prob­a­bil­ity.

I had pre­dicted your ob­jec­tion would be that ex­pected value is an ap­pli­ca­tion of Bayes’ the­o­rem, but I was pre­pared to ar­gue that or­tho­dox prob­a­bil­ity does in­clude Bayes’ the­o­rem. It is one of the pillars of any in­tro­duc­tory prob­a­bil­ity text­book.

A problem isn’t “Bayesian” or “frequentist”; the approach is. Frequentists take the priors as given assumptions. The assumptions are incorporated at the beginning as part of the context of the problem, and we know the objective solution depends upon (and is defined within) a given context. A Bayesian, in contrast, has a different perspective and doesn’t require formalizing the priors as given assumptions. Apparently they are comfortable with asserting that the priors are “subjective”. As a frequentist, I would have to say that the problem is ill-posed (or under-determined) to the extent that the priors/assumptions are really subjective.

Suppose that I tell you I am going to pick a card at random and will ask you the probability that it is the ace of hearts. Your correct answer would be 1/52, even if I look at the card myself and know with probability 0 or 1 that the card is the ace of hearts. Frequentists have no problem with this “subjectivity”; they understand it as different probabilities for different contexts. This is mainly a response to this comment, but is relevant here.

Yet again, the misunderstanding has arisen from not understanding what is meant by the probability being “in” the cards. In this way, Bayesians interpret the frequentist’s language too literally. But what does a frequentist actually mean? Just that the probability is objective? But the objectivity results from the preferred way of framing the problem… I’m willing to consider, and have suggested, the possibility that this “Platonic probability” is an artifact of a thought process that the frequentist experiences empirically (but mentally).

• I’m Pla­ton­is­tic in gen­eral I sup­pose, but I see Bayesi­anism as sub­jec­tively ob­jec­tive as a Pla­ton­is­tic truth.

• As a prop­erty of the coin and the flip and the en­vi­ron­ment and the laws of physics, the prob­a­bil­ity of heads is ei­ther 0 or 1. Just be­cause you haven’t com­puted it doesn’t mean the an­swer be­comes a su­per­po­si­tion of what you might com­pute, or some­thing.

What you want is something like the result of taking a natural generalization of the exact situation (if the universe is continuous and the system is chaotic enough, “round to some precision” works), then computing the answer in this parameterized space of situations, and then averaging over the parameter.

The prob­lem is that “nat­u­ral gen­er­al­iza­tion” is pretty hard to define.

• Be­ing a Pla­ton­ist and a fre­quen­tist aren’t the same thing, but they cor­re­late be­cause they’re both er­rors in think­ing.

The objection to frequentism is that it builds the answer into the solution, so the problem actually changes from the original real-world problem. This is fine as long as you can test discrepancies between theory and practice, but that’s not always going to be possible.

• “A Bayesian, in con­trast, be­lieves that the re­al­iza­tion is the pri­mary thing … the flip­ping of the coin yields the prop­erty of hav­ing 50% prob­a­bil­ity of com­ing up heads as you flip it.”

Thanks for try­ing to ex­plain the differ­ence, but I have no idea what this means.

• What I was think­ing about was this: Bayesi­ans and fre­quen­tists both agree that if a fair coin is tossed n times (where n is very large) then a string of heads and tails will re­sult and the prob­a­bil­ity of heads is .5 in some way re­lated to the fact that the num­ber of heads di­vided by n will ap­proach .5 for large n.

In my mind, the fre­quen­tist per­spec­tive is that the .5 prob­a­bil­ity of get­ting heads ex­ists first, and then the string of heads and tails re­al­ize (i.e., make a phys­i­cal man­i­fes­ta­tion of) this ab­stract prob­a­bil­ity lurk­ing in the back­ground. As though there is a bin of heads and tails some­where with ex­actly a 1:1 ra­tio and each flip picks ran­domly from this bin. The Bayesian per­spec­tive is that there is noth­ing but the string of heads and tails—only the string ex­ists, there’s no ab­stract prob­a­bil­ity that the string is a re­al­iza­tion of. No pick­ing from a bin in the sky. In­spect­ing the string, a Bayesian can calcu­late the 0.5 prob­a­bil­ity … so the 0.5 prob­a­bil­ity re­sults from the string. So ac­cord­ing to me, the philo­soph­i­cal de­bate boils down to: what comes first, the prob­a­bil­ity or the string?
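The one point both camps agree on, that the observed fraction of heads approaches .5 as n grows, is easy to exhibit directly. A minimal sketch (the function name and sample sizes are illustrative):

```python
import random

def heads_fraction(n, seed=0):
    """Fraction of heads in n flips of a fair coin (seeded for repeatability)."""
    rng = random.Random(seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n

for n in (100, 10_000, 1_000_000):
    print(n, heads_fraction(n))  # the fraction closes in on 0.5 as n grows
```

The string of flips is all anyone ever observes; whether the 0.5 exists before the string or only results from it is exactly the question at issue.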

I definitely get the im­pres­sion that the Bayesi­ans in this thread are skep­ti­cal of this de­scrip­tion of the differ­ence, and seem to pre­fer de­scribing the differ­ence of the Bayesian view as con­sid­er­ing prob­a­bil­ity a mea­sure of your un­cer­tainty. How­ever, prob­a­bil­ity is also taught as a mea­sure of un­cer­tainty in clas­si­cal prob­a­bil­ity, so I’m skep­ti­cal of this di­chotomy. (In fa­vor of my view, the name “fre­quen­tist” comes from the ob­ser­va­tion that they be­lieve in a no­tion of “fre­quency”—i.e., that there’s a hy­po­thet­i­cal dis­tri­bu­tion “out there” that ob­served data is be­ing sam­pled from.)

Per­haps the differ­ence in whether the cor­rect ap­proach is sub­jec­tive or ob­jec­tive bet­ter gets to the heart of the differ­ence. I am lean­ing to­wards this hy­poth­e­sis be­cause I can see how a fre­quen­tist can con­fuse some­thing be­ing ob­jec­tive with that some­thing hav­ing an in­de­pen­dent “ex­is­tence”.

• I have a little difficulty with the notion that the probable outcome of a coin toss is the result of the toss, rather like the collapse of a quantum probability into reality when observed. Looking at the coin before the toss, surely three possible outcomes may be objectively observed: H, T or E, and the likelihood of the coin coming to rest on its edge can be dismissed.

Since the coin MUST then end up H or T, the sum of both probabilities is 1; both outcomes are a priori equally likely and have the value 1/2 before the toss. Whether one chooses to believe that the a priori probabilities have actual existence is a metaphysical issue.