Frequentist Magic vs. Bayesian Magic

[I posted this to open thread a few days ago for review. I’ve only made some minor editorial changes since then, so no need to read it again if you’ve already read the draft.]

This is a belated reply to cousin_it’s 2009 post Bayesian Flame, which claimed that frequentists can give calibrated estimates for unknown parameters without using priors:

And here’s an ultra-short example of what frequentists can do: estimate 100 independent unknown parameters from 100 different sample data sets and have 90 of the estimates turn out to be true to fact afterward. Like, fo’real. Always 90% in the long run, truly, irrevocably and forever.

And indeed they can. Here’s the simplest example that I can think of that illustrates the spirit of frequentism:

Suppose there is a machine that produces biased coins. You don’t know how the machine works, except that each coin it produces is either biased towards heads (in which case each toss of the coin will land heads with probability .9 and tails with probability .1) or towards tails (in which case each toss of the coin will land tails with probability .9 and heads with probability .1). For each coin, you get to observe one toss, and then have to state whether you think it’s biased towards heads or tails, and with what probability you think that answer is right.

Let’s say that you decide to follow this rule: after observing heads, always answer “the coin is biased towards heads with probability .9” and after observing tails, always answer “the coin is biased towards tails with probability .9”. Do this for a while, and it will turn out that 90% of the time you are right about which way the coin is biased, no matter how the machine actually works. The machine might always produce coins biased towards heads, or always towards tails, or decide based on the digits of pi, and it wouldn’t matter: you’ll still be right 90% of the time. (To verify this, notice that in the long run you will answer “heads” for 90% of the coins actually biased towards heads, and “tails” for 90% of the coins actually biased towards tails.) No priors needed! Magic!
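This calibration claim is easy to check numerically. Here is a minimal simulation sketch (the particular machines and the coin count are my own illustrative choices, not from the post): whatever rule the machine uses to pick each coin’s bias, guessing the side you observed is right about 90% of the time.

```python
import random

random.seed(0)

def toss(bias_heads):
    """One toss of a coin biased .9 towards its bias direction."""
    p_heads = 0.9 if bias_heads else 0.1
    return random.random() < p_heads

def frequentist_run(machine, n=100_000):
    """Follow the rule: guess whichever side the single observed toss shows.
    Returns the fraction of coins whose bias was guessed correctly."""
    correct = 0
    for i in range(n):
        bias_heads = machine(i)   # the machine decides this coin's bias
        saw_heads = toss(bias_heads)
        guess_heads = saw_heads   # the frequentist rule
        correct += (guess_heads == bias_heads)
    return correct / n

# Three very different machines -- the hit rate is ~.9 for all of them.
all_heads   = lambda i: True
all_tails   = lambda i: False
alternating = lambda i: i % 2 == 0

for machine in (all_heads, all_tails, alternating):
    print(round(frequentist_run(machine), 3))
```

Each printed hit rate comes out near .9, regardless of which machine generated the coins.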

What is going on here? There are a couple of things we could say. One was mentioned by Eliezer in a comment:

It’s not perfectly reliable. They assume they have perfect information about experimental setups and likelihood ratios. (Where does this perfect knowledge come from? Can Bayesians get their priors from the same source?)

In this example, the “perfect information about experimental setups and likelihood ratios” is the information that a biased coin will land the way it’s biased with probability .9. I think this is a valid criticism, but it’s not complete. There are perhaps many situations where we have much better information about experimental setups and likelihood ratios than about the mechanism that determines the unknown parameter we’re trying to estimate. This criticism leaves open the question of whether it would make sense to give up Bayesianism for frequentism in those situations.

The other thing we could say is that while the frequentist in this example appears to be perfectly calibrated, he or she is liable to pay a heavy cost for this in accuracy. For example, suppose the machine is actually set up to always produce head-biased coins. After observing the coin tosses for a while, a typical intelligent person, just applying common sense, would notice that 90% of the tosses come up heads, and infer that perhaps all the coins are biased towards heads. They would become more certain of this with time, and adjust their answers accordingly. But the frequentist would not (or isn’t supposed to) notice this. He or she would answer “the coin is head-biased with probability .9” 90% of the time, and “the coin is tail-biased with probability .9” 10% of the time, and keep doing this, irrevocably and forever.

The frequentist magic turns out to be weaker than it first appeared. What about the Bayesian solution to this problem? Well, we know that it must involve a prior, so the only question is which one. The maximum entropy prior that is consistent with the information given in the problem statement is to assign each coin an independent probability of .5 of being head-biased, and .5 of being tail-biased. It turns out that a Bayesian using this prior will give the exact same answers as the frequentist, so this is also an example of a “matching prior”. (To verify: P(OH) = P(OH|BH)*P(BH) + P(OH|BT)*P(BT) = .9*.5 + .1*.5 = .5, so P(biased heads | observed heads) = P(OH|BH)*P(BH)/P(OH) = .9*.5/.5 = .9.)
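The same verification, written out in a few lines (a sketch; the variable names are mine):

```python
# Posterior that the coin is head-biased given one observed head,
# under the maximum-entropy prior P(BH) = P(BT) = .5.
p_bh = 0.5             # prior: coin is head-biased
p_oh_given_bh = 0.9    # a head-biased coin lands heads
p_oh_given_bt = 0.1    # a tail-biased coin lands heads

# Total probability of observing heads on the single toss.
p_oh = p_oh_given_bh * p_bh + p_oh_given_bt * (1 - p_bh)   # = .5

# Bayes' rule: matches the frequentist's stated .9.
posterior = p_oh_given_bh * p_bh / p_oh
print(posterior)
```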

But a Bayesian can do much better. A Bayesian can use a universal prior. (With a universal prior based on a universal Turing machine, the prior probability that the first 4 coins will be biased “heads, heads, tails, tails” is the probability that the UTM will produce 1100 as the first 4 bits of its output, when given a uniformly random input tape.) Using such a prior guarantees that no matter how the coin-producing machine works, as long as it doesn’t involve some kind of uncomputable physics, in the long run your expected total Bayes score will be no worse than that of someone who knows exactly how the machine works, except by a constant (determined by the algorithmic complexity of the machine). And unless the machine actually settles into deciding the bias of each coin independently with 50/50 probabilities, your expected Bayes score will also be better than the frequentist’s (or that of a Bayesian using the matching prior) by an unbounded margin as time goes to infinity.
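The true universal prior is uncomputable, but its advantage can be illustrated with a toy stand-in (my own simplification, not from the post): a Bayesian mixture over just three explicit hypotheses about the machine, updated after each coin. When the machine in fact always produces head-biased coins, the mixture learner’s cumulative log score pulls ahead of the matching prior’s by a margin that grows with the number of coins, as the post’s last claim says.

```python
import math
import random

random.seed(0)

# Toy stand-in for the universal prior: a mixture over three hypotheses
# about the machine, each giving P(a toss lands heads | hypothesis).
# (The real universal prior mixes over *all* computable machines.)
HYPOTHESES = {
    "all head-biased":   0.9,
    "all tail-biased":   0.1,
    "independent 50/50": 0.5,
}

def log_scores(n=10_000, p_heads=0.9):
    """The machine secretly makes every coin head-biased. Compare the
    cumulative log (Bayes) score of the mixture learner against the
    fixed matching prior, which always predicts heads with prob .5."""
    posterior = {h: 1 / len(HYPOTHESES) for h in HYPOTHESES}
    mixture_score = matching_score = 0.0
    for _ in range(n):
        heads = random.random() < p_heads
        # Mixture's predictive probability of heads for this coin's toss.
        p_mix = sum(posterior[h] * HYPOTHESES[h] for h in HYPOTHESES)
        mixture_score += math.log(p_mix if heads else 1 - p_mix)
        matching_score += math.log(0.5)
        # Bayesian update of the posterior over machine hypotheses.
        for h, p in HYPOTHESES.items():
            posterior[h] *= p if heads else 1 - p
        total = sum(posterior.values())
        for h in posterior:
            posterior[h] /= total
    return mixture_score, matching_score

mix, match = log_scores()
print(mix > match)   # the mixture learner's total log score is higher
```

The mixture starts out predicting .5, just like the matching prior, but after a few coins its posterior concentrates on “all head-biased” and it predicts heads with probability near .9; from then on it gains roughly a constant amount of log score per coin, so the gap grows without bound.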

I consider this magic also, because I don’t really understand why it works. Is our prior actually a universal prior, or is the universal prior just a handy approximation that we can substitute in place of the real prior? Why does the universe that we live in look like a giant computer? What about uncomputable physics? Just what are priors, anyway? These are some of the questions that I’m still confused about.

But as long as we’re choosing between different magics, why not pick the stronger one?