# How Much Evidence Does It Take?

Previously, I defined evidence as “an event entangled, by links of cause and effect, with whatever you want to know about,” and entangled as “happening differently for different possible states of the target.” So how much entanglement—how much rational evidence—is required to support a belief?

Let’s start with a question simple enough to be mathematical: How hard would you have to entangle yourself with the lottery in order to win? Suppose there are seventy balls, drawn without replacement, and six numbers to match for the win. Then there are 131,115,985 possible winning combinations, hence a randomly selected ticket would have a 1/131,115,985 probability of winning (0.0000007%). To win the lottery, you would need evidence selective enough to visibly favor one combination over 131,115,984 alternatives.

Suppose there are some tests you can perform which discriminate, probabilistically, between winning and losing lottery numbers. For example, you can punch a combination into a little black box that always beeps if the combination is the winner, and has only a 1/4 (25%) chance of beeping if the combination is wrong. In Bayesian terms, we would say the likelihood ratio is 4 to 1. This means that the box is 4 times as likely to beep when we punch in a correct combination, compared to how likely it is to beep for an incorrect combination.

There are still a whole lot of possible combinations. If you punch in 20 incorrect combinations, the box will beep on 5 of them by sheer chance (on average). If you punch in all 131,115,985 possible combinations, then while the box is certain to beep for the one winning combination, it will also beep for 32,778,996 losing combinations (on average).
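The expected counts above are easy to verify; a quick sketch in Python (the numbers are taken straight from the text):

```python
# One black box: it always beeps for the winner, and beeps with
# probability 1/4 for any losing combination.
losing = 131_115_984      # total combinations minus the one winner
false_beep = 1 / 4

# Punching in 20 incorrect combinations: 5 beeps on average.
beeps_from_20 = 20 * false_beep

# Punching in every combination: the winner beeps for certain,
# plus about 32.8 million false beeps on average.
false_beeps_all = losing * false_beep

print(beeps_from_20)      # 5.0
print(false_beeps_all)    # 32778996.0
```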

So this box doesn’t let you win the lottery, but it’s better than nothing. If you used the box, your odds of winning would go from 1 in 131,115,985 to 1 in 32,778,997. You’ve made some progress toward finding your target, the truth, within the huge space of possibilities.

Suppose you can use another black box to test combinations twice, independently. Both boxes are certain to beep for the winning ticket. But the chance of a box beeping for a losing combination is 1/4 independently for each box; hence the chance of both boxes beeping for a losing combination is 1/16. We can say that the cumulative evidence, of two independent tests, has a likelihood ratio of 16:1. The number of losing lottery tickets that pass both tests will be (on average) 8,194,749.

Since there are 131,115,985 possible lottery tickets, you might guess that you need evidence whose strength is around 131,115,985 to 1—an event, or series of events, which is 131,115,985 times more likely to happen for a winning combination than a losing combination. Actually, this amount of evidence would only be enough to give you an even chance of winning the lottery. Why? Because if you apply a filter of that power to 131 million losing tickets, there will be, on average, one losing ticket that passes the filter. The winning ticket will also pass the filter. So you’ll be left with two tickets that passed the filter, only one of them a winner. Fifty percent odds of winning, if you can only buy one ticket.

A better way of viewing the problem: In the beginning, there is 1 winning ticket and 131,115,984 losing tickets, so your odds of winning are 1:131,115,984. If you use a single box, the odds of it beeping are 1 for a winning ticket and 0.25 for a losing ticket. So we multiply 1:131,115,984 by 1:0.25 and get 1:32,778,996. Adding another box of evidence multiplies the odds by 1:0.25 again, so now the odds are 1 winning ticket to 8,194,749 losing tickets.
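The odds arithmetic in this paragraph is just repeated multiplication by a likelihood ratio; a minimal sketch (the function name is mine, not the author's):

```python
def apply_box(win_odds, lose_odds, p_beep_win=1.0, p_beep_lose=0.25):
    """Multiply prior odds by the likelihood of each side producing a beep."""
    return win_odds * p_beep_win, lose_odds * p_beep_lose

odds = (1, 131_115_984)        # prior: 1 winning ticket to 131,115,984 losers
odds = apply_box(*odds)        # one box:   1 : 32,778,996
odds = apply_box(*odds)        # two boxes: 1 : 8,194,749
print(odds)                    # (1.0, 8194749.0)
```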

It is convenient to measure evidence in bits—not like bits on a hard drive, but mathematician’s bits, which are conceptually different. Mathematician’s bits are the logarithms, base 1/2, of probabilities. For example, if there are four possible outcomes A, B, C, and D, whose probabilities are 50%, 25%, 12.5%, and 12.5%, and I tell you the outcome was “D,” then I have transmitted three bits of information to you, because I informed you of an outcome whose probability was 1/8.

It so happens that 131,115,984 is slightly less than 2 to the 27th power. So 14 boxes or 28 bits of evidence—an event 268,435,456:1 times more likely to happen if the ticket-hypothesis is true than if it is false—would shift the odds from 1:131,115,984 to 268,435,456:131,115,984, which reduces to 2:1. Odds of 2 to 1 mean two chances to win for each chance to lose, so the probability of winning with 28 bits of evidence is 2/3. Adding another box, another 2 bits of evidence, would take the odds to 8:1. Adding yet another two boxes would take the odds of winning to 128:1.
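The bit arithmetic here can be checked with `log2`; a quick sketch:

```python
from math import log2

losing = 131_115_984
print(log2(losing))               # about 26.97 -- just under 27 bits

bits_per_box = log2(4)            # a 4:1 likelihood ratio is 2 bits
bits = 14 * bits_per_box          # 14 boxes = 28 bits

posterior_odds = 2**28 / losing   # about 2.05 : 1, i.e. roughly 2:1
p_win = 2**28 / (2**28 + losing)  # about 0.67, roughly 2/3
```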

So if you want to license a strong belief that you will win the lottery—arbitrarily defined as less than a 1% probability of being wrong—34 bits of evidence about the winning combination should do the trick.
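That 34 bits suffice (and 33 do not quite) is a one-line check:

```python
losing = 131_115_985 - 1          # 131,115,984 losing tickets

def p_wrong(bits):
    """Posterior probability of holding a loser after `bits` bits of evidence."""
    return losing / (2**bits + losing)

print(p_wrong(34))    # about 0.0076 -- under the 1% threshold
print(p_wrong(33))    # about 0.0150 -- not quite enough
```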

In general, the rules for weighing “how much evidence it takes” follow a similar pattern: The larger the space of possibilities in which the hypothesis lies, or the more unlikely the hypothesis seems a priori compared to its neighbors, or the more confident you wish to be, the more evidence you need.

You cannot defy the rules; you cannot form accurate beliefs based on inadequate evidence. Let’s say you’ve got 10 boxes lined up in a row, and you start punching combinations into the boxes. You cannot stop on the first combination that gets beeps from all 10 boxes, saying, “But the odds of that happening for a losing combination are a million to one! I’ll just ignore those ivory-tower Bayesian rules and stop here.” On average, 131 losing tickets will pass such a test for every winner. Considering the space of possibilities and the prior improbability, you jumped to a too-strong conclusion based on insufficient evidence. That’s not a pointless bureaucratic regulation; it’s math.
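The round figures in this paragraph (“a million to one,” “131 losing tickets,” the latter apparently from dividing 131 million by one million) can be made exact; a quick check:

```python
losing = 131_115_984
pass_rate = (1 / 4) ** 10          # chance a loser beeps on all 10 boxes

print(round(1 / pass_rate))        # 1048576 -- the "million to one"
print(losing * pass_rate)          # about 125 losing tickets pass, on average

# So the first all-beeps combination has roughly 1 chance in 126 of
# being the winner: strong evidence, but far from a confident belief.
```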

Of course, you can still believe based on inadequate evidence, if that is your whim; but you will not be able to believe accurately. It is like trying to drive your car without any fuel, because you don’t believe in the fuddy-duddy concept that it ought to take fuel to go places. Wouldn’t it be so much more fun, and so much less expensive, if we just decided to repeal the law that cars need fuel?

Well, you can try. You can even shut your eyes and pretend the car is moving. But really arriving at accurate beliefs requires evidence-fuel, and the further you want to go, the more fuel you need.

• Anders, I’m not sure I’d agree with that, because of publication bias. I’d feel much better about a single experiment that reported p < 0.001 than three experiments that reported p < 0.05.

• I’d be happy to buy lots of lottery tickets that had a 1/132 chance of winning, given the typical payoff structure of lotteries of the kind you describe.

To act rationally, it isn’t enough to arrive at the correct (probabilities of) beliefs; note also that the degree of belief you need in something in order to act on it might not be very great.

Given the strong tendency to collapse all degrees of belief into a two-point scale (yea or nay), I suspect that our intuitions about how much one has to believe in something in order to act accordingly are often too stringent, since the actual strengths of our beliefs are so often much too large.

(Note: “often” doesn’t mean “always” or even “usually”.)

• Of course, acting on beliefs is a decision theory matter. You don’t have terribly much to lose by buying a losing lottery ticket, but you have a very large amount to gain if it wins, so yes, a 1/132 chance of winning sounds well worth $20 or so.

• Sorry, ignore my erratum above, I was wrong. I mixed up odds and probability; they are different things.

• This also shows why independently replicated scientific experiments (more independent boxes) are more important than experiments with high p-values (boxes with better likelihood ratios).

• But the p-values go exponentially close to one with the size of the study. If you had three studies that used 11 boxes, vs. one with 33, you’d get exactly the same posterior probability for the ticket being a winner. In other words, more experiments are exponentially more valuable than higher p-values, but higher p-values are exponentially cheaper.

• “It is convenient to measure evidence in bits—not like bits on a hard drive, but mathematician’s bits, which are conceptually different. Mathematician’s bits are the logarithms, base 1/2, of probabilities. For example, if there are four possible outcomes A, B, C, and D, whose probabilities are 50%, 25%, 12.5%, and 12.5%, and I tell you the outcome was ‘D’, then I have transmitted three bits of information to you, because I informed you of an outcome whose probability was 1/8.” Here you say that bits = log(P(E|H)/P(E)). Everywhere else, you used bits = log(P(E|H)/P(E|!H)). They’re very different.

• Compare to this complaint heard in a fictitious physics classroom: “Now you say joules = 1/2 m v^2. But earlier you said joules = G m1 m2 / r, and next you are going to say joules = m c^2. They are very different.”

• In the example I cited, P(I tell you outcome is D | outcome is D) = 1 and P(I tell you outcome is D | outcome is not D) = 0 (roughly). Thus log(P(E|H)/P(E)) = 3 and log(P(E|H)/P(E|!H)) = infinity. Log is base 1/2. Probability-bits and odds-ratio-bits really are very different units, and Eliezer confusingly described them as the same thing. They are not interchangeable like 1/2 m v^2, G m1 m2 / r, and m c^2.

• I may be missing something here (and the karma voting patterns suggest that I am). But I will repeat my claim—perhaps with more clarity: Bits are bits, just as joules are joules. But just as you can use joules as a unit to quantify different kinds of energy (kinetic, potential, relativistic), you can use bits as a unit to quantify different kinds of information (log odds ratio, log likelihood ratio, channel capacity in some fixed amount of time, entropy of a message source). Each of these kinds of information is measured in the same unit—bits. You can measure evidence in bits, and you can measure the information content of the answer to a question in bits. The two are calculated using different formulas, because they are different things. Just as potential and kinetic energy are different things.

• You are correct that bits can be used to measure different things. The problem here is that probabilities and odds ratios describe the exact same thing in different ways. A joule of potential energy is not the same thing as a joule of kinetic energy, but they can be converted to each other at a 1:1 ratio. A probability-bit measures the same thing as an odds-ratio-bit, but is a different quantity (a probability-bit is always greater than 1 odds-ratio-bit, and can be up to infinity odds-ratio-bits). A “bit of evidence” does not unambiguously tell someone whether you mean probability-bit or odds-ratio-bit, and Eliezer does not distinguish between them properly. 1 probability bit in favor of a hypothesis gives you a posterior probability of 1/2^(n-1) from a prior of 1/2^n. n probability bits gives you a posterior of 1 from the same prior. 1 odds-ratio bit in favor of a hypothesis gives you a posterior odds ratio of 1:2^(n-1) from a prior of 1:2^n. n odds-ratio bits give you a posterior odds ratio of 1:1 (probability 1/2) from the same prior. It takes infinitely many odds-ratio bits to give you a posterior probability of 1. As the prior probability approaches 0, the types of bits become interchangeable.

• Clearly you understand me now, and I think that I understand you. “A ‘bit of evidence’ does not unambiguously tell someone whether you mean probability-bit or odds-ratio-bit, and Eliezer does not distinguish between them properly.” OK, if what is at issue here is whether Eliezer was sufficiently clear, then I’ll bow out. Obviously, he was not sufficiently clear from your viewpoint. I will say, though, that your comment is the first time I have seen the word “evidence” used by a Bayesian for anything other than a log odds ratio. Log odds evidence has the virtue that it is additive (when independent). On the other hand, your idea of a log probability meaning of “evidence” has the virtue that a question can be decided by a finite amount of evidence.

• “I will say, though, that your comment is the first time I have seen the word ‘evidence’ used by a Bayesian for anything other than a log odds ratio.” Eliezer used it to mean log probability in the section that I quoted. That was what I was complaining about.

• Ok, I think you are misinterpreting, but I see what you mean. When EY writes: “...I have transmitted three bits of information to you, because I informed you of an outcome whose probability was 1/8,” I take this as illustrating the definition of bits in general, rather than bits of “evidence”. But, yes, I agree with you now that placing that explanation in a paragraph with that lead sentence promising a definition of “evidence”—well, it definitely could have been written more clearly.

• “131,115,985 to 1 [...] this amount of evidence would only be enough to give you an even chance of winning the lottery.” The number of false bleeps is distributed almost exactly Poisson with $\lambda = 1$. The important figure is not the expected number of bleeps ($\mathbf{E}x + 1$, which is indeed 2). It’s the expected probability that a random bleep is the true one, $\mathbf{E}\tfrac{1}{x+1}$. At the moment I can’t find an analytic solution (and a short search suggests none is known), but a computation shows the result is around 63.2%, much better than 50%. Similarly, with 14 boxes (arguably “28 bits of evidence”), the chance of winning is about 79.1% on average, much better than $\tfrac{2}{3}$.

• The lottery is a good example, but the large numbers make it hard to follow the math without a calculator. Is there a simpler example you could add, with lower numbers, that we can hold in our heads?

• Byrnema hosted an IRC meeting about this post and I uploaded a transcript of the conversation on the wiki. If this was the wrong place to put the transcript, let me know and I will move it. The conversation went pretty well, in my opinion, and we plan on having a similar one next week.

• Yes, publication bias matters. But it also applies to the p < 0.001 experiment—if we have just a single publication, should we believe that the effect is true and just one group has done the experiment, or that the effect is false and publication bias has prevented the publication of the negative results? If we had a few experiments (even with different results) it would be easier to estimate this than in the one-published-experiment case.

• Let’s do a check. Assume a worst-case scenario where nobody publishes negative results at all. To get three p < 0.05 studies if the hypothesis is false requires on average 60 experiments. This is a lot, but is within the realms of possibility if the issue is one which many people are interested in, so there are still grounds for scepticism of this result. To get one p < 0.001 study if the hypothesis is false requires on average 1,000 experiments. This is pretty implausible, so I would be much happier to treat this result as an indisputable fact, even in a field with many vested interests (assuming everything else about the experiment is sound).

• Running “1,000 experiments,” if you don’t have to publish negative results, can mean just slicing data until you find something. Someone with a large data set can do this 100% of the time. A replication is more informative, because it’s not subject to nearly as much “find something new and publish it” bias.

• This is assuming proper methodology and statistics, so that the p-value actually matches the chance of the result arising by chance. In practice, since even your best judgment of the methodology is not going to account for certainty in the soundness of the experiment, I would say that a p-value of 0.001 constitutes considerably less than 10 bits of evidence, because the odds that something was wrong with the experiment are better than the odds that the results were coincidental. Multiple experiments with lower cumulative p-value can still be stronger evidence if they all make adjustments to account for possible sources of error.

• “To get one p < 0.0001 study if the hypothesis is false requires on average 1000 experiments.” One too many zeros in the p value there. The 1,000 figure matches p < 0.001, which is also what Anders mentioned. (So your point is fine.)
• Thanks

• “Let’s say you’ve got 10 boxes lined up in a row, and you start punching combinations into the boxes. You cannot stop on the first combination that gets beeps from all 10 boxes, saying, ‘But the odds of that happening for a losing combination are a million to one! I’ll just ignore those ivory-tower Bayesian rules and stop here.’ On average, 131 losing tickets will pass such a test for every winner.” Huh?

• Just to be clear, for my sake: the log_2 of the likelihood ratio is how many bits that piece of evidence is worth? (Edit: should I take no one correcting me as no one knowing, or as my being right?)

• Maybe I’m confused, but isn’t log_2(131,115,984) about 26.9, and not greater than 27?

• I thought the same—did anyone ever reply?

• You need *at least* 26.9 bits. Since the boxes he talked about provide 2 bits each, you need 14 boxes to get *at least* 26.9 bits (13 boxes would only be 26 bits, not enough). 14 boxes happens to be 28 bits.

• Ok, I see—so do you always just add one bit?

• Erratum: “In the beginning, there is 1 winning ticket and 131,115,984 losing tickets, so your odds of winning are 1:131,115,984.” Correct: 1:131,115,985. (Five as the last digit.)

• You unfortunately forgot to mention the cost of a ticket in the lottery and the payout in the lottery. If the payout is high enough that the expected payout of the ticket is greater than or equal to the cost of the ticket, then the lottery makes sense to play. Since each ticket in that case has a payout equal to or greater than its cost, it makes sense to buy up all of the possible combinations to ensure a win.

• He’s talking about epistemology, not decision theory. Decision theory depends on a whole host of factors other than the probability of the desired outcome. I would buy a $1 lottery ticket if it were clear that it represented a 1/8,194,749 chance of winning $131,115,985. Epistemologically, however, I would be astonished if something happened besides me being $1 poorer.