The dangers of zero and one

Eliezer wrote a post warning against unrealistically confident estimates, in which he argued that you can't be 99.99% sure that 53 is prime. Chris Hallquist replied with a post arguing that you can.

That particular case is tricky. There have been many independent calculations of the first hundred prime numbers, and 53 is a small enough number that I think someone would notice if Wikipedia included it erroneously. But can you be 99.99% confident that 1159 is prime? You found it in one particular source. Can you trust that source? It's a large enough number that no one would notice if it were wrong. You could try to verify it, but if I write a Perl or C++ program, I can't even be 99.9% sure that the compiler or interpreter will interpret it correctly, let alone that the program itself is correct.
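(If you do want to check a number like that with a program, the check itself is tiny; the question is how far you trust the stack running it. A minimal sketch, in Python for illustration, though the same worry applies to Perl or C++:)

```python
def is_prime(n: int) -> bool:
    """Primality by trial division up to sqrt(n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

# Run it yourself before betting at 10,000:1 odds:
print(is_prime(53), is_prime(1159))
```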

Rather than argue over the number of nines to use for a specific case, I want to emphasize the importance of not assigning things probability zero or one. Here's a real case where approximating 99.9999% confidence as 100% had disastrous consequences.

I developed a new gene caller for JCVI. Genes are interpreted in units of 3 DNA nucleotides called codons. A bacterial gene starts with a start codon (usually ATG, TTG, or GTG) and ends at the first stop codon (usually TAG, TGA, or TAA). Most such sequences are not genes. A gene caller is a computer program that takes a DNA sequence and guesses which of those sequences are genes.

The first thing I tried was to create a second-order Markov model on codons and train it on all of the long possible genes in the genome. (Long sequences without stop codons are unlikely to occur by chance and are probably genes.) That means you set P = 1 and go down the sequence of each long possible gene, codon by codon, multiplying P by the probability of seeing that codon (one of 64 possibilities) in the third position, given the codons in the first and second positions. Then I created a second Markov model from the entire genome. This took about one day to write, and plugging these two models into Bayes' law as shown below turned out to work better than all the other single-method gene-prediction algorithms developed over the past 30 years.
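A minimal sketch of that training step (Python for illustration; the names here are mine, not the actual JCVI code):

```python
from collections import defaultdict

def train_codon_markov(training_sequences):
    """Second-order Markov model on codons: count how often each codon
    follows each ordered pair of codons, to estimate P(codon3 | codon1, codon2)."""
    pair_counts = defaultdict(int)    # occurrences of (codon1, codon2)
    triple_counts = defaultdict(int)  # occurrences of (codon1, codon2, codon3)
    for codons in training_sequences:  # each item: a list of codon strings
        for c1, c2, c3 in zip(codons, codons[1:], codons[2:]):
            pair_counts[(c1, c2)] += 1
            triple_counts[(c1, c2, c3)] += 1
    return pair_counts, triple_counts

def codon_prob(pair_counts, triple_counts, c1, c2, c3):
    """Raw maximum-likelihood estimate of P(c3 | c1, c2). What to return
    when the triple was never seen is exactly the question below."""
    pair = pair_counts[(c1, c2)]
    return triple_counts[(c1, c2, c3)] / pair if pair else 0.0
```

Train it once on the long possible genes and once on the whole genome, and you have the two models.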

But what probability should you assign to a codon sequence that you've never seen? A bacterial genome might have 4 million base pairs, about half of which are in long possible genes and will be used for training. That means your training data for one genome has about 2 million codon triplets. Surprisingly, a little less than half of all possible codon triplets do not occur at all in that data (DNA sequences are not random). What probability do you assign to an event that occurs zero times out of 2 million?

This came up recently in an online argument. Another person said that, if the probability that X is true is below your detection threshold or your digits of accuracy, you should assign P(X) = 0, since any other number is just made up.

Well, I'd already empirically determined whether that was true for the gene caller. First, due to a coding error, I assigned such events P(X) = 1 / (64^3 * size of training set), which is too small by a factor of about 64^3. Next I tried P(X) = 0.5 / (size of training set), which is approximately correct. Finally I tried P(X) = 0. I tested the results on genomes where I had strong evidence for what was and was not a gene.
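In code, the three variants differ only in the floor substituted for a zero count (a sketch; the constant N and the names are mine):

```python
N = 2_000_000  # approximate number of codon triplets in the training data

def smoothed_prob(triple_count, pair_count, floor):
    """P(codon3 | codon1, codon2), with `floor` substituted for
    triplets that never occurred in training."""
    if triple_count == 0:
        return floor
    return triple_count / pair_count

floor_buggy   = 1.0 / (64**3 * N)  # the coding error: too small by ~64^3
floor_correct = 0.5 / N            # a reasonable pseudocount
floor_zero    = 0.0                # "below your accuracy, so call it 0"
```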

How well do you think each P(X) worked?

The two non-zero probabilities gave nearly the same results, despite differing by about five orders of magnitude. But using P(X) = 0 caused the gene caller to miss hundreds of genes per genome, which is a disastrous result. Why?

Any particular codon triplet that was never found in the training set would have a prior of less than one in 4 million. But because a large number of triplets occur in genes outside the training set, some of those triplets (not most, but about a thousand of them) had a true prior of nearly one half of being found somewhere in those genes. (You can work it out in more detail by assuming a Zipf's-law distribution of priors, but I won't get into that.)

So some of them did occur within genes in that genome, and each time one did, its assigned probability of zero annihilated all the hundreds of other pieces of evidence for the existence of that gene, making the gene impossible to detect.
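Here's a toy illustration of that annihilation (the numbers are invented, not from the JCVI data): hundreds of mildly favorable factors cannot outvote a single zero.

```python
import math

# 300 codon positions, each contributing a modest likelihood ratio
# of 1.2 : 1 in favor of "gene" -- overwhelming evidence in aggregate.
ratios = [1.2] * 300
print(math.prod(ratios))  # ~5.7e23 : 1 in favor of the gene

# Now one unseen triplet inside the gene is assigned probability zero:
ratios[150] = 0.0
print(math.prod(ratios))  # 0.0 -- the gene can never be called
```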

You can think of this using logarithms. I computed

P(gene | sequence) = P(sequence | gene) * P(gene) / P(sequence)

where P(sequence) and P(sequence | gene) are computed using the two Markov models. Each of them is the product of a sequence of Markov probabilities. Ignoring P(gene), which is constant, we can compute

log(P(gene | sequence)) ~ log(P(sequence | gene)) - log(P(sequence)) =

sum (over all codon triplets in the sequence) [ log(P(codon3 | codon1, codon2, gene)) - log(P(codon3 | codon1, codon2)) ]

You can think of this as adding the bits of information it would take to specify that triplet outside of a gene, and subtracting the bits of information it would take to specify that triplet inside a gene, leaving the bits of evidence that it is in a gene.
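That bookkeeping, as a sketch (log base 2, so the units are bits; p_gene and p_background stand in for the two Markov models' conditional probabilities):

```python
import math

def gene_bits(triplets, p_gene, p_background):
    """Total bits of evidence that a sequence is a gene: for each codon
    triplet, log2 P(c3 | c1, c2, gene) - log2 P(c3 | c1, c2)."""
    bits = 0.0
    for c1, c2, c3 in triplets:
        # A zero from p_gene is fatal: mathematically the term is
        # -infinity (math.log2 raises on 0), and no amount of other
        # evidence can bring the sum back up.
        bits += math.log2(p_gene(c1, c2, c3)) - math.log2(p_background(c1, c2, c3))
    return bits
```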

If we assign P(codon3 | codon1, codon2, gene) = 0, the number of bits of information it would take to specify "codon3 | codon1, codon2" inside a gene is -log(0) = infinity. Assigning P(X) = 0 is claiming to have infinite bits of information that X is false.

Going back to the argument, the accuracy of the probabilities assigned by the Markov model is quite low, probably one to three digits of accuracy in most cases. Yet it was important to assign positive probabilities to events whose probabilities were at least seven orders of magnitude below that.

It didn't matter what probability I assigned to them! With hundreds of other bit scores to add up, changing the number of bits taken away by one highly improbable event by 10 had little impact. What matters is simply not making it zero.