Inspired partially by this post and partially by trying to think of simple test cases for a machine learning project I'm working on, here is a question (not too hard; you should try answering it yourself). Let's say we've observed $n$ trials of a Bernoulli random variable, and $k$ of them had a 1 outcome (so $n - k$ were 0). Laplace's rule of succession (uniform prior over the success probability) says that we should estimate a probability of $\frac{k+1}{n+2}$ for the next trial being 1. The question is: what is the prior over bitstrings of length $n+1$ implied by Laplace's rule of succession? In other words, can we convert the rule of succession formula into a probability distribution over bitstrings $s$ that record the outcomes of $n+1$ trials?
Additional clarification of the problem:
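Given any particular observation of $n$ trials, there will be two bitstrings of length $n+1$ that are consistent with it, where the last (unobserved) trial is 0 or 1 respectively. We can compute the probability of a 1 (which should equal the result from the rule of succession) as:
$$P(s_{n+1} = 1 \mid s_{1:n}) = \frac{P(s_{1:n}1)}{P(s_{1:n}0) + P(s_{1:n}1)} = \frac{w(s_{1:n}) + 1}{n + 2}$$
where $s_{1:n}$ is the first $n$ bits of the string (corresponding to visible observations), $s_{1:n}0$ and $s_{1:n}1$ are its two possible completions, and $w$ is the Hamming weight function (counts the number of 1s in a bitstring). Since this requires a normalization anyway, you can also just provide an energy function $E$ as your answer. The probability formula in this case is:
$$P(s_{n+1} = 1 \mid s_{1:n}) = \frac{e^{-E(s_{1:n}1)}}{e^{-E(s_{1:n}0)} + e^{-E(s_{1:n}1)}}$$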
If we just pick a uniform distribution over bitstrings, that doesn't work: the predicted probability of the next trial being 1 is then always just $\frac{1}{2}$, regardless of the observations.
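As a quick sanity check of the uniform case, here is a minimal Python sketch. The helper name `predictive_prob_of_one` and the choice of 4 observed trials are illustrative assumptions, not part of the original question. It enumerates every bitstring of length 5, assigns each the same prior mass, and collects the predicted probability of a 1 for every possible observed prefix.

```python
from fractions import Fraction
from itertools import product


def predictive_prob_of_one(prior, prefix):
    """P(next bit = 1 | prefix) under a prior over full-length bitstrings."""
    p0 = prior[prefix + (0,)]
    p1 = prior[prefix + (1,)]
    return p1 / (p0 + p1)


n = 4  # number of observed trials (illustrative choice)
strings = list(product((0, 1), repeat=n + 1))

# Uniform prior: every bitstring of length n + 1 gets the same mass.
uniform = {s: Fraction(1, len(strings)) for s in strings}

# Collect the prediction for every possible observed prefix.
uniform_predictions = {predictive_prob_of_one(uniform, s[:n]) for s in strings}
print(uniform_predictions)
```

The set contains a single value, one half, so the observations never move the prediction.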
Answer:
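The following energy function works:
$$E(s) = \log \binom{n+1}{w(s)}$$
where $n+1$ is the length of $s$ and $w(s)$ is its Hamming weight. This can be checked by computing the probability as follows, writing $k = w(s_{1:n})$ for the number of 1s among the first $n$ bits:
$$P(s_{n+1} = 1 \mid s_{1:n}) = \frac{\binom{n+1}{k+1}^{-1}}{\binom{n+1}{k}^{-1} + \binom{n+1}{k+1}^{-1}} = \frac{\binom{n+1}{k}}{\binom{n+1}{k} + \binom{n+1}{k+1}} = \frac{\binom{n+1}{k}}{\binom{n+2}{k+1}} = \frac{k+1}{n+2}$$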
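The check can also be done by brute force. The sketch below assumes the energy of a bitstring is the log of the binomial coefficient "length choose Hamming weight", which is the form consistent with a uniform prior over the success probability. The function names are hypothetical. It uses exact fractions, so the comparison with the rule of succession is not subject to floating-point error.

```python
from fractions import Fraction
from itertools import product
from math import comb


def boltzmann_weight(s):
    """exp(-E(s)) for the assumed energy E(s) = log C(len(s), w(s)), as an exact fraction."""
    return Fraction(1, comb(len(s), sum(s)))


def predictive_prob_of_one(prefix):
    """P(next bit = 1 | prefix), normalizing over the two consistent completions."""
    w0 = boltzmann_weight(prefix + (0,))
    w1 = boltzmann_weight(prefix + (1,))
    return w1 / (w0 + w1)


def laplace(prefix):
    """Rule of succession: (ones + 1) / (trials + 2)."""
    return Fraction(sum(prefix) + 1, len(prefix) + 2)


def check_rule_of_succession(max_n=8):
    """Compare both formulas on every prefix of length 0..max_n."""
    for n in range(max_n + 1):
        for prefix in product((0, 1), repeat=n):
            if predictive_prob_of_one(prefix) != laplace(prefix):
                return False
    return True


print(check_rule_of_succession())
print(predictive_prob_of_one((1, 1, 0)))  # 2 ones in 3 trials
```

For the prefix `(1, 1, 0)` the two completions have weights 1/6 and 1/4, giving 3/5, which matches (2 + 1) / (3 + 2).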
This energy function biases the distribution towards strings with more extreme ratios between counts of 0 and 1. We can think of it as countering the entropic effect of strings with an equal balance of 0 and 1 being the most prevalent.
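One way to see the entropic effect being cancelled is to look at the distribution this prior induces over Hamming weights. The sketch below makes the same assumption about the energy (log of "length choose Hamming weight"). `weight_marginal` is a hypothetical helper. Each string's mass is divided by the number of strings sharing its weight, so the total mass at every weight comes out the same.

```python
from fractions import Fraction
from itertools import product
from math import comb


def weight_marginal(length):
    """Distribution over Hamming weights 0..length under P(s) proportional to 1 / C(length, w(s))."""
    strings = list(product((0, 1), repeat=length))
    unnormalized = {s: Fraction(1, comb(length, sum(s))) for s in strings}
    z = sum(unnormalized.values())  # partition function

    marginal = [Fraction(0)] * (length + 1)
    for s, u in unnormalized.items():
        marginal[sum(s)] += u / z
    return marginal


print(weight_marginal(5))
```

For length 5 each of the six possible weights receives probability 1/6, whereas a uniform prior over strings would concentrate the mass on weights 2 and 3.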
I have read that some sequencing methods (nanopore) have a high error rate (comparing multiple reads can help correct this). Did you also spot-check some other genes that you have no reason to believe contain mutations to see if they look ok? Seeing a mutation in exactly the gene you expect is only damn strong evidence if there isn’t a sequencing error in every third gene.