Inspired partially by this post and partially by trying to think of simple test cases for a machine learning project I’m working on, here is a (not too hard, you should try answering it yourself) question: Let’s say we’ve observed n trials of a Bernoulli random variable, and k had a 1 outcome (so n−k were 0). Laplace’s rule of succession (uniform prior over success probability) says that we should estimate a probability of (k+1)/(n+2) for the next trial being 1. The question is: What is the prior over bitstrings s of length n+1 implied by Laplace’s rule of succession? In other words, can we convert the rule of succession formula into a probability distribution p(s) over bitstrings s that record outcomes of n+1 trials?
Additional clarification of the problem:
Given any particular observation of n trials, there will be two bitstrings s0,s1 that are consistent with it, where the last (unobserved) trial is 0 or 1 respectively. We can compute the 1 probability (which should equal the result from the rule of succession) as:
p(s1)p(s0)+p(s1)=w(sobs)+1n+2
where sobs=s0[1:]=s1[1:] is the first n bits of the string (corresponding to visible observations) and w is the Hamming weight function (counts the number of 1s in a bitstring). Since this requires a normalization anyway, you can also just provide an energy function E(s) as your answer. The probability formula in this case is:
e−E(s1)e−E(s0)+e−E(s1)=w(sobs)+1n+2
If we just pick a uniform distribution over bitstrings, that doesn’t work. Then the predicted probability of the next trial is always just 1/2.
Answer:
The following energy function works:
E(s)=log(n+1w(s))
This can be checked by computing the probability as:
This energy function biases the distribution towards strings with more extreme ratios between counts of 0 and 1. We can think of it as countering the entropic effect of strings with an equal balance of 0 and 1 being the most prevalent.
Inspired partially by this post and partially by trying to think of simple test cases for a machine learning project I’m working on, here is a (not too hard, you should try answering it yourself) question: Let’s say we’ve observed n trials of a Bernoulli random variable, and k had a
1outcome (so n−k were0). Laplace’s rule of succession (uniform prior over success probability) says that we should estimate a probability of (k+1)/(n+2) for the next trial being1. The question is: What is the prior over bitstrings s of length n+1 implied by Laplace’s rule of succession? In other words, can we convert the rule of succession formula into a probability distribution p(s) over bitstringssthat record outcomes of n+1 trials?Additional clarification of the problem:
Given any particular observation of n trials, there will be two bitstrings s0,s1 that are consistent with it, where the last (unobserved) trial is
0or1respectively. We can compute the1probability (which should equal the result from the rule of succession) as:p(s1)p(s0)+p(s1)=w(sobs)+1n+2
where sobs=s0[1:]=s1[1:] is the first n bits of the string (corresponding to visible observations) and w is the Hamming weight function (counts the number of
1s in a bitstring). Since this requires a normalization anyway, you can also just provide an energy function E(s) as your answer. The probability formula in this case is:e−E(s1)e−E(s0)+e−E(s1)=w(sobs)+1n+2
If we just pick a uniform distribution over bitstrings, that doesn’t work. Then the predicted probability of the next trial is always just 1/2.
Answer:
The following energy function works:
E(s)=log(n+1w(s))
This can be checked by computing the probability as:
(n+1w(s1))−1(n+1w(s0))−1+(n+1w(s1))−1=(n+1w(sobs)+1)−1(n+1w(sobs))−1+(n+1w(sobs)+1)−1
=w(sobs)+1w(sobs)+1+n+1−w(sobs)=w(sobs)+1n+2
This energy function biases the distribution towards strings with more extreme ratios between counts of
0and1. We can think of it as countering the entropic effect of strings with an equal balance of0and1being the most prevalent.