DaemonicSigil comments on DaemonicSigil’s Shortform

DaemonicSigil 18 Oct 2025 5:27 UTC
Inspired partly by this post and partly by trying to think of simple test cases for a machine learning project I’m working on, here is a question (not too hard; you should try answering it yourself). Suppose we’ve observed $n$ trials of a Bernoulli random variable, and $k$ of them had outcome 1 (so $n - k$ had outcome 0). Laplace’s rule of succession (which follows from a uniform prior over the success probability) says that we should estimate a probability of $(k + 1) / (n + 2)$ for the next trial being 1. The question is: what is the prior over bitstrings $s$ of length $n + 1$ implied by Laplace’s rule of succession? In other words, can we convert the rule of succession formula into a probability distribution $p (s)$ over bitstrings $s$ that record the outcomes of $n + 1$ trials?

Additional clarification of the problem:

Given any particular observation of $n$ trials, there will be two bitstrings $s_{0}, s_{1}$ that are consistent with it, where the last (unobserved) trial is 0 or 1 respectively. We can compute the 1 probability (which should equal the result from the rule of succession) as:

$\frac{p (s_{1})}{p (s_{0}) + p (s_{1})} = \frac{w (s_{obs}) + 1}{n + 2}$

where $s_{obs} = s_{0} [: n] = s_{1} [: n]$ is the first $n$ bits of the string (corresponding to the visible observations) and $w$ is the Hamming weight function (it counts the number of 1s in a bitstring). Since the left-hand side is a ratio, the normalization of $p$ cancels, so you can also just provide an energy function $E (s)$ as your answer. The probability formula in this case is:

$\frac{e^{- E (s_{1})}}{e^{- E (s_{0})} + e^{- E (s_{1})}} = \frac{w (s_{obs}) + 1}{n + 2}$

A uniform distribution over bitstrings doesn’t work: the predicted probability of the next trial being 1 is then always $1 / 2$, regardless of the observations.
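As a minimal sketch of the formula above, the snippet below computes the predicted probability of a 1 from an arbitrary energy function and confirms that a constant energy (the uniform distribution) always gives $1 / 2$. The names `next_one_probability` and `uniform_energy` are illustrative and not part of the original problem statement.

```python
import math
from typing import Callable, Sequence


def next_one_probability(
    energy: Callable[[Sequence[int]], float], s_obs: Sequence[int]
) -> float:
    """P(next bit = 1 | s_obs) when p(s) is proportional to exp(-energy(s))."""
    s0 = list(s_obs) + [0]  # observed bits followed by an unobserved 0
    s1 = list(s_obs) + [1]  # observed bits followed by an unobserved 1
    p0 = math.exp(-energy(s0))
    p1 = math.exp(-energy(s1))
    return p1 / (p0 + p1)


def uniform_energy(s: Sequence[int]) -> float:
    """Constant energy, i.e. the uniform distribution over bitstrings."""
    return 0.0


# Three 1s out of four trials: the rule of succession would give 4/6,
# but the uniform distribution ignores the observations entirely.
print(next_one_probability(uniform_energy, [1, 1, 1, 0]))  # 0.5
```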

Answer:

The following energy function works:

$E (s) = \log \binom{n + 1}{w (s)}$

This can be checked by computing the probability as:

$\frac{\binom{n + 1}{w (s_{1})}^{- 1}}{\binom{n + 1}{w (s_{0})}^{- 1} + \binom{n + 1}{w (s_{1})}^{- 1}} = \frac{\binom{n + 1}{w (s_{obs}) + 1}^{- 1}}{\binom{n + 1}{w (s_{obs})}^{- 1} + \binom{n + 1}{w (s_{obs}) + 1}^{- 1}}$

$= \frac{w (s_{obs}) + 1}{w (s_{obs}) + 1 + n + 1 - w (s_{obs})} = \frac{w (s_{obs}) + 1}{n + 2}$
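The same check can be done by brute force. The sketch below assumes the energy is the log of the binomial coefficient "$n + 1$ choose $w (s)$", so that $e^{- E (s)}$ is the reciprocal of that coefficient, and uses exact rational arithmetic to compare the prediction with the rule of succession for every observation string of a given length. The function names are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb


def weight(s):
    """Hamming weight: the number of 1s in the bit tuple."""
    return sum(s)


def unnormalized_p(s):
    """exp(-E(s)) for E(s) = log C(len(s), w(s)), kept as an exact fraction."""
    return Fraction(1, comb(len(s), weight(s)))


def predictive(s_obs):
    """P(next bit = 1 | s_obs) for a tuple of n observed bits."""
    p0 = unnormalized_p(s_obs + (0,))
    p1 = unnormalized_p(s_obs + (1,))
    return p1 / (p0 + p1)


def check(n):
    """Compare against (k + 1) / (n + 2) for all 2**n observation strings."""
    for s_obs in product((0, 1), repeat=n):
        assert predictive(s_obs) == Fraction(weight(s_obs) + 1, n + 2)
    return True


for n in range(9):
    check(n)

print(predictive((1, 1, 1, 0)))  # 2/3
```

Using `Fraction` rather than floats means the equality test is exact, so a passing check is not an artifact of rounding.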

This energy function biases the distribution towards strings with more extreme ratios between counts of 0 and 1. We can think of it as countering the entropic effect of strings with an equal balance of 0 and 1 being the most prevalent.
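To make the entropic remark concrete, the normalization can be worked out explicitly. Here $Z$ is the normalizing constant (a symbol not used above) and $\binom{a}{b}$ is the binomial coefficient, taking $e^{- E (s)} = \binom{n + 1}{w (s)}^{- 1}$. There are $\binom{n + 1}{k}$ strings of weight $k$, each with unnormalized probability $\binom{n + 1}{k}^{- 1}$, so:

$Z = \sum_{s} e^{- E (s)} = \sum_{k = 0}^{n + 1} \binom{n + 1}{k} \binom{n + 1}{k}^{- 1} = n + 2$

$p (s) = \frac{1}{(n + 2) \binom{n + 1}{w (s)}}, \qquad P (w (s) = k) = \frac{1}{n + 2}$

So each Hamming weight from $0$ to $n + 1$ receives the same total probability, and that probability is split evenly among the strings of that weight. This matches what a uniform prior over the success probability gives for the number of successes in $n + 1$ trials.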