In the example in the post, what would you say is the “prior distribution over sequences of results”?
I don’t actually know.
If it’s a binary experiment, like a “biased coin” that outputs either Heads or Tails, an appropriate distribution is the one behind Laplace’s Rule of Succession (as I mentioned). That model has a latent parameter p that is the “objective probability” of Heads, in the sense that if we know p, our probability of Heads on each trial is p, independently of the other trials. (I don’t think it makes sense to think of p as an actual probability, since it’s not anybody’s belief; I think a more correct interpretation is that p is the fraction of the space of possible initial states that ends up in Heads.)
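To make that concrete, here’s a minimal sketch in Python, assuming the uniform prior over p that gives Laplace’s Rule (the function name is just for illustration):

```python
# Minimal sketch of the model behind Laplace's Rule of Succession,
# assuming a uniform (Beta(1, 1)) prior over the latent parameter p.
# After seeing `heads` Heads in `n` flips, the posterior over p is
# Beta(heads + 1, n - heads + 1), and the predictive probability of
# another Head is (heads + 1) / (n + 2).

def laplace_predictive(heads: int, n: int) -> float:
    """Posterior predictive P(next result is Heads | data)."""
    return (heads + 1) / (n + 2)

print(laplace_predictive(0, 0))   # 0.5   -- no data, symmetric prior
print(laplace_predictive(7, 10))  # 0.667 -- pulled toward the observed 0.7
```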
Then the results are independent given the latent variable p, but since we initially don’t know p they’re not actually independent: learning one result gives us information about p, which we can use to infer things about the next result. The resulting distribution gives more probability to sequences that are almost all Heads or almost all Tails. (If, after seeing a Head, another Head becomes more probable, then the sequence HH must necessarily have more probability than the sequence HT.)
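Here’s a quick numeric check of that last claim, again assuming the uniform prior over p. The marginal probability of an exact sequence with h Heads and t Tails works out to h! t! / (h + t + 1)!:

```python
from math import factorial

# Marginal probability of an exact Heads/Tails sequence under a uniform
# prior on p: the integral of p^h * (1-p)^t over [0, 1], which equals
# h! * t! / (h + t + 1)!.
def sequence_prob(seq: str) -> float:
    h = seq.count("H")
    t = seq.count("T")
    return factorial(h) * factorial(t) / factorial(h + t + 1)

print(sequence_prob("HH"))    # 1/3
print(sequence_prob("HT"))    # 1/6  -- HH is twice as probable as HT
print(sequence_prob("HHHH"))  # 1/5
print(sequence_prob("HHTT"))  # 1/30 -- near-uniform sequences get less mass
```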
In this case our variable is the number of widgets, which has 100 possible values. How do you generalize Laplace’s Rule to that? I don’t know. You could do something exactly like Laplace’s Rule with 100 different “bins” instead of 2, but that wouldn’t capture all our intuitions. For example, after getting 34 widgets one day, we’d say getting 36 the next day is more likely than getting 77. If there’s an actual distribution people use here, I’d be interested in learning about it.
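For concreteness, this is what I mean by the naive “100 bins” version: a symmetric Dirichlet(1, …, 1) prior over the outcome probabilities. The helper below is hypothetical, just to show how it fails the closeness intuition:

```python
from collections import Counter

# Hypothetical "100 bins" generalization of Laplace's Rule: a symmetric
# Dirichlet(1, ..., 1) prior over the 100 outcome probabilities. The
# predictive probability of outcome k is then (count[k] + 1) / (n + 100),
# which is identical for every unseen bin -- the model has no notion of
# 36 being "closer" to 34 than 77 is.
def dirichlet_predictive(observations: list[int], k: int, num_bins: int = 100) -> float:
    counts = Counter(observations)
    n = len(observations)
    return (counts[k] + 1) / (n + num_bins)

obs = [34]
print(dirichlet_predictive(obs, 36))  # 1/101, the same as...
print(dirichlet_predictive(obs, 77))  # ...this one.
```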
The problem I have is that, with any such distribution, we’d perform the same process: take the observed values, update our distributions over the latent parameters conditional on them, and use the updated distributions to make more precise predictions about future values. This process is very different from assuming that a fact about the frequencies must also hold for our distribution and then finding the “least informative” distribution with that property. In the case of Laplace’s Rule, our probability of Heads (and our expected value of p) ends up pretty close to the observed frequency of Heads, but that’s not a fundamental fact; it’s derived from the assumptions (a quick numeric check below). Which correspondences do you derive from which assumptions, in the widget case? That is what I’m confused about.
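Here’s the numeric check I mentioned: under Laplace’s Rule the predictive probability (h + 1) / (n + 2) tracks the observed frequency h / n more and more closely as data accumulates, as a consequence of the assumptions rather than as a starting axiom:

```python
# Predictive probability of Heads under Laplace's Rule vs. the raw observed
# frequency, for a coin that keeps coming up Heads 70% of the time.
for n, h in [(10, 7), (100, 70), (10_000, 7_000)]:
    predictive = (h + 1) / (n + 2)
    frequency = h / n
    print(f"n={n:>6}: predictive={predictive:.4f}, frequency={frequency:.4f}")
```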