In this example, Mr. A has learned the average numbers of red, yellow, and green orders over some past days and wants to update his predictions of today's orders based on this information. So he decides that the expected values of his distributions should be equal to those averages, and that he should find the distribution that makes the fewest assumptions given those constraints. I at least agree that entropy is a good measure of how few assumptions your distribution makes. The point I'm confused about is how you get from "the average of this number in past observations is N" to "the expected value of our distribution for a future observation has to be N, but we should put no other information in it".
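For concreteness, the procedure Mr. A is described as using seems to be a constrained maximum-entropy fit, something like the following sketch (the outcome space of 1..10 and the target mean of 3.0 are numbers I made up, not from the post):

```python
# Sketch: among distributions over a hypothetical outcome space 1..10, find
# the one with a given expected value and maximum entropy. The solution has
# the form p_k proportional to exp(lam * k); we solve for lam by bisection.
import math

outcomes = list(range(1, 11))   # hypothetical widget counts
target_mean = 3.0               # hypothetical observed average

def dist(lam):
    weights = [math.exp(lam * k) for k in outcomes]
    z = sum(weights)
    return [w / z for w in weights]

def mean(p):
    return sum(k * pk for k, pk in zip(outcomes, p))

# The mean of dist(lam) increases monotonically with lam, so bisect on it.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean(dist(mid)) < target_mean:
        lo = mid
    else:
        hi = mid

p = dist((lo + hi) / 2)
print(mean(p))                               # ~3.0, as constrained
print(-sum(pk * math.log(pk) for pk in p))   # entropy of the chosen distribution
```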
I agree that it’s implausible that Mr A has enough data to be confident of the averages, but not enough data to draw any other conclusions. Such is often the case with math exercises. :shrug:
Second, why are you even looking for a constrained-optimal distribution in the first place, rather than just taking your prior distribution over sequences of results and your observations, and using Bayes’ Theorem to update your probabilities for future results? Even if you don’t know anything other than the average value, you can still take your distribution over sequences of results, update it on this information (eliminating the possible outcome sequences that don’t have this average value), and then find the distribution P(NextResult|AverageValue) by integrating P(NextResult|PastResults)P(PastResults|AverageValue) over the possible PastResults. This seems like the correct thing to do according to Bayesian probability theory, and it’s very different from doing constrained optimization to find a distribution.
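To make that concrete, here's a toy sketch of the computation I have in mind, with a made-up prior (outcomes 1..3, three past results, and a latent "setting" that makes the draws exchangeable but not independent). The specifics don't matter; the point is the shape of the calculation:

```python
# Toy sketch: build a joint prior over (past sequence, next result), condition
# on the past average, then marginalize out the past results to obtain
# P(NextResult | AverageValue).
from itertools import product
from collections import defaultdict

OUTCOMES = (1, 2, 3)
N_PAST = 3
TARGET_SUM = 7          # i.e. the observed average is 7/3

# Hypothetical latent settings: given a setting, each day is an i.i.d. draw.
settings = {"low": {1: 0.6, 2: 0.3, 3: 0.1},
            "high": {1: 0.1, 2: 0.3, 3: 0.6}}
setting_prior = {"low": 0.5, "high": 0.5}

# Joint prior over (past sequence, next result), latent setting marginalized out.
joint = defaultdict(float)
for past in product(OUTCOMES, repeat=N_PAST):
    for nxt in OUTCOMES:
        for s, w in settings.items():
            p = setting_prior[s] * w[nxt]
            for x in past:
                p *= w[x]
            joint[(past, nxt)] += p

# Update on the average (keep only consistent past sequences), then integrate
# P(NextResult | PastResults) * P(PastResults | AverageValue) over PastResults.
predictive, total = defaultdict(float), 0.0
for (past, nxt), p in joint.items():
    if sum(past) == TARGET_SUM:
        predictive[nxt] += p
        total += p

print({k: round(v / total, 3) for k, v in predictive.items()})
# This is just Bayes; nothing forces the answer to be the max-entropy
# distribution whose mean equals the observed average.
```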
In the example in the post, what would you say is the “prior distribution over sequences of results”? All Mr A has is a probability distribution for widgets each day. If I naively turn that into a distribution over sequences of widget orders, the simplest option is to assume independent draws from that distribution each day. But then Mr A is in the same situation as the “poorly informed robot”.
The reason one can’t use Bayes’ rule in this case is a type error. If Mr A had a prior probability distribution over probability distributions, P[P_i], then he could use Bayes’ rule to update P[P_i] to a posterior, and then integrate: P_final = Sum_i P[P_i] P_i. But the problem with this is that the answer will depend on how you generalise from P[N,N,N] to P[P_i], and there isn’t a unique way to do this.
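To make that concrete, here is a small sketch with made-up candidates and made-up data (outcomes 1..5, three observations). Two different choices of the prior P[P_i] over the same candidates give two different final distributions:

```python
# Sketch: Bayes' rule applied to a prior over candidate distributions P_i,
# followed by the mixture P_final = Sum_i P[P_i | data] * P_i. The answer
# depends on the chosen prior P[P_i], which is the non-uniqueness point.
data = [2, 4, 3]   # hypothetical past observations, outcomes in 1..5

# A hypothetical family of candidate distributions P_i over outcomes 1..5.
candidates = {
    "peaked_low":  [0.40, 0.30, 0.15, 0.10, 0.05],
    "uniform":     [0.20, 0.20, 0.20, 0.20, 0.20],
    "peaked_high": [0.05, 0.10, 0.15, 0.30, 0.40],
}

def posterior_predictive(prior):
    """Bayes over the candidates, then mix: P_final = Sum_i P[P_i | data] P_i."""
    weights = {}
    for name, p in candidates.items():
        likelihood = 1.0
        for x in data:                 # P(data | P_i), i.i.d. given P_i
            likelihood *= p[x - 1]
        weights[name] = prior[name] * likelihood
    z = sum(weights.values())
    return [sum(weights[name] / z * candidates[name][k] for name in candidates)
            for k in range(5)]

# Two different priors over the same candidates give different final answers.
print(posterior_predictive({"peaked_low": 1/3, "uniform": 1/3, "peaked_high": 1/3}))
print(posterior_predictive({"peaked_low": 0.1, "uniform": 0.8, "peaked_high": 0.1}))
```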
In the example in the post, what would you say is the “prior distribution over sequences of results”?
I don’t actually know.
If it’s a binary experiment, like a “biased coin” that outputs either Heads or Tails, an appropriate distribution is Laplace’s Rule of Succession (like I mentioned). Laplace’s Rule has a parameter p that is the “objective probability” of Heads, in the sense that if we know p, our probability of each result being Heads is p, independently. (I don’t think it makes sense to think of p as an actual probability, since it’s not anybody’s belief; I think a more correct interpretation of it is the fraction of the space of possible initial states that ends up in Heads.)
Then the results are independent given the latent variable p, but since we initially don’t know p they’re not actually independent; learning one result gives us information about p, which we can use to infer things about the next result. It ends up giving more probability to the sequences with almost all Heads or Tails. (If after seeing a Head, another Head becomes more probable, the sequence HH must necessarily have more probability than the sequence HT.)
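Concretely, with a uniform prior on p this works out to the following (a minimal sketch; the printed numbers are just for illustration):

```python
# Laplace's Rule of Succession: uniform prior on p, results i.i.d. given p.
# The posterior predictive is (h + 1) / (n + 2), and integrating p out gives
# repeated-result sequences (HH) more prior mass than mixed ones (HT).
from math import comb

def prob_next_heads(n_heads, n_flips):
    """P(next flip is Heads | n_heads Heads in n_flips flips), uniform prior on p."""
    return (n_heads + 1) / (n_flips + 2)

def sequence_prob(n_heads, n_tails):
    """Marginal probability of one specific sequence with these counts:
    the integral of p^h (1-p)^t over a uniform prior = 1 / ((n+1) * C(n, h))."""
    n = n_heads + n_tails
    return 1 / ((n + 1) * comb(n, n_heads))

print(prob_next_heads(1, 1))   # 2/3: after one Head, another Head is more likely
print(sequence_prob(2, 0))     # P(HH) = 1/3
print(sequence_prob(1, 1))     # P(HT) = 1/6 < P(HH)
```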
In this case our variable is the number of widgets, which has 100 possible values. How do you generalize Laplace’s Rule to that? I don’t know. You could do something exactly like Laplace’s Rule with 100 different “bins” instead of 2, but that wouldn’t actually capture all our intuitions: for example, after getting 34 widgets one day, we’d say getting 36 the next day is more likely than getting 77. If there’s an actual distribution people use here, I’d be interested in learning about it.
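Here's what that naive 100-bin version would look like (a sketch; the single observation of 34 widgets is a made-up example). It treats the bins as unrelated labels, so after seeing 34 it rates 36 and 77 as exactly equally likely, which is the failure of intuition I mean:

```python
# Generalized rule of succession over K unordered bins (a uniform prior over
# distributions on the bins): posterior predictive is (count_k + 1) / (n + K).
K = 100                      # possible widget counts, treated as unordered bins
observations = [34]          # hypothetical single past observation

counts = [0] * (K + 1)       # index 1..K (index 0 unused)
for x in observations:
    counts[x] += 1
n = len(observations)

def prob_next(k):
    """P(next value is k | observations) under the K-bin rule of succession."""
    return (counts[k] + 1) / (n + K)

print(prob_next(34))   # the observed value gets a boost
print(prob_next(36))   # equals prob_next(77): no notion of "nearby" values
print(prob_next(77))
```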
The problem I have is that with any such distribution, we’d perform this process of taking the observed values, updating our distributions for the latent parameters conditional on them, and using the updated distributions to make more precise predictions for future values. This process is very different from assuming that a fact about the frequencies must also hold for our distribution, and then finding the “least informative” distribution with that property. In the case of Laplace’s Rule, our probability of Heads (and our expected value of p) ends up pretty close to the observed frequency of Heads, but that’s not a fundamental fact; it’s derived from the assumptions. Which correspondences do you derive from which assumptions in the widget case? That is what I’m confused about.