I could do better by imagining that I will have infinitely many independent rolls, and then updating on that average being exactly 2.0 (in the limit). IIUC that should replicate the max relative entropy result (and might be a better way to argue for the max relative entropy method), but I have not checked that myself.
I had thought about something like that, but I’m not sure it actually works. My reasoning (which I expect is close to yours, since I learned about this theorem in a post of yours) was that, by the entropy concentration theorem, most outcome sequences satisfying a constraint have individual-result frequencies that match the maximum entropy distribution. I think this would in fact imply that if we had a set of results, were told the frequencies in that set satisfied the constraint, and then drew a random result out of the set, our probabilities for that result would follow the maximum entropy distribution, because it’s very likely that the frequency distribution in the set is the maximum entropy distribution, or close to it.
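To make the concentration claim concrete: here is a minimal numerical sketch (Python), borrowing the three-sided-die example worked out further down in the thread (faces 0, 1, 2, average constrained to 1.5). It counts how many sequences of length n realize each feasible frequency vector and reports the most common one, which lands close to the maximum entropy values of roughly (0.116, 0.268, 0.616).

```python
# Sketch of entropy concentration for a three-sided die (faces 0, 1, 2),
# conditioning on the sample average being exactly 1.5.
# Among all length-n sequences whose faces sum to 1.5 * n, each frequency
# vector (n0, n1, n2) is realized by n! / (n0! n1! n2!) sequences; the
# most common vector should be close to the maximum entropy distribution.
from math import lgamma

def log_multinomial(n, counts):
    """Log of the number of sequences realizing the given counts."""
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

def most_likely_freqs(n):
    total = 3 * n // 2                  # the faces must sum to 1.5 * n
    best, best_log = None, float("-inf")
    # n0 + n1 + n2 = n and n1 + 2 * n2 = total leave one free parameter, n2
    for n2 in range(n + 1):
        n1 = total - 2 * n2
        n0 = n - n1 - n2
        if n0 < 0 or n1 < 0:
            continue
        lm = log_multinomial(n, (n0, n1, n2))
        if lm > best_log:
            best_log, best = lm, (n0 / n, n1 / n, n2 / n)
    return best

print(most_likely_freqs(4000))  # close to (0.116, 0.268, 0.616)
```

(The `most_likely_freqs` name and the choice n = 4000 are just for illustration; for small n the argmax sits slightly off the maxent point and converges to it as n grows.)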
However, we are not actually drawing from a set of results that satisfied this constraint: we had a past set of results that satisfied it, and we are drawing a new result that isn’t a member of that set. In order for knowledge of these past results to influence our beliefs about the next result, our probabilities for the past results and the next result have to be correlated. And it would be a really weird coincidence if our distribution had the next result correlated with the past results, but the past results not correlated with each other. So the past results probably are correlated with each other, which breaks the assumption that all possible past sequences are equally likely!
IIUC there are two scenarios to be distinguished:

One is that the die has a bias p unknown to you (you have some prior over p), and you use i.i.d. rolls to estimate the bias as usual and get the maxent distribution for a new draw. The draws are independent given p but not independent given your priors, so everything works out.
The other is that the die is literally i.i.d. over your priors. In this case, everything in your argument goes through: whatever bias/constraint you happen to estimate from your outcome sequence says nothing about a new i.i.d. draw, because they’re uncorrelated; the new draw is just another sample from your prior.
I think that’s a good way of phrasing it, except that I would emphasize that these are two different states of knowledge, not necessarily two different states of the world.
I didn’t think it would work out to the maximum entropy distribution even in your first case, so I worked out an example to check:
Suppose we have a three-sided die that can land on 0, 1, or 2. Then suppose we are told the die was rolled several times, and the average value was 1.5. The maximum entropy distribution is (if my math is correct) probability 0.116 for 0, 0.268 for 1, and 0.616 for 2.
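For reference, those maxent numbers come out in closed form; a minimal sketch (Python):

```python
# Sketch: maximum entropy distribution on faces {0, 1, 2} subject to
# E[X] = 1.5. Maxent gives p_i proportional to exp(lambda * i); writing
# x = exp(lambda), the mean constraint (x + 2*x**2) / (1 + x + x**2) = 1.5
# reduces to x**2 - x - 3 = 0.
from math import sqrt

x = (1 + sqrt(13)) / 2            # positive root of x^2 - x - 3 = 0
Z = 1 + x + x**2                  # normalizing constant
p = [x**i / Z for i in range(3)]

print([round(q, 3) for q in p])   # [0.116, 0.268, 0.616]
print(p[1] + 2 * p[2])            # mean constraint, = 1.5 up to rounding
```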
Now suppose we had a prior analogous to Laplace’s Rule: two parameters p0 and p1 for the “true probability” or “bias” of 0 and 1, and uniform probability density 2 dp0 dp1 over all possible values of these parameters (the region where their sum is less than 1, which has area 1⁄2). Then, as the number of rolls goes to infinity, the probability each possible set of parameter values assigns to the average being 1.5 goes to 1 if 1.5 is their expected value, and to 0 otherwise. So we can condition on “the true values give an expected value of 1.5”, i.e. p1 + 2p2 = 1.5 together with p0 + p1 + p2 = 1, which restricts the prior to a line segment on which it is still uniform. Taking expectations over that segment gives probabilities of 0.125 for 0, 0.25 for 1, and 0.625 for 2.
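Those conditional numbers are easy to check numerically, assuming nothing beyond the setup above; under the constraint the feasible set is a one-dimensional segment, so the expectation is a simple average:

```python
# Sketch: checking the conditional probabilities under the uniform
# (Laplace-style) prior on (p0, p1). Conditioning on the expected value,
# p1 + 2*p2 = 1.5, together with p0 + p1 + p2 = 1, pins
# p1 = 1.5 - 2*p2 and p0 = p2 - 0.5, so the feasible set is the segment
# p2 in [0.5, 0.75], on which the uniform prior stays uniform.
# Averaging over that segment with a midpoint rule:
n = 10_000
lo, hi = 0.5, 0.75
e0 = e1 = e2 = 0.0
for k in range(n):
    p2 = lo + (hi - lo) * (k + 0.5) / n   # midpoints of the segment
    e0 += (p2 - 0.5) / n
    e1 += (1.5 - 2 * p2) / n
    e2 += p2 / n

print(round(e0, 6), round(e1, 6), round(e2, 6))  # 0.125 0.25 0.625
```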
That is not exactly equal to the maximum entropy distribution, but it’s surprisingly close! Now I’m wondering if there’s a different prior that gives the maximum entropy distribution exactly. I really should have worked out an actual numerical example sooner; I had previously thought of this example, assumed it would end up at values different from the maximum entropy distribution, and never carried it to the end to notice that it actually lands very close to it.