I believe this post is (for the most part) accurate and demonstrates understanding of what is going on with logical induction. Thanks for writing (and coding) it!
Note that I tentatively think this will be the last post in the Geometric Rationality sequence.
I think it is EDT vs UDT. We prefer B to A, but we prefer CA to CB, not because of Dutch books, but because CA is good enough for Bob to be fair, and A is not good enough for Bob.
I don’t know that I have much “solid theoretical grounding.” From my perspective, this sequence is me putting together a bunch of related concepts (and thus doing some of the hard parts of noticing that they are related), but not really giving good theoretical grounding. In fact, I was putting off posting this sequence, so I could have time to develop theoretical grounding, but then gave up on that and just posted what I had in response to the community wanting orientation around FTX.
Yeah, I think this definition is more centrally talking about Nash bargaining than Kelly betting. Kelly betting can be expressed as maximizing a utility function that is logarithmic in wealth, and so can be seen as VNM rational.
The point I was trying to make with the partial functions was something like “Yeah, there are 0s, yeah it is bad, but at least we can never assign low probability to any event that any of the hypotheses actually cares about.” I guess I could have made that argument more clearly if instead I just pointed out that any event in the sigma algebra of any of the hypotheses will have probability at least equal to the probability of that hypothesis times the probability of that event in that hypothesis. Thus the 0s (and the 10⁻⁹s) are really coming from the fact that (almost) nobody cares about those events.
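To make that bound concrete, here is a brute-force sketch. The example, weights, and helper names are all mine, not from the post, and I am assuming the output is the Q maximizing the weighted expected log-probability of a drawn part (i.e. the log of the geometric maximization target):

```python
import numpy as np
from itertools import product

# Toy check (my own example, not from the post) of the bound: the output Q
# assigns every event in hypothesis i's sigma algebra probability at least
# (weight of hypothesis i) * P_i(event).
# W = {0, 1, 2}; hypothesis 1 lives on the partition {{0}, {1,2}},
# hypothesis 2 on {{0,1}, {2}}; each hypothesis gets weight 1/2.
H1 = {(0,): 0.9, (1, 2): 0.1}
H2 = {(0, 1): 0.1, (2,): 0.9}
WEIGHT = 0.5  # both hypotheses weighted equally

def objective(q):
    # Log of the geometric-maximization target: draw a hypothesis, draw a
    # part from it, and score the log-probability Q gives that part.
    with np.errstate(divide="ignore"):
        return sum(WEIGHT * p * np.log(sum(q[i] for i in part))
                   for hyp in (H1, H2) for part, p in hyp.items())

def simplex_grid(n=200):
    # Crude grid over the probability simplex on three worlds.
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            yield np.array([i, j, n - i - j]) / n

Q = max(simplex_grid(), key=objective)
print(Q)  # -> [0.5, 0.0, 0.5]; note the 0 on the middle world
for hyp in (H1, H2):
    for part, p in hyp.items():
        assert sum(Q[i] for i in part) >= WEIGHT * p - 1e-9
```

Note the output here also exhibits the 0s: no hypothesis's sigma algebra contains the middle world on its own, so nothing insists on putting mass there.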
I agree with all your intuition here. The thing about the partial functions is unsatisfactory, because it is discontinuous.
It is trying to be #1, but a little more ambitious. I want the distribution on distributions to be a new type of epistemic state, and the geometric maximization to be the mechanism for converting the new epistemic state to a traditional probability distribution. I think that any decent notion of an embedded epistemic state needs to be closed under both mixing and coarsening, and this is trying to satisfy that as naturally as possible.
I think that the 0s are pretty bad, but I think they are the edge case of the only reasonable thing to do here. I think the reason it feels like the only reasonable thing to do for me is something like credit assignment/hypothesis autonomy. If a world gets probability mass, that should be because some hypothesis or collection of hypotheses insisted on putting probability mass there. You gave an edge case example where this didn’t happen. Maybe everything is edge cases. I am not sure.
It might be that the 0s are not as bad as they seem. 0s seem bad because we have cached that “0 means you can’t update,” but maybe you aren’t supposed to be updating in the output distribution anyway; you are supposed to do your updating in the more general epistemic state input object.
I actually prefer a different proposal for the type of “epistemic state that is closed under coarsening and mixture” that is more general than the thing I gesture at in the post:
A generalized epistemic state is a (quasi-?)convex function ΔW→ℝ. A standard probability distribution is converted to an epistemic state through P ↦ (Q ↦ D_KL(P||Q)). A generalized epistemic state is converted to a (convex set of) probability distribution(s) by taking an argmin. Mixture is mixture as functions, and coarsening is the obvious thing (given a function W→V, we can convert a generalized epistemic state over V to a generalized epistemic state over W by precomposing with the obvious function from ΔW to ΔV).
The above proposal comes together into the formula we have been talking about, but you can also imagine having generalized epistemic states that didn’t come from mixtures of coarse distributions.
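Here is a minimal numerical sketch of that proposal (my own encoding; the toy distributions, grid resolution, and helper names are assumptions, not anything from the post): epistemic states as functions on a gridded ΔW, standard distributions embedded via Q ↦ D_KL(P||Q), mixture as mixture of functions, and conversion back to a distribution via argmin.

```python
import numpy as np
from itertools import product

# Sketch (my own toy encoding) of "generalized epistemic states" over a
# 3-element world W, represented by evaluating them on a grid over ΔW.

def kl(p, q):
    # KL divergence D_KL(p || q), with the convention 0 * log(0/q) = 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def state_from_dist(p):
    # A standard distribution P becomes the epistemic state Q |-> D_KL(P||Q).
    return lambda q: kl(p, q)

def mix(states, weights):
    # Mixture of epistemic states is mixture as functions.
    return lambda q: sum(w * s(q) for s, w in zip(states, weights))

def grid(n=60):
    # Crude grid over the interior of the 2-simplex (interior only, so every
    # candidate Q has full support and the KL denominators are positive).
    for i, j in product(range(1, n), repeat=2):
        if i + j < n:
            yield np.array([i, j, n - i - j]) / n

def argmin_state(state):
    # Convert a generalized epistemic state back to a distribution via argmin.
    return min(grid(), key=state)

P1 = np.array([0.8, 0.1, 0.1])
P2 = np.array([0.1, 0.1, 0.8])
mixed = mix([state_from_dist(P1), state_from_dist(P2)], [0.5, 0.5])
print(argmin_state(mixed))  # -> [0.45, 0.1, 0.45]
```

As a sanity check on the "comes together" claim: for full-support distributions on the same (full) sigma algebra, the argmin of the mixed state is just the linear mixture of the inputs, since argmin over Q of Σᵢ wᵢ D_KL(Pᵢ||Q) is Σᵢ wᵢ Pᵢ; the interesting behavior comes from coarse inputs.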
I think your numbers are wrong, and the right column on the output should say 20% 20% 20%.
The output actually agrees with each of the components on every event in that component’s sigma algebra. The input distributions don’t actually have any conflicting beliefs, and so of course the output chooses a distribution that doesn’t disagree with either.
I agree that the 0s are a bit unfortunate.
I think the best way to think of the type of the object you get out is not a probability distribution on W, but what I am calling a partial probability distribution on W. A partial probability distribution is a partial function 2^W → [0,1] that can be completed to a full probability distribution on W (with some sigma algebra that is a superset of the domain of the partial probability distribution).
I like to think of the argmax function as something that takes in a distribution on probability distributions on W with different sigma algebras, and outputs a partial probability distribution that is defined on the set of all events that are in the sigma algebra of (and given positive probability by) one of the components.
One nice thing about this definition is that it makes it so the argmax always takes on a unique value. (proof omitted.)
This doesn’t really make it that much better, but the point here is that this framework admits that it doesn’t really make much sense to ask about the probability of the middle column. You can ask about any of the events in the original pair of sigma algebras, and indeed, the two inputs don’t disagree with the output at all on any of these sets.
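As a concrete toy (my own encoding of the definition; the particular partial function is made up), a partial probability distribution is just a partial assignment of probabilities to events that some full distribution completes, which we can check by brute force on a small W:

```python
import numpy as np
from itertools import product

# Toy illustration (my own encoding) of a "partial probability distribution":
# a partial function from subsets of W = {0, 1, 2} to [0, 1] that can be
# completed to a full distribution. We brute-force completions on a grid.

partial = {frozenset({0}): 0.5, frozenset({1, 2}): 0.5, frozenset({0, 1}): 0.7}

def completions(n=100):
    # All grid distributions on a 3-element world.
    for i, j in product(range(n + 1), repeat=2):
        if i + j <= n:
            yield np.array([i, j, n - i - j]) / n

def completes(p, partial, tol=1e-9):
    # p completes the partial function if it matches on every defined event.
    return all(abs(sum(p[w] for w in event) - v) <= tol
               for event, v in partial.items())

sols = [p for p in completions() if completes(p, partial)]
print(sols[0])  # -> [0.5, 0.2, 0.3], the unique completion in this toy case
```

Here the domain of the partial function happens to pin down a unique completion; in general (e.g. the two-column inputs above) many completions exist, and the partial distribution simply declines to answer questions like "what is the probability of the middle column?"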
Yeah, I think Thompson sampling is even more robust, but I don’t know much about the nice properties of Thompson sampling besides the density 0 exploration.
Note that the cross entropy (and thus G_{x∼P}[Q(x)]) is dependent on meaningless details of what events you consider the same vs different, but e^(H(P,P)−H(P,Q)) = G_{x∼P}[Q(x)]/G_{x∼P}[P(x)] = G_{x∼P}[Q(x)/P(x)] is not (as much), and when maximizing with respect to Q, this is the same maximization.
(I am just pointing out that KL divergence is a more natural concept than cross entropy.)
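A quick numerical check of that identity, with toy distributions of my own choosing (writing G_{x∼P}[f(x)] for the geometric expectation, i.e. exp of the expected log):

```python
import numpy as np

# Toy check (distributions are my own) that e^(H(P,P) - H(P,Q)) equals the
# geometric expectation G_{x~P}[Q(x)/P(x)], where H(P,Q) = -sum_x P(x) log Q(x).
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def geom_expect(p, f):
    # Geometric expectation: prod_x f(x)^p(x) = exp(sum_x p(x) log f(x)).
    return np.exp(np.sum(p * np.log(f)))

lhs = np.exp(cross_entropy(P, P) - cross_entropy(P, Q))
assert np.isclose(lhs, geom_expect(P, Q) / geom_expect(P, P))
assert np.isclose(lhs, geom_expect(P, Q / P))

# Splitting the first event into two equal halves changes the cross entropy
# (a "meaningless detail" of how events are individuated) ...
P2 = np.array([0.25, 0.25, 0.3, 0.2])
Q2 = np.array([0.1, 0.1, 0.5, 0.3])
assert not np.isclose(cross_entropy(P, Q), cross_entropy(P2, Q2))
# ... but leaves G_{x~P}[Q(x)/P(x)] = e^(-D_KL(P||Q)) unchanged.
assert np.isclose(geom_expect(P, Q / P), geom_expect(P2, Q2 / P2))
```

The last line is the sense in which KL divergence is the refinement-invariant object here: G_{x∼P}[Q(x)/P(x)] is exactly e^(−D_KL(P||Q)).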
You draw a part, which is a subset of W, and thus has a probability according to Q.
Yeah, the Thompson sampling and Nash bargaining are different in that the Thompson sampling proposal has two argmaxes, whereas Nash bargaining only has one. There are really two things being brought in with Thompson sampling: Plurality is what you get if you only add the inner argmax, and something like Nash bargaining is what you get if you only add the geometric part. There is no reason you have to add the two things at the same place. All I know is Thompson sampling has some pretty nice asymptotic guarantees.
You could just Nash bargain between your hypotheses directly, but then you are dependent on where the 0 point is. One nice thing about Thompson sampling is that it gives you a semi-principled place to put the 0, because the inner argmax means we convert everything to probabilities.
Yeah, that is correct.
The thing I originally said about (almost) uniqueness was maybe wrong. Oops! I edited, and it is correct now.
To see that there might be many solutions under the weakest notion of egalitarianism, consider the case where there are three people, A, B, and C, with utility a, b, c, and each with probability 1/3. The constraints on utility are that a ≤ 1 and that b² + c² ≤ 1. The thing is that if we give a small enough weight to A, then almost everything we can do with B and C will be egalitarian, and anything on the Pareto frontier that gives both B and C positive utility will be able to be simultaneously egalitarian and utilitarian.
You can’t run into this problem with two people, or with everyone ending up with the same utility.
Here is a proof that we get existence and uniqueness if we also have the constraint that everyone ends up with the same utility. The construction in the main post gives existence, because everyone has utility 1.
For uniqueness, we may take some point that satisfies utilitarianism, egalitarianism, and gives everyone the same utility. WLOG, it gives everyone utility 1, and is utilitarian and egalitarian with respect to the weight vector that gives everyone weight 1. This point maximizes the expected utility with respect to your probability distribution. It also maximizes the expected logarithm of utility. This is because it achieves an expected logarithm of 0, and the concavity of the logarithm says that the expectation of the logarithm is at most the logarithm of the expectation, which is at most the logarithm of 1, which is 0. Thus, this point is a Nash bargaining solution (i.e. the point that maximizes expected log utility), and since the Nash bargaining solution is unique, it must be unique.
Note this is only saying the utility everyone gets is unique. There still might be multiple different strategies to achieve that utility.
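The concavity step in the proof can be sanity-checked numerically (the random feasible points below are my own construction, just illustrating the Jensen inequality): whenever the expected utility is at most 1, the expected log utility is at most log(1) = 0, so the all-ones point maximizes both.

```python
import numpy as np

# Numeric sanity check (my own random construction) of the Jensen step:
# if E[u] <= 1 then E[log u] <= log E[u] <= 0, so a point giving everyone
# utility 1 (expected log utility exactly 0) also maximizes expected log
# utility among points whose expected utility is at most 1.
rng = np.random.default_rng(0)
for _ in range(10_000):
    p = rng.dirichlet(np.ones(5))        # probability distribution over 5 people
    u = rng.uniform(0.01, 3.0, size=5)   # candidate utility profile
    if p @ u <= 1.0:                     # no better than the all-ones point on E[u]
        assert p @ np.log(u) <= 1e-12    # ... then its Nash/log score is <= 0
```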
Sorry for the (possible) error! It might be that the original thing turns out to be correct, but it depends on details of how we define the tiered egalitarian solution.