You give a natural construction of a coherent probability assignment given a probability measure over sentences. Now, there is another such natural construction. My hunch is that they are the same: it would be interesting to prove or disprove this.
What is the other construction? Start with an empty set of known sentences. On each iteration, pick a sentence at random using the probability measure on sentences. If it is consistent with the current set, add it to the set. Continue ad infinitum. The probability assigned to a sentence is the probability that it will be added to the known set at some point during this procedure. I think it is very close to the proposal by Abram Demski (maybe identical? I need to reread it to remember).
There is a general probability-theoretic analogue of this situation. Consider a set X and a countable system of subsets Si. Equip X with the sigma algebra generated by the Si. Suppose a probability measure on the set of indices {i} is given. Then the analogue of the above procedure is as follows. Start with the set Y = X. On each iteration, pick an index i at random using the probability measure on indices. If Si intersects Y, replace Y by the intersection. Continue ad infinitum. The probability of a measurable subset T of X is the probability that it will contain Y at some point during this procedure.
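For concreteness, here is a minimal Python sketch of this procedure, under the assumption that X is finite so the infinite run can be truncated (Y only shrinks, so in the finite case it stabilizes after finitely many effective steps). The function names are mine; the sentence-based procedure above is the special case where each sentence is read as the set of models satisfying it.

```python
import random

def run_procedure(X, S, mu, steps=200, rng=random):
    """One truncated run: repeatedly pick an index i with probability mu[i];
    if S[i] intersects the current set Y, replace Y by the intersection."""
    Y = frozenset(X)
    idx, wts = list(S), [mu[i] for i in S]
    for _ in range(steps):
        i = rng.choices(idx, wts)[0]
        if Y & S[i]:
            Y &= S[i]
    return Y

def estimate_prob(X, S, mu, T, runs=10000):
    """Estimate the probability assigned to T: the chance that Y is
    eventually contained in T. Since Y only shrinks, containment in T
    persists, so checking the final Y suffices."""
    return sum(run_procedure(X, S, mu) <= frozenset(T)
               for _ in range(runs)) / runs
```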
Your construction can also be generalized to this setting. Just define the Bayes score of an assignment by summing over the Si. Coherence of a probability assignment to the Si corresponds to the assignment coming from a probability measure on X. Models correspond to elements of X. So maybe both constructions coincide in the general setting?
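Under one reading of this (an assumption on my part, though it matches the score computation at the end of this thread), the Bayes score of an assignment relative to a point x of X is the mu-weighted sum of the log-probabilities assigned to the correct side of each question "is x in Si?". A sketch:

```python
import math

def bayes_score(P, mu, member):
    """Bayes score of the assignment P relative to a point x of X.

    P[i]      -- probability assigned to "x is in S_i"
    mu[i]     -- weight of the question "in S_i or not?"
    member[i] -- whether x actually lies in S_i

    The score is sum_i mu[i] * log(probability of the correct answer).
    """
    return sum(w * math.log(P[i] if member[i] else 1.0 - P[i])
               for i, w in mu.items())
```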
I have already presented this to Abram Demski, and he and I have been working together on trying to prove my conjecture. (He and I are both in Los Angeles, and coincidentally are interested in the same question, so it is likely to be the direction that the MIRIxLosAngeles workshop continues to focus on.)
Your proposal is equivalent to Abram’s proposal. We believe the two distributions are not the same. I think we checked this for some small finite analogue.
Your “general” setting does not seem that much more general to me; it seems pretty much identical, only reworded in terms of set theory instead of logic. There is one way in which it is more general: in my system, the set of subsets must be closed under union, intersection, and complement, and this is not true for a general collection of subsets. However, my construction does not work without this assumption. I actually use the fact that \mu is nowhere zero, and not being closed under union, intersection, and complement is kind of like having some sets have measure 0.
I think the language of subsets instead of logic makes things a little easier to think about for some people. I think I prefer the logic language. (However, when I wanted to think about small finite examples, I would sometimes think in the subset language instead.)
Firstly, I’d love to see the counterexample to the distributions being the same.
Secondly, are you sure that \mu being nowhere zero is essential? Intuitively, your uniqueness result should work whenever, for every two models M1, M2, there is a sentence \phi separating them with \mu(\phi) non-zero. But I haven’t checked it formally.
At the very least, my conjecture is not true if \mu is not nowhere zero, which was enough for me to ignore that case, because (see my response to cousin_it) what I actually believe is that there are three very different definitions that all give the same distribution, which I think makes the distribution stand out a lot more as a good idea. Also, if \mu is sometimes zero, we lose uniqueness, because we don’t know what to do with sets that our Bayes score does not care about. The fact that we can do whatever we want with these sets also takes away coherence. (Maybe we could artificially require coherence, but I really don’t want to do that, because the whole reason I like this approach so much is that it didn’t require coherence: coherence came out for free.)
Well, for example, if the Si consist of only one set A, Abram will think we are in that set, while I will think we are in that set with probability 1/2. Now, you could require that every sentence has the same \mu as its negation (corresponding to putting in the sentence or its negation with probability 1/2 each in Abram’s procedure). In that case, partition X into 3 sets A, B, and C, where the “in A or not in A” question is given weight muA, and similarly define muB and muC.
Let muA=1/2, muB=1/4 and muC=1/4.
Abram’s procedure will:

- with probability 1/4, choose A first, ending in A;
- with probability 1/8, choose B first, ending in B;
- with probability 1/8, choose C first, ending in C;
- with probability 1/8, choose not-A first and then settle on B;
- with probability 1/8, choose not-A first and then settle on C;
- with probability 1/16, choose not-C first and end up in B;
- with probability 1/16, choose not-B first and end up in C;
- with probability 1/8, choose not-B or not-C first and end up in A.

In the end, P(A) = 1/4 + 1/8 = 0.375.
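A quick Monte Carlo sketch (my own encoding of Abram’s procedure as described above; the seed and run count are arbitrary) reproduces this figure:

```python
import random

def abram_trial(mu=(0.5, 0.25, 0.25), rng=random):
    """One run of Abram's procedure on the partition {A, B, C}: pick a
    question with probability mu, assert the set or its complement with
    probability 1/2 each, and keep only consistent assertions. Returns
    the cell the run settles on."""
    names = ('A', 'B', 'C')
    cells = set(names)                # cells still consistent with assertions so far
    while len(cells) > 1:
        q = rng.choices(names, mu)[0]
        positive = rng.random() < 0.5   # assert "in q" vs. "not in q"
        new = (cells & {q}) if positive else (cells - {q})
        if new:                         # skip inconsistent assertions
            cells = new
    return next(iter(cells))

random.seed(0)
runs = 100_000
print(sum(abram_trial() == 'A' for _ in range(runs)) / runs)  # ~ 0.375
```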
Notice that Abram’s solution gives a different Bayes score when in set A than when in the other two sets. Mine will not. My construction gives A probability p, where p makes the Bayes score constant across models:
1/2 log p + 1/4 log(1-(1-p)/2) + 1/4 log(1-(1-p)/2) = 1/2 log(1-p) + 1/4 log((1-p)/2) + 1/4 log(1-(1-p)/2)

Cancelling the common 1/4 log(1-(1-p)/2) term and multiplying through by 4:

2 log p + log(1-(1-p)/2) = 2 log(1-p) + log((1-p)/2)

Exponentiating:

p^2 (1-(1-p)/2) = (1-p)^2 ((1-p)/2)

Multiplying both sides by 2:

p^2 (1+p) = (1-p)^3

p ≈ .39661
If you check this p value, you should see that the Bayes score is independent of the model.
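Here is a sketch of that check (bisection on [0, 1] works since p^2 (1+p) - (1-p)^3 goes from -1 to 2 there):

```python
import math

def f(p):
    # difference of the two sides of p^2 (1 + p) = (1 - p)^3
    return p * p * (1 + p) - (1 - p) ** 3

lo, hi = 0.0, 1.0
for _ in range(60):                      # bisect down to machine precision
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
p = (lo + hi) / 2
print(round(p, 5))                       # 0.39661

# Bayes score in the A-model vs. the B-model (the C-model matches B by symmetry):
score_A = (0.5 * math.log(p) + 0.25 * math.log(1 - (1 - p) / 2)
           + 0.25 * math.log(1 - (1 - p) / 2))
score_B = (0.5 * math.log(1 - p) + 0.25 * math.log((1 - p) / 2)
           + 0.25 * math.log(1 - (1 - p) / 2))
print(score_A, score_B)                  # equal up to floating-point rounding
```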