cousin_it comments on Maximize Worst Case Bayes Score

cousin_it 19 Jun 2014 9:14 UTC
2 points
0
Let’s view this as a zero-sum game between you and Nature. The game state at each step is a set of sentences, which is initially empty. At each step, we consider the lowest numbered sentence that is independent from the set. You choose a probability to assign to that sentence, then Nature chooses whether to add that sentence or its negation to the set. The outcome of the game is your Bayes score after infinitely many steps. Every coherent probability assignment corresponds to a strategy you can play in this game and vice versa, while every model corresponds to a play of the game by Nature.

You have proved that the game has a minimax value. Is it true that you can achieve that value by always choosing your move so that the two possible subgames after Nature’s next move have equal minimax values? I think your conjecture would follow from that, because your Bayes score on a particular model is the limit of subgame values on the corresponding path through the game tree.

A proof could go something along these lines. Let’s say your move will assign probability p to the next sentence. The value of the left subgame increases monotonically from negative infinity at p=0, and the value of the right subgame decreases monotonically to negative infinity at p=1. So there’s a unique p where they cross over, and choosing any other p would be worse.

The above argument also seems to show that the assignment you found is the only coherent assignment that assigns equal finite Bayes scores to all models. The reason is that any other such assignment would correspond to a strategy that doesn’t always keep the WCB of the left and right subgames equal. Then we can define two strategies for Nature, “always choose the subgame with the lower WCB” and “always choose the subgame with the higher WCB”, which would give you different Bayes scores.

That leads to your second conjecture. If you start with a constant ¹⁄₂ assignment and apply the update move repeatedly, you get a sequence of assignments whose WCB increases. You can use your convergence trick to get a pointwise converging subsequence from that. If the limit is a coherent assignment, then we know that it assigns equal Bayes score to all models, so it must be the one you found. Otherwise we can apply the update move some more.

It’s a bit unsatisfying that we had to pick a converging subsequence, but I think we can also prove that the whole sequence converges. Namely, if the sequence of probabilities for any sentence has more than one limit point, then we can do the above procedure with both of those limit points separately. Since we’re guaranteed to reach the same result on both paths, that means both limit points must be the same.

Sorry about the handwaving, it’s very likely that at least some of these arguments are wrong :-)
- Scott Garrabrant 19 Jun 2014 19:01 UTC
  2 points
  0
  Parent
  So this is wrong. Imagine that P (The WCB maximizer) does the same on all models except one, and does better on that one model. Restricting to any finite list of sentences will not change the worst case bayes score. (In fact, I can prove that the set of models with bayes score near WCB must be everywhere dense.)
  
  You have said a lot of correct things that are important though.
  
  First, If the first conjecture is true, then an alternate defintion of P is “The unique coherent probability assignment for which Bayes score is constant,” which seems like an even nicer definition, and a big reason why I want to prove it. (Especially since the two definitions are the same)
  
  Second, The sequence in my second conjecture has the property that any limit point is coherent. If I were to show also that any limit point is flat (Bayes is not a constant of model), Then I would be able to extend this proof to a proof that the sequences converge, and I would be able to prove everything from this.
  
  The problem is that a limit of flat probability distributions is not necessarily flat. I actually basically have a paper written up which give the three definitions of P, and shows they are all well defined and equal (except I am missing the proof that the limit of my procedure must be flat). This paper uses the sequence definition to prove welldefinedness of the other two definitions. The stuff I wrote on the blog post is not proven that way in the paper because I had to figure out a way to reprove them without all the tools I got from analyzing the sequence.
  - cousin_it 20 Jun 2014 17:21 UTC
    2 points
    0
    Parent
    Yeah, you’re right. I assumed that Bayes(M, P) is continuous in both arguments, but we don’t know what that means or how to prove it.
    
    It’s not clear how to define continuity in P for a given M, because pointwise convergence doesn’t cut it and we don’t have any other ideas. But continuity in M for a given P seems to be easier, because we can define the distance between two M’s as the total weight of sentences where they disagree. Under this definition, I think I’ve found an example where Bayes(M, P) is not continuous in M.
    
    1) Let’s take a language consisting of sentences phi_k = “x=k” for each natural k, and all their combinations using logical connectives (and/or/not). This language has models M_k = “phi_k is true, all others are false” for each k, and also M_inf = “all phi_k are false”.
    
    2) To define the weights, fix c<1 and set mu(phi_k) = c*2^-k for each k, and distribute the remaining 1-c of weight among the remaining sentences arbitrarily.
    
    3) For every k, any statement where M_k and M_inf disagree must include phi_k as a component, because all other phi_n have the same truth value in M_k and M_inf. The total weight of sentences that include phi_k goes to zero as k goes to infinity, so the model M_inf is the limit of models M_k under the above metric.
    
    4) Now let’s define a probability assignment P so that P(phi_k) = (2^(-2^k))/(1+2^(-2^k)) for each k, and P(anything else) = ¹⁄₂. You can check that Bayes(M_k, P) has the same value for each k, but Bayes(M_inf, P) has a different, higher value.
    - Scott Garrabrant 20 Jun 2014 19:37 UTC
      0 points
      0
      Parent
      I actually thought of exactly this construction at our last MIRIx on Saturday. It made me sad. However here is a question I couldn’t answer. Can you do the same thing with a coherent P? That I do not know.
      
      If Bayes is continuous in M for coherent P then we would be able to prove the first conjecture.
      
      Something slightly easier, maybe we can show that if Bayes is not continuous at the point M for coherent P, then P must assign nonzero probability to M. I am also not sure if this would prove the conjecture, but haven’t really thought about this angle yet.
      
      Another interesting thing to think about: What happens if we start with the P you described and then do the algorithm in the second conjecture.
      
      I haven’t told you everything about this algorithm yet. One frustrating thing is that if you want to change a probability from p to q in one step, you can write down how much your bayes score increases. Since bayes score is bounded above, you can only do this finitely many times. However, if you update a probability from p to q in a series of very small steps, you could in theory only increase your bayes score by an arbitrarily small amount. :(
      
      I just recently started trying to prove the first conjecture without the second conjecture, so I have more hope that there is something easy there I just missed.
      
      I think you know most of what I know about the first conjecture. Abram gave me a sketch of a proof that if P is the WCB maximizer then WCB(P) is Exp(Bayes(M,P)) (when M is chosen according to P), but we haven’t really checked it. If you (or someone else who would like to try to find this proof) would like to know more about the second conjecture, you can pm me your email, I will send you my personal notes, and we can set up a skype chat?
      
      The second conjecture implies the first, and If that is true, I would really like to know, because I think it makes the result much nicer.
  - cousin_it 20 Jun 2014 13:55 UTC
    0 points
    0
    Parent
    Yeah, you’re right. I was assuming that Bayes(M, P) is continuous in both arguments in some sense. But we haven’t proved that, and now I think I found a counterexample.
    
    Consider the language that consists of countably many sentences, each of which contradicts all the others. So the only valid models are M_k = “the kth sentence is true, all the rest are false” for each k, and M_inf = “all sentences are false”. Let’s say that the kth sentence has weight 2^-k. Now let’s define a probability assignment P that assigns to the kth sentence the probability P_k = (2^(-2^k))/(1+2^(-2^k)). You can check that Bayes(M_k, P) has the same value for all k, but Bayes(M_inf, P) has a different and higher value.
    
    That seems to falsify most naive statements like “Bayes(M, P) is continuous in M for a given P, where the distance between two M’s is the total weight of sentences where they disagree”. There’s probably some simple argument that falsifies continuity in P as well, once we define a metric on P’s (which I don’t know how to do).
  - cousin_it 19 Jun 2014 20:17 UTC
    0 points
    0
    Parent
    
    Imagine that P (The WCB maximizer) does the same on all models except one, and does better on that one model.
    
    I don’t think that’s possible. Bayes(M, P) for a fixed P is a continuous function of M, if you define the distance between two models as the total weight of sentences where they disagree. Or am I missing something?
    
    The problem is that a limit of flat probability distributions is not necessarily flat.
    
    Wait, how can that be? Do you have an example?