I am confused about your use of the word optimal. In particular in the sentences
Bayesian methods are optimal (except for computational considerations).
and
Bayesian methods are locally optimal, not global optimal.
are you talking about the same sort of ‘optimal’? From Wikipedia (here and here) I found the rigorous definition of the word ‘optimal’ in the second sentence, which can be written in terms of your utility function (a decision rule is optimal if there is no other decision rule which will always give you at least as much utility and in at least one world will give you more utility).
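For reference, the standard decision-theoretic formalization of that parenthetical definition (my notation, not from either post): writing U(w, d) for the utility of following decision rule d in world w, a rule d' dominates d if

    U(w, d') \ge U(w, d) \text{ for every world } w, \qquad U(w, d') > U(w, d) \text{ for at least one } w,

and a rule that no other rule dominates is what statisticians call admissible (a term that comes up again below).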
Also I agree with many of your myths, namely 3, 8, 9 and 11, and I was rather surprised to see that these things even needed to be mentioned. I don’t see why making good trade-offs between truth and computation time should be ‘simple’ (3); as you mentioned, the frequentist tests are chosen precisely with robustness in mind (8); bad science is more than getting your statistics wrong (9) (small sidenote: while it might be true that scientists can get confused by frequentist statistics, which might corrupt their science, I don’t think the problem would be smaller when using a different form of statistics, so it would not be correct to attribute this bad science to frequentism); and we know from practice that it is Bayesianism, not frequentism, that has the most problems with computational bounds (11).
However, I think it is important to make a distinction between the validity of Bayesianism and the application of Bayesianism. I recall reading on lesswrong (although I cannot find the post at this moment) that the relation between Bayesianism and frequentism should be seen like the relation between Quantum Mechanics and classical physics (although QM has lots of experimental data to support it, so it is rightfully more accepted than Bayesianism). Like QM, Bayesianism is governed by simple mathematical rules (Schrödinger’s equation and Bayes’ theorem), which will give the right answer when supplied with the correct initial conditions. However, to fly a plane we do not invoke QM, and similarly, in most practical instances of estimating a parameter, we will not invoke Bayes. Instead we use approximations (classical physics/frequentism), which will not give the exact answer but will give a good approximation thereof (as you mention: a method close to the global optimum, although I am still unclear what we are optimising for there). The key point is that these approximations are correct only insofar as they approximate the answer that would be given by the correct theory. If classical physics and QM disagree, then QM is correct. Similarly, if we have a parameter estimate obtained by a Bayesian algorithm and one obtained by a frequentist algorithm, the Bayesian one is going to give the correct subjective probability. But the correct algorithms are (nearly?) impossible to implement, so we stick with the approximations. This is why physicists still use and teach classical physics, and why I personally endorse many frequentist tools. The difference between validity and application seems to be lost in myths 4-7 and 10:
4: Strictly speaking, the only way to truly share your arguments for having a certain degree of belief in a hypothesis would be to share all sensory data that is dependent on the hypothesis (after all, this is how evidence works). This is clearly not feasible, but it would be the correct thing to do if we only cared about being correct. You explain in this myth that this does not lead to a simple and quick algorithm. But this is not an argument against validity; it is an argument against a possible application.
5: Again this whole myth deals with application. The myth you debunk states that the approximations made when turning degrees of belief into an actual strategy must be bad, and you debunk this by giving an algorithm that gets very good results. But this is not an argument that distinguishes between Bayesianism and frequentism; it merely states that there are easy-to-compute (in a relative sense) algorithms that get close to the correct answer, which we know how to find in theory but not in practice. (In case you are wondering: the approximation takes place in the step where you simplify your utility function to the Regret function, and your prior is whatever decision rule you use for the first horse.)
6: This myth hinges on the word ‘simple’. Frequentist methods can deal with many complicated problems, and a lot of high quality work has been done to increase the scope of the tools of frequentism. Saying that only simple models can be dealt with would be an insult. However, as mentioned above, these methods are all approximations, and each method is valid only if the approximations made are satisfied. So while frequentist methods can deal with many complicated models, it is important to realise that the scope of each method is limited.
7+10: Myth 10 seems to be a case of confusion by the people using the tools. Frequentist methods (derived from approximations) come with boundaries, such as limitations on the type of model that can be distilled from data or limitations on the meaning of the outcome of the algorithm (it might answer a different question than the one you hoped to answer). If you break one of these limitations it is not surprising that the results are wacky. This is not a problem of frequentism, provided the tools are explained properly. If the tools are not explained properly then problems arise. Your explanation (we have a class M and a solution E, and we look for a simple approximation which will give E + epsilon) is very clear. Problems arise when the class M is not specified, or the existence of E is unclear. I would like to classify this as an error of the practitioners of frequentism, rather than an error of the method.
Lastly I would like to make a small note that the example on myth 10 is very similar to something called the Boltzmann distribution from statistical physics, discovered in the 19th century. Here the function phi is the energy divided by the temperature.
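For reference, the standard form (textbook statistical mechanics, not from the post): the Boltzmann distribution assigns a state with energy E_i the probability

    p_i = \frac{e^{-E_i/(kT)}}{Z}, \qquad Z = \sum_j e^{-E_j/(kT)},

so with \phi_i = E_i/(kT) we get p_i \propto e^{-\phi_i}, which is the role phi plays above (up to the sign convention).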
Edit: it seems that while I was writing this post, other people already made this remark on myth 10. I agree that physicists would probably not interpret this as a game played between nature and the predictor.
However, I think it is important to make a distinction between the validity of Bayesianism and the application of Bayesianism. I recall reading on lesswrong (although I cannot find the post at this moment) that the relation between Bayesianism and frequentism should be seen like the relation between Quantum Mechanics and classical physics
Quantum Mechanics isn’t consistent with General Relativity, our best explanation of gravity. Despite decades of trying, neither can be formulated as an approximation of the other. Even if one day physicists finally figure out a “Theory of Everything”, it would still be a model. It would be epistemically incorrect to claim it was “exact”.
Curiously, there is one interpretation of QM known as Quantum Bayesianism, which holds that wavefunctions are subjective and they are the fundamental concepts for reasoning about the world, and subjective probability distributions arise as approximations of wavefunctions under decoherence. That is, Bayesianism itself might be an approximation of a “truer” epistemic theory!
My humble opinion is that there is no ultimately “true” epistemic theory. They are all just models of what humans do to gain knowledge of the world. Some models can work better than others, often within certain bounds, but none of them is perfect.
I am very interested in Quantum Bayesianism (in particular Leifer’s work) because one of the things we have to do to be “quantum Bayesians” is figure out a physically neutral description of quantum mechanics, that is, a description of quantum mechanics that doesn’t use physical jargon like ‘time.’ In particular, I believe physicists describe spacelike and timelike separated entanglement differently.
That is, a Bell inequality violation system (one where B and C are spacelike separated) has this graph:
A → B <-> C ← D
(where famously, due to Bell inequality violation, there is no hidden variable corresponding to the bidirected arc connecting B and C).
But the same system can arise in a temporally sequential model which looks like this:
A → B → D → C, with B <-> C
where an appropriate manipulation of the density matrix corresponding to this system ought to give us the Bell system above. In classical probability we can do this. In other words, in classical probability the notion of “probabilistic dependence” is abstracted away from notions like time and space.
Also we have to figure out what “conditioning” even means. Can’t be Bayesian if we don’t condition, now can we!
where an appropriate manipulation of the density matrix corresponding to this system ought to give us the Bell system above. In classical probability we can do this. In other words, in classical probability the notion of “probabilistic dependence” is abstracted away from notions like time and space.
Yes, but the notion of Bayesian inference, where you start with a prior and build a sequence of posteriors, updating as evidence accumulates, has an intrinsic notion of time. I wonder if that’s enough for Quantum Bayesianism (I haven’t read the original works, so I don’t really know much about it).
The temporal order for sequential computation of posteriors is just our interpretation; it is not part of the formalism. If we get pieces of evidence e1, e2, …, ek in temporal order, we could do Bayesian updating in the temporal order, or in the reverse of the temporal order, and the formalism still works (that is, our overall posterior will be the same, because all the updates commute). And that’s because Bayes’ theorem says nothing about time anywhere.

Exactly!
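A minimal numerical sketch of the commutativity claim above (a hypothetical two-hypothesis example with made-up likelihoods; it assumes each piece of evidence e_k has a fixed likelihood P(e_k | h) under each hypothesis h):

    import itertools

    def update(prior, likelihoods):
        # One Bayesian update: multiply each hypothesis's probability by the
        # likelihood of the observed evidence under it, then renormalize.
        posterior = [p * l for p, l in zip(prior, likelihoods)]
        total = sum(posterior)
        return [p / total for p in posterior]

    prior = [0.5, 0.5]                               # two hypotheses
    evidence = [[0.8, 0.3], [0.2, 0.6], [0.9, 0.5]]  # P(e_k | h) for k = 1..3

    # Apply the three updates in every possible order.
    posteriors = set()
    for order in itertools.permutations(evidence):
        belief = prior
        for e in order:
            belief = update(belief, e)
        posteriors.add(tuple(round(p, 12) for p in belief))

    assert len(posteriors) == 1  # identical posterior in all 6 orders

The updates commute because each one is just a pointwise multiplication, and the normalizations cancel at the end.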
Thanks for your comments. One thing you say a few times throughout your comment is “frequentist methods are an approximation to Bayes”. I wouldn’t agree with this. I think Bayesian and frequentist methods are often trying to do different things (although in many practical instances their usage overlaps). In what sense do you believe that Bayes is the “correct” answer?
At the beginning of your comment, I would have used “admissible” rather than “optimal” to describe the definition you gave:
a decision rule is optimal if there is no other decision rule which will always give you at least as much utility and in at least one world will give you more utility
I don’t see how the online learning algorithm in myth 5 can be interpreted as an approximation to Bayes. The guarantee I’m getting just seems way better and more awesome than what Bayes provides. I also don’t think it’s right to say that “regret is an approximation to utility”. Regret is an alternative formulation to utility that happens to lead to a set of very fruitful results, one of which I explained under myth 5.
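For concreteness, here is the textbook multiplicative-weights (Hedge) learner that gives guarantees of the kind myth 5 describes; this is my sketch of the standard algorithm, not necessarily the exact one from the post:

    import math
    import random

    def hedge(loss_stream, n_experts, eta=0.1):
        # Keep a weight per expert; play the weighted average; exponentially
        # downweight experts as they accumulate loss. For losses in [0, 1]
        # this guarantees regret O(sqrt(T log N)) against the best expert in
        # hindsight, with no probabilistic assumptions on the loss sequence.
        weights = [1.0] * n_experts
        algo_loss, expert_loss = 0.0, [0.0] * n_experts
        for losses in loss_stream:
            z = sum(weights)
            algo_loss += sum(w / z * l for w, l in zip(weights, losses))
            for i, l in enumerate(losses):
                expert_loss[i] += l
                weights[i] *= math.exp(-eta * l)
        return algo_loss - min(expert_loss)  # the regret

    random.seed(0)
    stream = [[random.random() for _ in range(5)] for _ in range(1000)]
    print(hedge(stream, 5))  # small compared to T = 1000

The guarantee is relative (loss compared to the best expert in hindsight) and worst-case over loss sequences, which is the flavor of result being contrasted with Bayes here.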
While writing this answer I realised I forgot an important class of exceptions, namely the typical school example of hypothesis testing. My explanation now consists of multiple parts.
To answer the first question: the Bayesian method gives the “correct” answer in the sense that it optimises the expectation of your utility function. If you choose a utility function like log(p), this means that you will report your subjective probabilities. I also think Bayesianism is “correct” in the philosophical sense (which is a property of the theory), but I believe there are many posts on lesswrong that can explain this better than I can.
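The fact being leaned on here is that the logarithmic score is a proper scoring rule (a standard result, in my notation): if your true belief is p and you report q, your expected utility is

    \sum_i p_i \log q_i = \sum_i p_i \log p_i - D_{\mathrm{KL}}(p \| q),

and since D_{\mathrm{KL}}(p \| q) \ge 0 with equality iff q = p, the expectation is maximised exactly by reporting q = p, i.e. your actual subjective probabilities.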
The approximation made can often be rewritten in terms of a particular choice of utility function (or risk function, which is more conventional according to Wikipedia). As you mentioned, choosing the Regret function for cost and a non-silly prior (for example whichever one you are using) will yield a Bayesian algorithm for your problem. Unfortunately I haven’t looked at the specific algorithm in detail, but if admissible solutions are Bayesian algorithms, why would a Bayesian approach using your data not perform at least as well as the frequentist algorithm (and therefore produce at least as good asymptotic behaviour)? Also I would like to leave open the possibility that the algorithm you mention actually coincides with a Bayesian algorithm. Sometimes a different approach (frequentism/Bayesianism) can lead to the same conclusion (method).
Suppose I find myself in a situation in which I have several hypotheses and a set of data. The thing I’m interested in is the probability of each hypothesis given the data (in other words, finding out which hypothesis is correct). In frequentism there is no such thing as a ‘probability of the hypothesis’, after all a hypothesis is either true or false and we don’t know which. So as a substitution frequentists consider the other conditional probability, the probability of seeing this data or worse provided the hypothesis is true, where worse must be defined beforehand. I’d say this is a wrong approach, a very very wrong approach. My opinion is that frequentists have adopted an incorrect worldview which leads them to dismiss and answer the wrong questions in this case. Here I expect pure conflict rather than some Bayesian approach which will coincide with frequentist methods.
I hope this explains how Bayesian and frequentist methods overlap and seem to disagree sometimes, and how many instances of frequentist algorithms should be compared to Bayesian algorithms with a properly chosen utility function.
Say I am interested in distinguishing between two hypotheses for p(a,b,c,d) (otherwise unrestricted):
hypothesis 1: “A is independent of B, C is independent of D, and nothing else is true”
hypothesis 2: “no independences hold”
Frequentists can run their non-parametric marginal independence tests. What is the (a?) Bayesian procedure here? As far as I can tell, for unrestricted densities p(a,b,c,d) no one knows how to write down the likelihood for H1. You can do a standard Bayesian setup here in some cases, e.g. if p(a,b,c,d) is multivariate normal, in which case H1 corresponds to a (simple) Gaussian ancestral graph model. Maybe one can do some non-parametric Bayes thing (???). It’s not so simple to set up the right model sometimes, which is what Bayesian methods generally need.
You should check out chapter 20 of Jaynes’ Probability Theory, which talks about Bayesian model comparison.
We wish to calculate P[H1 | data] / P[H2 | data] = P[data | H1] / P[data | H2] * P[H1] / P[H2].
For Bayesians, this problem does not involve “unrestricted densities” at all. We are given some data and presumably we know the space from which it was drawn (e.g. binary, categorical, reals...). That alone specifies a unique model distribution. For discrete data, symmetry arguments mandate a Dirichlet model prior with the categories given by all possible outcomes of {A,B,C,D}. For H2, the Dirichlet parameters are updated in the usual fashion and P[data | H2] calculated accordingly.
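For the unconstrained H2 that calculation has a closed form, the Dirichlet-multinomial marginal likelihood. A sketch, with a made-up symmetric prior and made-up counts (the multinomial coefficient is omitted since it cancels in the Bayes factor):

    from math import lgamma

    def log_evidence(counts, alpha=1.0):
        # log P[data | H2] for a symmetric Dirichlet(alpha, ..., alpha) prior
        # over the joint outcomes: B(alpha + n) / B(alpha) in log form.
        a_total = alpha * len(counts)
        n_total = sum(counts)
        log_ev = lgamma(a_total) - lgamma(a_total + n_total)
        for n_i in counts:
            log_ev += lgamma(alpha + n_i) - lgamma(alpha)
        return log_ev

    # counts[i] = observations of the i-th of the 16 joint outcomes of (A,B,C,D)
    counts = [3, 0, 1, 2, 5, 1, 0, 0, 2, 4, 1, 1, 0, 3, 2, 0]  # made up
    print(log_evidence(counts))

No such closed form is available for P[data | H1], because of the ratio constraints described next; that is where the pain lives.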
For H1, our Dirichlet prior is further restricted according to the independencies. The resulting distribution is not elegant (as far as I can tell), but it does exist and can be updated. For example, if the variables are all binary, then the Dirichlet for H2 has 16 categories. We’ll call the 16 frequencies X0000, X0001, X0010, … with parameters a0000, a0001, … where the XABCD are the probabilities which the model given by X assigns to each outcome (a dot denotes summing out that variable, so X00.. is the marginal probability that A = 0 and B = 0). Already, the Dirichlet for H2 is constrained to {X | sum(X) = 1, X > 0} within R^16. The Dirichlet for H1 is exactly the same function, but further constrained to the space {X | sum(X) = 1, X > 0, X00.. / X10.. = X01.. / X11.., X..00 / X..10 = X..01 / X..11} within R^16. This is probably painful to work with (analytically at the very least), but is fine in principle.
So we have P[data | H1] and P[data | H2]. That just leaves the prior probabilities for each model. At first glance, it might seem that H1 has zero prior, since it corresponds to a measure-zero subset of H2. But really, we must have SOME prior information lending H1 a nonzero prior probability or we wouldn’t bother comparing the two in the first place. Beyond that, we’d have to come up with reasonable probabilities based on whatever prior information we have. Given no other information besides the fact that we’re comparing the two, it would be 50/50.
Of course this is all completely unscalable. Fortunately, we can throw away information to save computation. More specifically, we can discretize and bin things much like we would for simple marginal independence tests. While it won’t yield the ideal Bayesian result, it is still the ideal result given only the binned data.
I am a bit curious about the non-parametric tests used for H1. I am familiar with tests for whether A and B are independent, and of course they can be applied between C and D, but how does one test for independence between both pairs simultaneously without assuming that the events (A independent of B) and (C independent of D) are independent? It is precisely this difficulty which makes the Bayesian likelihood calculation of H1 such a mess, and I am curious how frequentist methods approach it.
My apologies for the truly awful typesetting, but this is not the evening on which I learn to integrate tex in lesswrong posts.
Thanks for this post.
The resulting distribution is not elegant (as far as I can tell).
In the binary case, the saturated model can be parameterized by p(S = 0) for S any non-empty subset of { a,b,c,d }. The submodel corresponding to H1 is just one where p({a,b} = 0) = p({a}=0)p({b}=0), and p({c,d} = 0) = p({c}=0)p({d}=0).
For Bayesians, this problem does not involve “unrestricted densities” at all.
I am sorry, Bayesians do not get to decide what my problem is. My problem involves unrestricted densities by definition. I don’t think you get to keep your “fully general formalism” chops if you suddenly start redefining my problem for me.
how does one test for independence between both pairs simultaneously without assuming that the events (A independent of B) and (C independent of D) are independent?
This is a good question. I don’t know a good answer to this that does not involve dealing with the likelihood in some way.
Sorry, I didn’t mean to be dismissive of the general densities requirement. I mean that data always comes with a space, and that restricts the density. We could consider our densities completely general to begin with, but as soon as you give me data to test, I’m going to look at it and say “Ok, this is binary?” or “Ok, these are positive reals?” or something. The space gives the prior model. Without that information, there is no Bayesian answer.
I guess you could say that this isn’t fully general because we don’t have a unique prior for every possible space, which is a very valid point. For the spaces people actually deal with we have priors, and Jaynes would probably argue that any space of practical importance can be constructed as the limit of some discrete space. I’d say it’s not completely general, because we don’t have good ways of deriving the priors when symmetry and maximum entropy are insufficient. The Bayesian formalism will also fail in cases where the priors are non-normalizable, which is basically the formalism saying “Not enough information.”
On the other hand, I would be very surprised to see any other method which works in cases where the Bayesian formalism does not yield an answer. I would expect such methods to rely on additional information which would yield a proper prior.
Regarding that ugly distribution, that parameterization is basically where the constraints came from. Remember that the Dirichlets are distributions on the p’s themselves, so it’s an hierarchical model. So yes, it’s not hard to write down the subspace corresponding to that submodel, but actually doing an update on the meta-level distribution over that subspace is painful.
I mean that data always comes with a space, and that restricts the density.
Sorry I am confused. Say A,B,C,D are in [0,1] segment of the real line. This doesn’t really restrict anything.
For the spaces people actually deal with we have priors.
I deal with this space. I even have a paper in preparation that deals with this space! So do lots of people that worry about learning graphs from data.
On the other hand, I would be very surprised to see any other method which works in cases where the Bayesian formalism does not yield an answer.
People use variations of the FCI algorithm, which from a Bayesian point of view is a bit of a hack. The asymptopia version of FCI assumes a conditional independence oracle, and then tells you what the model is based on what the oracle says. In practice, rather than using an oracle, people do a bunch of hypothesis tests for independence.
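In the discrete case, that battery of tests might look roughly like the sketch below (a crude stand-in for the oracle: marginal tests only, with hypothetical variable names and an arbitrary 0.05 threshold; real FCI also tests conditional independences):

    from itertools import combinations
    import numpy as np
    from scipy.stats import chi2_contingency

    def marginal_independence_tests(data, names, alpha=0.05):
        # data: (n_samples, n_vars) array of small nonnegative integer codes.
        # Chi-square test of marginal independence for each pair of variables.
        results = {}
        for i, j in combinations(range(data.shape[1]), 2):
            table = np.zeros((data[:, i].max() + 1, data[:, j].max() + 1))
            for a, b in zip(data[:, i], data[:, j]):
                table[a, b] += 1
            chi2, p, dof, expected = chi2_contingency(table)
            results[(names[i], names[j])] = p >= alpha  # True = "looks independent"
        return results

    rng = np.random.default_rng(0)
    a = rng.integers(0, 2, 500)
    b = a ^ (rng.random(500) < 0.1).astype(int)  # strongly dependent on a
    c = rng.integers(0, 2, 500)                  # independent of everything
    d = rng.integers(0, 2, 500)
    print(marginal_independence_tests(np.column_stack([a, b, c, d]), list("ABCD")))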
Regarding that ugly distribution
You are being so mean to that poor distribution. You know, H1 forms a curved exponential family if A,B,C,D are discrete. That’s sort of the opposite of ugly. I think it’s beautiful! H1 is an instance of Thomas Richardson’s ancestral graph models, with the graph:
A <-> B <-> C <-> D <-> A
Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals, distributions over R^n, distributions over words starting with the letter k, distributions over Turing machines, distributions over elm trees more than 4 years old in New Hampshire, distributions over bizarre mathematical objects that I can’t even think of… That’s a LOT of prior information. It’s a continuous space, so we can’t apply a maximum entropy argument directly to find our prior. Typically we use the beta prior for [0,1] due to a symmetry argument, but that admittedly is not appropriate in all cases. On the other hand, unless you can find dependencies after running the data through the continuous equivalent of a pseudo-random number generator, you are definitely utilizing SOME additional prior information (e.g. via smoothness assumptions). When the Bayesian formalism does not yield an answer, it’s usually because we don’t have enough prior info to rule out stuff like that.
I think we’re still talking past each other about the distributions. The Bayesian approach to this problem uses an hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]. Perhaps the notation p[A,B,C,D ; X] is more familiar? Anyway, the hypothesis H1 corresponds to a subset of possible values of X. The beautiful distribution you talk about is p[A,B,C,D | X], which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph. Under that parameterization, X would be the lambda vector specifying the exponential model. Unfortunately, p[X] is the ugly one, and that elegant parameterization for p[A,B,C,D | X] will probably make p[X] even uglier.
It is much prettier for DAGs. In that case, we’d have one beta distribution for every possible set of inputs to each variable. X would then be the set of parameters for all those beta distributions. We’d get elegant generative models for numerical integration and life would be sunny and warm. So the simple use case for FCI is amenable to Bayesian methods. Latent variables are still a pain, though. They’re fine in theory, just integrate over them when calculating the posterior, but it gets ugly fast.
Oh, saying A,B,C,D are in [0,1] restricts quite a bit. It eliminates distributions with support over all the reals
???
There are easy to compute bijections from R to [0,1], etc.
The Bayesian approach to this problem uses an hierarchical distribution with two levels: one specifying the distribution p[A,B,C,D | X] in terms of some parameter vector X, and the other specifying the distribution p[X]
Yes, parametric Bayes does this. I am giving you a problem where you can’t write down p(A,B,C,D | X) explicitly and then asking you to solve something frequentists are quite happy solving. Yes I am aware I can do a prior for this in the discrete case. I am sure a paper will come of it eventually.
Latent variables are still a pain, though.
The whole point of things like the beautiful distribution is you don’t have to deal with latent variables. By the way the reason to think about H1 is that it represents all independences over A,B,C,D in this latent variable DAG:
A ← u1 → B ← u2 → C ← u3 → D ← u4 → A
where we marginalize out the ui variables.
which can indeed be written quite elegantly as an exponential family distribution with features for each clique in the graph
I think you might be confusing undirected and bidirected graph models. The former form linear exponential families and can be parameterized via cliques, the latter form curved exponential families, and can be parameterized via connected sets.
This is not true, there are bijections between R and (0,1), but not the closed interval.
Anyway there are more striking examples, for example if you know that A, B, C, D are in a discrete finite set, it restricts your choices quite a lot.
No.
Did you mean to say continuous bijections? Obviously adding two points wouldn’t change the cardinality of an infinite set, but “easy to compute” might change.
You’re right, I meant continuous bijections, as the context was a transformation of a probability distribution.
You are right, apologies.
In frequentism there is no such thing as a ‘probability of the hypothesis’, after all a hypothesis is either true or false and we don’t know which. So as a substitution frequentists consider the other conditional probability, the probability of seeing this data or worse provided the hypothesis is true, where worse must be defined beforehand. I’d say this is a wrong approach, a very very wrong approach.
That’s not a substitution, and it’s the probability of seeing the data provided the hypothesis is false, not true.
It gives the upper bound on the risk that you’re going to believe in a wrong thing if you follow the strategy of “do experiments, believe the hypothesis if confirmed”.
Mostly we want to update all probabilities until they’re very close to 0 or to 1, because the uncertainty leads to loss of expected utility in future decision making.
In frequentism there is no such thing as a ‘probability of the hypothesis’
Yeah, and in Bayesianism, any number between 0 and 1 will do; there’s still no such thing as a specific “probability of the hypothesis”, merely an update applied to an arbitrary starting number.
edit: it’s sort of like arguing that worst-case structural analysis of a building or a bridge is a “very very wrong approach”, and contrasting it with some approach where you make up priors about the quality of concrete, and end up shaving a very very small percent off the construction cost, while building a weaker bridge which bites you in the ass eventually anyway when something unexpected happens to it.