Yes, but I don’t understand from the details presented why that follows. Why couldn’t result X contain the new method ranking below the standard method?
What the OP was saying was that this test depends only on the rankings. So as a sanity check, he calculated what the p-values would have been for all possible rankings and found that none of them would be below .05.
In other words, it was a mathematical impossibility for this test, treated this way, to result in a rejection of the null hypothesis. Given this many data points, analyzed using this method, no possible outcome could produce a rejection.
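To make that sanity check concrete: here’s a minimal sketch of the same kind of enumeration, assuming a rank-sum (Mann-Whitney/Wilcoxon) style test and made-up group sizes of 2 and 4. The thread doesn’t name the exact test or sample sizes, so every specific below is illustrative, not a reconstruction of the paper’s analysis.

```python
from itertools import combinations

# Hypothetical re-creation of the sanity check: for a rank-based
# two-sample test with tiny samples, enumerate every possible ranking
# and compute its exact two-sided p-value under the null.
m, n = 2, 4                        # assumed group sizes (illustrative)
ranks = range(1, m + n + 1)        # combined ranks 1..6

# Under the null, every assignment of ranks to group A is equally likely.
assignments = list(combinations(ranks, m))
rank_sums = [sum(a) for a in assignments]
mean = m * (m + n + 1) / 2         # expected rank sum for group A

def exact_two_sided_p(observed_sum):
    """Null probability of a rank sum at least as far from the mean."""
    dev = abs(observed_sum - mean)
    return sum(abs(s - mean) >= dev for s in rank_sums) / len(assignments)

# The p-value for every outcome the experiment could possibly produce:
print(sorted({exact_two_sided_p(s) for s in rank_sums}))
# [0.133..., 0.266..., 0.533..., 0.8, 1.0] -- nothing below .05, so
# rejection at the .05 level is mathematically impossible here.
```

The same enumeration idea works for whatever rank test the paper actually used; only the combinatorics change.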
(in other words, it was a “heads I win, tails you lose” situation)
Okay, I think that makes sense. Let me put it into my own words:
The test is guaranteed to come out statistically insignificant merely by virtue of cutting up the outcome space into pieces, each of which has at least a 5% chance of happening. And further, because the null hypothesis has been (arbitrarily) defined to be “the two methods are the same”, statistical insignificance means a favorable result.
Does that about cover it? If so, that’s pretty bad.
That part isn’t right, but the rest is.
So I should have said “the nine outcomes they considered all had at least a 5% chance of happening”?
The p-value is the probability of getting a result “at least this extreme” given the null hypothesis, where “extreme” means “deviating from the null hypothesis”, however that’s defined. So, the test cut the outcome space into pieces, the most extreme of which had at least a 5% chance of happening.
I think.
… under the null hypothesis. I actually forgot this detail when replying to komponisto.
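A tiny sketch of that definition in code, with a made-up null distribution for a discrete test statistic (the numbers are purely illustrative):

```python
# With a discrete test statistic, the p-value of an outcome is the total
# null probability of everything at least as extreme, so the smallest
# attainable p-value is the null probability of the single most extreme
# outcome. This pmf is made up purely for illustration.
null_pmf = {0: 0.40, 1: 0.30, 2: 0.20, 3: 0.10}   # Pr[T = t | H0]

def p_value(t_obs):
    return sum(p for t, p in null_pmf.items() if t >= t_obs)

min_p = p_value(max(null_pmf))   # = Pr[T = t_max | H0] = 0.10
print(min_p)  # if this is >= .05, rejection at the .05 level is impossible
```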
Wait… actually it may be even worse than that. I’m not even sure it’s cleanly partitioning the outcome space. 1/20 = .05, so if some outcomes have p-values above .05, then other outcomes would have to have p-values below .05, right?
So the calculation behind the final result doesn’t even really partition the outcomes properly if some of the p-values can be greater than .05 and none less than .05.
EDIT: so yeah, it’s not just cutting the outcome space into pieces corresponding to rankings, but mushing some of those pieces together (at best).
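A made-up illustration of that “mushing together”: suppose nine equally likely rankings under the null (the count nine comes from the comment above; the symmetric two-sided tails are my assumption). The candidate p-values are nested tail sums, not the probabilities of a partition, so nothing forces any of them below .05:

```python
# Nine equally likely rankings under the null, ordered from most to
# least extreme; purely hypothetical numbers.
probs = [1/9] * 9

# Two-sided p-value of the k-th most extreme pair of outcomes: the total
# null probability of everything at least that extreme (a nested tail).
p_values = [sum(probs[:2 * (k + 1)]) for k in range(4)] + [1.0]
print([round(p, 3) for p in p_values])   # [0.222, 0.444, 0.667, 0.889, 1.0]
# Each individual ranking has probability 1/9 > .05, and every attainable
# p-value is larger still -- the pieces sum to 1, but the p-values don't.
```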
That’s more or less my understanding of the situation.
And yes… that is indeed pretty bad. :)
More of a double-headed coin situation, actually.
Well… different ranking outcomes (different sides of the coin) are possible. It’s just that the interpretation will always be “don’t reject the null hypothesis”. But yeah. :)
Either way, my overall reaction to your post is “yuck” (not to your post itself! That I upvoted. I mean the whole situation: that a relatively standard statistical test could allow this sort of madness. I know frequentist stats isn’t the Bayesian way, but that relatively standard methods in it can be this pathological does not at all give me warm fuzzies.)
I concur with your “yuck”, but would phrase it slightly differently. The specific type of statistical test applied, plus the number of samples taken, has the effect, as Cyan said, of guaranteeing the result the authors wanted. Note that, more generally, the combination is suspicious in and of itself: the authors phrased their analysis so that accepting the null hypothesis was the result they wanted, and then chose a nonparametric statistical test, which is always weaker than a parametric one. Even if they had taken enough samples that rejecting the null hypothesis was theoretically possible, wanting the null result and choosing a nonparametric test would still make me suspicious. As Cyan said, nonparametric tests throw away most of the information.
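For what it’s worth, a rough simulation of that last point, assuming normal data and arbitrary group sizes of 2 and 4 (none of this comes from the paper under discussion): with the same tiny samples, a t-test can sometimes reject while an exact rank test never can.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, n, effect, trials = 2, 4, 2.0, 5000   # all arbitrary choices

t_rej = rank_rej = 0
for _ in range(trials):
    a = rng.normal(0.0, 1.0, m)          # "new method" scores
    b = rng.normal(effect, 1.0, n)       # "standard method" scores
    if stats.ttest_ind(a, b).pvalue < .05:
        t_rej += 1
    if stats.mannwhitneyu(a, b, alternative="two-sided").pvalue < .05:
        rank_rej += 1

print(t_rej / trials)     # nonzero: the t-test uses the magnitudes
print(rank_rej / trials)  # 0.0: the exact rank test's smallest possible
                          # two-sided p at these sizes is 2/15 > .05
```

(The rank test’s floor of 2/15 at these sizes follows from the same enumeration sketched earlier in the thread.)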
It’s not the fault of the method if someone abuses it.
In general, no. However, if a method is more easily abused than others, then that is something worth pointing out.