cupholder

Karma: 839

cupholder 22 Feb 2010 2:57 UTC
10 points
on: Case study: abuse of frequentist statistics
I’m not seeing why what you call “the real WTF” is evidence of a problem with frequentist statistics. The fact that the hypothesis test would have given a statistically insignificant p-value whatever the actual 6 data points were just indicates that whatever the population distributions, 6 data points are simply not enough to disconfirm the null hypothesis. In fact you can see this if you look at Mann & Whitney’s original paper! (See the n=3 subtable in table I, p. 52.)

I can picture someone counterarguing that this is not immediately obvious from the details of the statistical test, but I would hope that any competent statistician, frequentist or not, would be sceptical of a nonparametric comparison of means for samples of size 3!

cupholder 22 Feb 2010 3:33 UTC
3 points
in reply to: komponisto’s comment on: Case study: abuse of frequentist statistics

Now, if that is a fair summary, then this big controversy between frequentists and Bayesians must mean that there is a sizable collection of people who think that the above procedure is a better way of obtaining knowledge than performing Bayesian updates.

Not necessarily better. Just more convenient for the thumbs up/thumbs down way of looking at evidence that scientists tend to like.

But for the life of me, I can’t see how anyone could possibly think that. I mean, not only is the “p-value” threshold arbitrary,

It’s a convention. The point is to have a pre-agreed, low significance level so that testers can’t screw with the result of a test by arbitrary jacking the significance level up (if they want to reject a hypothesis) or turning it down (if they don’t). The significance level has to be low to minimize the risk of a type I error.

not only are we depriving ourselves of valuable information by “accepting” or “not accepting” a hypothesis rather than quantifying our certainty level,

The certainty level is effectively communicated via the significance level and p-value itself. (And the use of a reject vs. don’t reject dichotomy can be desirable if one wishes to decide between performing some action and not performing it based on some data.)

but...what about P(E|H)?? (Not to mention P(H).) To me, it seems blatantly obvious that an epistemology (and that’s what it is) like the above is a recipe for disaster—specifically in the form of accumulated errors over time.

A frequentist can deal in likelihoods, for example by doing hypothesis tests of likelihood ratios. As for priors, a frequentist encapsulates them in parametric and sampling assumptions about the data. A Bayesian might give a low weight to a positive result from a parapsychology study because of their “low priors”, but a frequentist might complain about sampling procedures or cherrypicking being more likely than a true positive. As I see it, the two say essentially the same thing; the frequentist is just being more specific than the Bayesian.

cupholder 22 Feb 2010 3:49 UTC
14 points
in reply to: Cyan’s comment on: Case study: abuse of frequentist statistics

Check out the title: abuse of frequentist statistics. Yes, at the end, I argue from a Bayesian perspective, but you don’t have to be a Bayesian to see the structural problems with frequentist statistics as currently taught to and practiced by working scientists.

I agree that frequentist statistics are often poorly taught and understood, and that this holds however you like to do your statistics. Still, the main post feels to me like a sales pitch for Bayes brand chainsaws that’s trying to scare me off Neyman-Pearson chainsaws by pointing out how often people using Neyman-Pearson chainsaws accidentally cut off a limb with them. (I am aware that I may be the only reader who feels this way about the post.)

(Does the fact that when I saw the sample size the word “underpowered” instantly jumped into my head count as evidence that I am competent?)

Yes, but it is not sufficient evidence to reject the null hypothesis of incompetence at the 0.05 significance level. (I keed, I keed.)
What links here?
- Cyan's comment on Open Thread: March 2010 by AdeleneDawner (1 Mar 2010 14:04 UTC; 24 points)

cupholder 22 Feb 2010 4:32 UTC
5 points
in reply to: Cyan’s comment on: Case study: abuse of frequentist statistics

No. P-values are not equivalent when they are calculated using different statistics, or even the same statistic but a different sample size. On the latter point see Royall, 1986.

Sorry. You are quite right, and I was sloppy. I had in mind the implicit idea that holding the choices of statistical test and data collection procedure constant, different p-values suggest how strongly one should reject the null hypothesis, and I should have made that explicit. It is absolutely true that if I just ask someone, “Test A gave me p = 0.008 and Test B gave me p = 0.4, which test’s null hypothesis is worse off?”, the correct answer is “how should I know?”

I’d say the frequentist is using Bayesian reasoning informally; Jaynes discusses this exact problem from a Bayesian perspective at the beginning of Chapter 5 of his magnum opus.

Yep. I think this is an example of the frequentist encapsulating what a Bayesian would call priors in their sampling assumptions.

cupholder 22 Feb 2010 4:40 UTC
3 points
in reply to: Cyan’s comment on: Case study: abuse of frequentist statistics
That’s true, and having been reminded of that, I think I may have been unduly pedantic about a fine detail at the expense of the main point.

cupholder 28 Feb 2010 13:28 UTC
8 points
in reply to: bill’s comment on: What is Bayesianism?
Heh, that’s a cheeky example. To explain why it’s cheeky, I have to briefly run through it, which I’ll do here (using Jaynes’s symbols so whoever clicked through and has pages 22-24 open can directly compare my summary with Jaynes’s exposition).

Call N the sample size and θ the minimum possible widget lifetime (what bill calls X). Jaynes first builds a frequentist confidence interval around θ by defining the unbiased estimator θ∗, which is the observations’ mean minus one. (Subtracting one accounts for the sample mean being >θ.) θ∗’s probability distribution turns out to be y^(N-1) exp(-Ny), where y = θ∗ - θ + 1. Note that y is essentially a measure of how far our estimator θ∗ is from the true θ, so Jaynes now has a pdf for that. Jaynes integrates that pdf to get y’s cdf, which he calls F(y). He then makes the 90% CI by computing [y1, y2] such that F(y2) - F(y1) = 0.9. That gives [0.1736, 1.8259]. Substituting in N and θ∗ for the sample and a little algebra (to get a CI corresponding to θ∗ rather than y) gives his θ CI of [12.1471, 13.8264].

For the Bayesian CI, Jaynes takes a constant prior, then jumps straight to the posterior being N exp(N(θ - x1)), where x1′s the smallest lifetime in the sample (12 in this case). He then comes up with the smallest interval that encompasses 90% of the posterior probability, and it turns out to be [11.23, 12].

Jaynes rightly observes that the Bayesian CI accords with common sense, and the frequentist CI does not. This comparison is what feels cheeky to me.

Why? Because Jaynes has used different estimators for the two methods [edit: I had previously written here that Jaynes implicitly used different estimators, but this is actually false; when he discusses the example subsequently (see p. 25 of the PDF) he fleshes out this point in terms of sufficient v. non-sufficient statistics.]. For the Bayesian CI, Jaynes effectively uses the minimum lifetime as his estimator for θ (by defining the likelihood to be solely a function of the smallest observation, instead of all of them), but for the frequentist CI, he explicitly uses the mean lifetime minus 1. If Jaynes-as-frequentist had happened to use the maximum likelihood estimator—which turns out to be the minimum lifetime here—instead of an arbitrary unbiased estimator he would’ve gotten precisely the same result as Jaynes-as-Bayesian.

So it seems to me that the exercise just demonstrates that Bayesianism-done-slyly outperformed frequentism-done-mindlessly. I can imagine that if I had tried to do the same exercise from scratch, I would have ended up faux-proving the reverse: that the Bayesian CI was dumber than the frequentist’s. I would’ve just picked up a boring, old-fashioned, not especially Bayesian reference book to look up the MLE, and used its sampling distribution to get my frequentist CI: that would’ve given me the common sense CI [11.23, 12]. Then I’d construct the Bayesian CI by mechanically defining the likelihood as the product of the individual observations’ likelihoods. That last step, I am pretty sure but cannot immediately prove, would give me a crappy Bayesian CI like [12.1471, 13.8264], if not that very interval.

Ultimately, at least in this case, I reckon your choice of estimator is far more important than whether you have a portrait of Bayes or Neyman on your wall.

[Edited to replace my asterisks with ∗ so I don’t mess up the formatting.]

cupholder 28 Feb 2010 16:09 UTC
0 points
in reply to: Cyan’s comment on: What is Bayesianism?
You got me: I didn’t read the what-went-wrong subsection that follows the example. (In my defence, I did start reading it, but rolled my eyes and stopped when I got to the claim that “there must be a very basic fallacy in the reasoning underlying the principle of confidence intervals”.)

I suspect I’m not the only one, though, so hopefully my explanation will catch some of the eyeballs that didn’t read Jaynes’s own post-mortem.

[Edit to add: you’re almost certainly right about the real point of the story, but I think my reply was fair given the spirit in which it was presented here, i.e. as a frequentism-v.-Bayesian thing rather than an orthodox-statisticians-are-taught-badly thing.]

cupholder 28 Feb 2010 19:03 UTC
0 points
in reply to: Cyan’s comment on: What is Bayesianism?

Independently reproducing Jaynes’s analysis is excellent, but calling him “cheeky” for “implicitly us[ing] different estimators” is not fair given that he’s explicit on this point.

I was wrong to say that Jaynes implicitly used different estimators for the two methods. After the example he does mention it, a fact I missed due to skipping most of the post-mortem. I’ll edit my post higher up to fix that error. (That said, at the risk of being pedantic, I did take care to avoid calling Jaynes-the-person cheeky. I called his example cheeky, as well as his comparison of the frequentist CI to the Bayesian CI, kinda.)

It’s a frequentism-v.-Bayesian thing to the extent that correct coverage is considered a sufficient condition for good frequentist statistical inference. This is the fallacy that you rolled your eyes at; the room full of shocked frequentists shows that it wasn’t a strawman at the time. [ETA: This isn’t quite right. The “v.-Bayesian” part comes in when correct coverage is considered a necessary condition, not a sufficient condition.]

When I read Jaynes’s fallacy claim, I didn’t interpret it as saying that treating coverage as necessary/sufficient was fallacious; I read it as arguing that the use of confidence intervals in general was fallacious. That was made me roll my eyes. [Edit to clarify: that is, I was rolling my eyes at what I felt was a strawman, but a different one to the one you have in mind.] Having read his post-mortem fully and your reply, I think my initial, eye-roll-inducing interpretation was incorrect, though it was reasonable on first read-through given the context in which the “fallacy” statement appeared.

cupholder 17 Mar 2010 2:36 UTC
7 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism
You appear to be referring to Nisbett’s paragraph starting with

Hedges and Nowell (1998) found improvement on almost all tests for African American 12th graders compared with other 12th graders over the period 1965-1994.

A few sentences below that Nisbett refers to NAEP data to say that the reading score gap could be gone in 25 years and the science score gap in 75 years, if trends continue. [ETA: this is the ‘largest study’ that Nisbett cites. I’m sad Nisbett didn’t give a more specific citation for it.]

The page you link appears to have data on the NAEP tests, but only for the mathematics tests. Clicking on the ‘White-Black Gap’ button, and then on the ‘Age 17’ tab (as Nisbett refers to 12th graders, so I am guessing that is what he and you are talking about...?) shows
- a 1973 gap of 40 points
- a 1982 gap of 32 points
- a 1986 gap of 29 points
- a 1990 gap of 21 points
- then some fluctuations between 26 and 31 points until the most recent survey (2008), which has a 26 point gap
The data linked do not appear to bear strongly on Nisbett’s claims about the NAEP data (because Nisbett refers to the reading and science NAEP scores, not math), and I am also having difficulty seeing the ‘small narrowing of the black/white gap between 1973 and 1982 and a fairly consistent gap thereafter.’ in the data linked.

All in all, I am having difficulty substantiating your claim that Nisbett’s claim is unsubstantiated by the data. I suspect either I am not interpreting your comment correctly, or the link in it happens to point to a data set other than the one you intended. Could you clarify?

(About the bigger question of whether black-white IQ differences have narrowed recently, it may be informative to read William Dickens and James Flynn’s 2006 paper, which takes IQ test norming data and shows a narrowing of the IQ gap between 1972 and 2002. (Rushton and Jensen disagreed with the conclusions of that paper, but I find Dickens and Flynn’s rebuttal convincing.)

cupholder 17 Mar 2010 14:44 UTC
5 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism

The one graph I looked at at random doesn’t seem to support the claim that the gap (generally speaking) is narrowing and headed towards disappearing. Agreed?

When I see your random graph, I see the gap halving[!] from 1973 to 1990, widening through the 1990s, and maybe gradually shrunking since then. I see contradictory trends over the past 40 years, but it’s more likely than not that the gap has resumed narrowing. So I’m not sure I do agree with you.

Since you write ‘generally speaking’ I guess you might be asking about the general trend as a whole from 1973 to now. I reckon that’s an overall shrinking trend too.

To check my gut feeling more systematically, I did a quick regression of the score gap against year. (Not the best way to do it, but it beats eyeballing.) That gets me a .35 or .36 point shrinking per year depending on which assessment format I use for 2004. At that rate, the current gap (26 points in ’08) would disappear in 70 to 75 years.

That’s the same time period Nisbett gives for the disappearance of the science score gap, which I think is evidence against Nisbett ‘cherry-picking’ - if he cut out data because it had gaps that closed too slowly for his hypothesis, he would’ve left out the science data as well as the math data.

Summing up, I think I fundamentally disagree with you on the most likely interpretation of your graph.

cupholder 18 Mar 2010 2:08 UTC
5 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism
Aha, I misunderstood which chart you had in mind. I thought that your link was intended to go to the data for 17 year olds, but that you were unable to link it directly because the page used Javascript to flip between the charts for different ages. I see now I’m wrong about that—one can link directly to the chart for each age, and it sounds like you were pointing to the age 9 data.

So I’ll try this again with the 9 year olds. I’ve taken the liberty of looking at the black-white gap graph instead of the scale score graph so I don’t have to do any mental arithmetic to get the gap size at each testing. Looks to me like the gap consistently narrowed from 1973 to 1986, and has fluctuated from 1986 so it’s sometimes wider, sometimes thinner, but no overall trend since then.

Regressing gap size on year like I did before gives a shrinking of .24 or .25 points per year. So the picture is more mixed than for the older kids: there’s an overall shrinking, but it’s only two-thirds what you get for 17 year olds, and the trend looks like it’s stalled since the late 80s.

Still, I am not sure that this means Nisbett is wrong. Looking at the bit of Nisbett you quote yourself downthread, Nisbett does not seem to say anything about the math scores, which means looking at the math scores would not tell us whether Nisbett is wrong or right.

It is possible that Nisbett cherry-picked by ignoring the math data, but I think a .25 point per year narrowing is still evidence against that idea. At a quarter point per year, the math gap would disappear in about a century, which isn’t much longer than the 75 years Nisbett suggests for science.

cupholder 18 Mar 2010 2:25 UTC
11 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism

That’s only if you feel you need to rely on scientific studies to reach conclusions. Some things don’t require such a study.

Yes, but you have to be super careful when deciding which things need scientific studies.

A few years ago I would’ve said women were so much more chatty than men—and that the difference in chattiness was so obvious—that it would be a waste of time to check it out scientifically. But sometimes, when you check things out systematically, you’re surprised. I think the argument about blacks, whites and IQ is a bit like that, although that argument is more about the cause of the differences and not their mere existence.

cupholder 18 Mar 2010 4:18 UTC
1 point
in reply to: NancyLebovitz’s comment on: Undiscriminating Skepticism
I imagine it’s less widespread a belief than before the 80s, but it’s just one of those things you get by osmosis from the broader culture when you’re young. It’s part of the stereotypes there are about the sexes: women can’t drive, men won’t ask for directions when they’re lost, blah blah blah.

cupholder 18 Mar 2010 23:18 UTC
0 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism
When I suggested being ‘super careful’ I meant being super careful about deciding which things are so obvious as to not need systematic debate and study in the first place, not about deciding how skeptical to be of certain ‘sides’ or conclusions in a debate.

cupholder 18 Mar 2010 23:28 UTC
4 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism

Of course there are ways to interpret the graph to argue that the gap is narrowing and on track to disappear, but if you look at it and use your common sense, it’s just not a reasonable conclusion.

You put more trust in your common sense than I do. I try to avoid depending exclusively on what my common sense infers from eyeballing noisy time series—that way lies ’global warming stopped in 1998’esque error.

I find your preferred interpretation reasonable, but I don’t see why it would be unreasonable to look at the entire data and see a net narrowing. (Especially if we lacked the 2008 data, as Nisbett did.)

cupholder 19 Mar 2010 1:05 UTC
9 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism
I’ll try and clarify with the non-race and IQ related example that first put the idea into my head: gravity. The idea of things falling to the floor is so obvious to me, and agrees so well with my common sense, that I would not even bother to debate somebody who wanted to argue that things don’t fall to the floor. That’s the behaviour I’m saying it’s a good idea to be super careful about: rejecting challenges to your existing view out of hand.

Stepping back to the race and IQ argument, I’m saying that I would exercise a lot of care before I put the argument into the ‘no need to even bother debating it’ box. Having entered into the debate, though, I would be content to apply my ordinary standards of evidence to the different ‘sides’ in the debate. I mean the ‘super careful’ warning to apply pre-debate, not during the debate.

cupholder 19 Mar 2010 1:36 UTC
4 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism

Here’s a thought experiment: You show the graph I linked to to 10 statisticians, except you replace the labels with something less politically charged. For example, the price of winter wheat versus the price of summer wheat. And you ask them to interpret the graph as far as long term trends go. I’m pretty confident that 10 out of 10 would interpret the graph the same way I did.

I am far less confident.

Ditto for global surface temperatures. Take the temperature label off the graph and tell people it’s the dollar to yen exchange rate. I bet 10 out of 10 statisticians will say the rate is basically flat for the last 10 years.

I bet it would depend on exactly which data set you gave them. Do you give them data for the past 10 years, data since 1998, the data since they started measuring temperatures with satellites as well as thermometers, or the longest-running data set, which runs from 1850 onwards? If you just give them the last decade of data, they might well just write it off as flat and noisy, but if you let them judge the recent numbers in the context of the entire time series, they might recognize them as flat-looking fuzz obscuring an ongoing linear trend.

If the choice is between trusting your common sense and trusting someone with an agenda, I would say go with your common sense.

That sounds nice, but I don’t know how practical that would turn out to be, in this case or in general. In this particular case, how can I even tell with certainty whether you have ‘an agenda’ or not? And what if the key participants in a debate all have some agenda? It’s very possible that Nisbett has a ‘politically correct’ (not that I like the phrase, but I can’t think of a better way of putting it) agenda, and that Rushton and Jensen have a ‘politically incorrect’ agenda. How do I know, and what do I do if they do? And so on.

cupholder 19 Mar 2010 22:42 UTC
4 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism

How can you tell anything with certainty? The fact is that you can’t. Respectfully, it seems to me you are playing the “I’m such a skeptic” game.

Sorry. I was being sloppy in my earlier comment, and using ‘certainty’ as a shorthand for ‘certainty enough for me to label you as Having An Agenda, and therefore to reject your interpretation of the data as Tainted With An Agenda.’ It is of course true that you can’t tell anything inductive with cast-iron 100% certainty, but what I’m getting at is the question of how to get to what you or I would practically treat as certainty (like if I put a 95% probability on someone Having An Agenda).

Let me rephrase: in this particular case, how can I even tell whether you have ‘an agenda’ with sufficient certainty to disregard whatever you say about the data, and retreat to my own common sense gut feeling?

Let me ask you this: Do you seriously doubt that Nisbett has an agenda?

Do I doubt he has an agenda in the sense that he believes he’s right? A tiny bit, but only in the sense that I am never completely sure of another person’s motivation for stating something.

Do I doubt he has an agenda in the sense that he wants to convince other people of what he believes? Not really.

Do I doubt he has an agenda in the sense that he has an emotional investment in the argument as well as rational considerations? Only a little...but then again, who doesn’t get emotionally invested in arguments?

Do I doubt he has an agenda in the sense that he has political motivations for his article as well as self-centered emotional and rational ones? Quite a lot, actually. I don’t think I could reliably tell Nisbett’s emotional motivations apart from those that spring from his political agenda (whatever that is—Nisbett sounds like a leftist to me, but how the hell do I really know? There were rightists who crapped on The Bell Curve too.) Does it even make sense to distinguish the two? I’m not sure. (I suddenly feel that these are good questions to think about. Thank you for prodding me into thinking of them.)

Also, for whatever it’s worth, I am just as sure that Rushton and Jensen have ‘an agenda,’ however you want to define that, as Nisbett does. Do I throw all their papers out and just go with my common sense?

To clarify, this doesn’t mean I can’t get behind the idea of being alert to other people’s biases on some subject, but I’m not willing to push that to the point of a dichotomy between my common sense vs. someone with an agenda. Taking the global warming example, I’m sure many climate scientists have ‘an agenda,’ but I’d still tend to accept their consensus interpretation of the data than my own common sense where the two differ, and I think that’s reasonable if I don’t have time to dig through all of the research myself.

I would give them the data since the 1970s when sattelite measurement became possible.

In that case I think I’m roughly 90% confident that fewer than ’10 out of 10 statisticians will say the rate is basically flat for the last 10 years’. I am interpreting ‘the rate is flat here’ to mean that the net temperature trend is flat over time, as I believe we’re talking about whether global warming is continuing and not whether global warming is accelerating. (Thought process here: I reckon a randomly selected statistician has at most a 4 in 5 chance of deciding that temperatures have been ‘basically flat’ for the last 10 years’ based on the satellite data. Then the chance of 10 random statisticians all saying temperatures have been flat is 11%, so an 89% chance of at least one of them dissenting.)

cupholder 19 Mar 2010 22:49 UTC
5 points
in reply to: brazil84’s comment on: Undiscriminating Skepticism
I’m still waiting for y’all to agree on what God is so I can decide. Everyone seems to have a different idea of the bugger. In the meantime I’ll carry on spending brain energy on less fuzzy things, like race and IQ and global warming.
What links here?
- cupholder's comment on Undiscriminating Skepticism by Eliezer Yudkowsky (24 Mar 2010 18:48 UTC; 7 points)

cupholder 19 Mar 2010 22:56 UTC
4 points
in reply to: RobinZ’s comment on: Undiscriminating Skepticism

Every 10-year trendline in cupholder’s data was increasing.

A quick clarification: for each of the data links I posted there, the trendline is calculated based on all of the data that’s shown, i.e. for the post-1998 data the trendline is based on the last twelve years, for the post-1970s data the trendline is based on all of the post-1970s data, and so on. In other words, only the data for the last 10 years of data really have a 10-year trendline.

[ETA: Unless you mean you calculated 10-year trendlines for each data set yourself, in which case feel free to disregard this.]