This is very similar to the EDA I would run on this sort of data—marginals, correlations, basic tests. A few things I would do differently:
I generally think in terms of standard devs/standard errors rather than p-vals/log odds updates, at least for EDA purposes. I then think about how many std devs I need for significance by eyeballing normality and falling back to Chebyshev’s inequality if needed. For large datasets with highly significant results, that’s plenty. Otherwise, I need to think harder about what model actually makes sense.
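For a concrete sense of the Chebyshev fallback, here’s a minimal Python sketch (illustrative only, not tied to the survey data) comparing the two-sided threshold under normality with the distribution-free one from Chebyshev’s inequality, P(|X − μ| ≥ kσ) ≤ 1/k²:

```python
import math
from statistics import NormalDist

def sds_needed_normal(alpha: float) -> float:
    """Two-sided significance threshold in std devs, assuming normality."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def sds_needed_chebyshev(alpha: float) -> float:
    """Distribution-free threshold: Chebyshev gives P(|X-mu| >= k*sigma) <= 1/k^2,
    so set 1/k^2 = alpha and solve for k."""
    return math.sqrt(1 / alpha)

print(sds_needed_normal(0.05))     # ~1.96
print(sds_needed_chebyshev(0.05))  # ~4.47
```

At α = 0.05 the normal threshold is about 1.96 std devs while Chebyshev demands about 4.47, which is why the fallback only bites when results aren’t already highly significant.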
That said, approximations of normality are still valid for your correlations—the raw data isn’t normal, but the correlation is a *sum* over (presumably) IID survey respondents. Same with group differences: the *average* can be modeled as normal, even if the data isn’t. Central limit theorem and all that. (I don’t use R, though, so it’s possible their tests rely on additional assumptions.)
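A quick simulation of that point (synthetic exponential draws standing in for non-normal survey responses; nothing here comes from the actual dataset): the marginals are badly skewed, yet the sampling distribution of the correlation comes out roughly normal, with about 95% of draws landing within two std devs of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 2000

# Non-normal marginals (exponential), but the sample correlation is built
# from sums over IID respondents, so its sampling distribution is ~normal.
rs = []
for _ in range(trials):
    x = rng.exponential(size=n)
    y = 0.3 * x + rng.exponential(size=n)
    rs.append(np.corrcoef(x, y)[0, 1])
rs = np.array(rs)

# Rough normality check: ~95% of draws should sit within 2 SDs of the mean.
frac = np.mean(np.abs(rs - rs.mean()) < 2 * rs.std())
print(round(frac, 3))
```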
I would look for clusters. With data like this, I’d wonder if there are separate populations filling in the survey. I’d run SVD and look at plots/histograms of the larger components to see if there’s any visually-obvious clustering.
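A sketch of what that SVD pass might look like, using a hypothetical respondents-by-questions matrix with two planted sub-populations (a stand-in for the real survey data; `numpy` assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the survey matrix: rows = respondents, cols = questions.
# Two planted sub-populations differ in their mean response pattern.
group_a = rng.normal(loc=0.0, scale=1.0, size=(200, 10))
group_b = rng.normal(loc=2.0, scale=1.0, size=(200, 10))
X = np.vstack([group_a, group_b])

# Center columns, then SVD; U[:, k] * S[k] are respondent scores on component k.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, 0] * S[0]

# With real data you'd histogram `scores` (and the scores on components 2, 3, ...)
# and look for visually obvious multimodality. Here the planted clusters
# separate cleanly on the first component.
print(np.histogram(scores, bins=5)[0])
```

On real data you’d also scatter-plot pairs of the top components; a bimodal histogram or a split scatter on any of them is the tell.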
Is the dataset publicly available? If so, I might do a write-up of how I’d analyze it. Would be interesting to have a bunch of different people try it and compare notes. (I mean, it would probably eventually devolve into a dick-measuring competition, but even then it would be interesting.)
Responses to your differences:
1. I hear you, but R has enough fully-automated testing tools that it’s much simpler for me to just run the appropriate test and see what pops out the other end. (Also, THANK YOU for mentioning Chebyshev, I can’t believe I’d never heard of that inequality before and it’s EXACTLY my kind of thing)
2. I think (?) you’re operating on the wrong level of meta here. A t-test uses both the mean and the variance of the distribution(s) you feed it, and that’s true whether or not it’s being used to test a correlation. The CLT will not save us, because the single (admittedly gaussian-distributed) datapoint representing the mean has a variance of zero. (Something I could have done—in fact, something I remember doing much earlier in my career, back when I was better at identifying problems than finding expedient solutions—was to group not-necessarily-normal datapoints together into batches of about twenty, take the averages per-batch, and then t-test the lists of those: it was a ridiculous waste of statistical power, but it was valid!)
3. That’s an excellent idea. My excuse for not doing that is that I was prioritising pointedly-not-getting-things-wrong over actually-getting-things-right; my reason is that I just didn’t think of it and I’m too lazy (and data-purist) to go back and try that now.
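The batching trick described above (group not-necessarily-normal datapoints into batches of about twenty, average per batch, then t-test the lists of batch means) is easy to sketch. This is a Python illustration with synthetic exponential data standing in for the real measurements:

```python
import math
import random
import statistics

def batch_means(xs, batch_size=20):
    """Average consecutive batches; by the CLT each batch mean is ~normal."""
    n_full = len(xs) // batch_size
    return [statistics.mean(xs[i * batch_size:(i + 1) * batch_size])
            for i in range(n_full)]

def two_sample_t(a, b):
    """Welch's t statistic on two lists (here: lists of batch means)."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

random.seed(0)
# Deliberately non-normal raw data: exponential samples with different means.
xs = [random.expovariate(1.0) for _ in range(400)]  # mean ~1.0
ys = [random.expovariate(0.5) for _ in range(400)]  # mean ~2.0
t = two_sample_t(batch_means(xs), batch_means(ys))
print(t)
```

The cost is exactly the waste of power mentioned: 400 points per group collapse to 20 batch means each, but the t-test’s normality assumption now holds to a good approximation.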
The dataset is, at time of writing, still up at https://gist.github.com/ncase/74ae97cb74893a0c540274b44f550503. I’d love to see what you throw at it.