Most statistical tests come with many assumptions. Pearson correlations technically assume: (1) continuous variables, (2) a linear relationship, (3) bivariate normality, meaning that the joint distribution forms an elliptical cloud, (4) homoscedasticity, meaning that the variance of each variable is stable as the other variable changes, and (5) no extreme outliers. Evaluation of statistical significance more generally assumes (6) independence among observations.
You should be more precise here! The Pearson correlation between two variables $X$ and $Y$ is defined, and makes sense as a measure of linear association, provided only that both variables have finite variance. The most commonly used tests and confidence intervals (Fisher's transform, the Pearson t-test) are valid under bivariate normality, which implies that the relationship between $X$ and $Y$ is linear, that the errors are homoskedastic, that there are no extreme outliers, and that the density contours are elliptical. As I'm sure you know, though, normality is not equivalent to the distribution having elliptical contours.
That said, these normality-based tests are not robust to general non-normality. You might want to have a look at the Hawkins paper below to see why. Essentially, the asymptotic variance of the correlation coefficient depends on the normalized mixed fourth-order moments of the data-generating process, and if these are large or small the resulting tests or confidence intervals can be arbitrarily far off in either direction. If you want simulations (they are not strictly needed, because the math is so clean), take e.g. a multivariate t-distribution as the data-generating process: the normality-based confidence intervals will have very poor coverage, and Hawkins' formula tells you exactly why.
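In case it's useful, here is a minimal simulation sketch of that point in Python/NumPy. The specific choices (bivariate t with 5 degrees of freedom, true correlation 0.5, n = 100, nominal 95% Fisher-z intervals) are mine, purely for illustration:

```python
# Coverage of the normality-based Fisher-z confidence interval when the data
# actually come from a bivariate t-distribution (heavy tails, elliptical contours).
# All numbers below (rho, df, n, reps) are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
rho, df, n, reps = 0.5, 5, 100, 5000

chol = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))

covered = 0
for _ in range(reps):
    # Bivariate t via a normal scale mixture: Z / sqrt(W/df), W ~ chi^2_df,
    # one W per observation shared by both coordinates; the correlation stays rho.
    z = rng.standard_normal((n, 2)) @ chol.T
    w = rng.chisquare(df, size=(n, 1)) / df
    x = z / np.sqrt(w)

    r = np.corrcoef(x, rowvar=False)[0, 1]
    zr = np.arctanh(r)                      # Fisher transform
    half = 1.96 / np.sqrt(n - 3)            # normal-theory standard error
    lo, hi = np.tanh(zr - half), np.tanh(zr + half)
    covered += (lo <= rho <= hi)

print(f"empirical coverage of the nominal 95% interval: {covered / reps:.3f}")
```

With heavy tails like these the empirical coverage comes out well below the nominal 95%, which is exactly what the fourth-moment formula predicts.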
By the way, the paper you referred to is essentially about testing for zero correlation when $X$ and $Y$ are independent. In that case we can inspect Hawkins' formula and observe that the only relevant mixed moment is $m_{22}$, which equals 1 by independence, so the resulting asymptotic variance equals the normality-implied one no matter the distribution of $X$ and $Y$. The results in that paper are entirely in line with Hawkins' math (and probably also with less obscure results from regression theory), but they are not informative about the general performance of normality-based inference. When testing for correlations we are usually interested in more general cases than $X$ and $Y$ being independent, and certainly so when we construct confidence intervals. More relevant and informative simulations can be found in the Bishara paper below, see Figure 1.
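To spell out that step (my notation, via the standard delta-method expansion; Hawkins' own symbols may differ): writing $\mu_{jk} = E\big[(X-\mu_X)^j (Y-\mu_Y)^k\big]$, in the case $\rho = 0$ one gets

$$
\sqrt{n}\, r \;\xrightarrow{d}\; N\!\left(0,\ \frac{\mu_{22}}{\mu_{20}\,\mu_{02}}\right) = N(0,\ m_{22}),
$$

and under independence $\mu_{22} = \mu_{20}\mu_{02}$, so $m_{22} = 1$ and the asymptotic variance coincides with the value $(1-\rho^2)^2 = 1$ implied by bivariate normality.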
So are there any decent plug-and-play methods for this problem? I’d suggest just doing a simple percentile non-parametric bootstrap. The coverage isn’t perfect, but the method is standard, easy to use, and you will avoid extreme undercoverage.
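Something like this, in Python/NumPy (the function name, B, and the confidence level are my own choices; any bootstrap routine in your favourite package does the same job):

```python
# Sketch of a nonparametric percentile bootstrap CI for a Pearson correlation:
# resample (x_i, y_i) pairs with replacement, recompute r, take empirical quantiles.
import numpy as np

def pearson_percentile_ci(x, y, B=5000, level=0.95, seed=0):
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    rs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # resample pairs, keeping them paired
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    alpha = 1.0 - level
    return np.quantile(rs, [alpha / 2, 1.0 - alpha / 2])

# Example with made-up skewed data: a real association plus lognormal noise.
rng = np.random.default_rng(1)
x = rng.lognormal(size=200)
y = x + rng.lognormal(size=200)
print(np.corrcoef(x, y)[0, 1], pearson_percentile_ci(x, y))
```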
Hawkins, D. L. (1989). Using U statistics to derive the asymptotic distribution of Fisher's Z statistic. The American Statistician, 43(4), 235–237. https://doi.org/10.1080/00031305.1989.10475666
Bishara, A. J., Li, J., & Nash, T. (2018). Asymptotic confidence intervals for the Pearson correlation via skewness and kurtosis. The British Journal of Mathematical and Statistical Psychology, 71(1), 167–185. https://doi.org/10.1111/bmsp.12113
Sorry for this slow reply! Skewness and kurtosis are certainly relevant to the reliability of the p-value. Admittedly, when talking about statistical assumptions, I've usually made the point about not needing to worry too much in the context of t-tests comparing two groups, where I don't think this concern applies.
> So are there any decent plug-and-play methods for this problem?
Would a Spearman correlation likewise address that? It wouldn't be measuring the same thing as a Pearson correlation with bootstrapping for the CI and p-value, as you say, but a Spearman correlation is what I'd lean toward as an easy fix (and Spearman correlations, I think, are often fine as a default option when exploring data).
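For concreteness, the kind of one-liner I have in mind (using scipy on made-up skewed data; the reported p-values are still approximations, so the same caveats apply):

```python
# Quick comparison of Pearson vs. Spearman on made-up, right-skewed data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.lognormal(size=200)
y = x + rng.lognormal(size=200)

r_pearson, p_pearson = stats.pearsonr(x, y)      # normal-theory p-value
r_spearman, p_spearman = stats.spearmanr(x, y)   # rank-based: monotonic, not linear, association
print(r_pearson, p_pearson)
print(r_spearman, p_spearman)
```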