gwern comments on Participation in the LW Community Associated with Less Bias

gwern 10 Dec 2012 1:09 UTC
3 points
Out of curiosity, I looked at what a more appropriate logistic regression would say (using this guide): does the number of survey entries a respondent left missing/omitted (as a proxy for time investment) predict their answer to the categorical question? The numbers and method are a little different from a t-test, and the result is a little less statistically significant, but as before there’s no real relationship*:
```
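R> lw <- read.csv("2012.csv")
R> # count, per respondent, how many fields are NA or blank (" ")
R> lw$MissingAnswers <- apply(lw, 1, function(x) sum(sapply(x, function(y) is.na(y) || as.character(y)==" ")))
R> # drop respondents who left CFARQuestion1 blank or NA
R> lw <- lw[as.character(lw$CFARQuestion1) != " " & !is.na(as.character(lw$CFARQuestion1)),]
R> lw <- data.frame(lw$CFARQuestion1, lw$MissingAnswers)
R> # logistic regression: question answer as a function of the missing-answer count
R> summary(glm(lw.CFARQuestion1 ~ lw.MissingAnswers, data = lw, family = "binomial"))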

Deviance Residuals:
   Min      1Q  Median      3Q     Max
 -1.17   -1.12   -1.05    1.23    1.41

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        0.00111    0.12214    0.01     0.99
lw.MissingAnswers -0.00900    0.00607   -1.48     0.14

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1366.6  on 989  degrees of freedom
Residual deviance: 1364.4  on 988  degrees of freedom
AIC: 1368

Number of Fisher Scoring iterations: 3
```
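For a more readable scale, the coefficient can be turned into an odds ratio with a Wald interval, and the deviance drop can be checked as a likelihood-ratio test. This is only a sketch that re-uses the rounded numbers printed above, not a re-run on the survey data, so the last digits are approximate; the variable names are made up for illustration.

```r
# Values copied from the summary() output above (rounded as printed)
est <- -0.00900   # coefficient for lw.MissingAnswers
se  <-  0.00607   # its standard error

# Odds ratio per additional missing answer, with a 95% Wald interval
odds_ratio <- exp(est)
wald_ci    <- exp(est + c(-1, 1) * qnorm(0.975) * se)

# Likelihood-ratio test: drop in deviance on 989 - 988 = 1 degree of freedom
lr_stat <- 1366.6 - 1364.4
lr_p    <- pchisq(lr_stat, df = 989 - 988, lower.tail = FALSE)

print(round(c(OR = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 3))
print(round(lr_p, 2))
```

The interval for the odds ratio straddles 1, and the likelihood-ratio p-value lands close to the Wald p-value in the table, so both readings agree with "no real relationship".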
* A note to other analyzers: it’s really important to remove null answers/NAs because they’ll show relationships all over the place. In this example, if you leave NAs in for the CFARQuestion1 field, you’ll wind up getting a very statistically significant relationship, because every CFARQuestion left NA by definition increases MissingAnswers by 1! And people who didn’t answer that question probably didn’t answer a lot of other questions, so the NA respondents enable a very easy, reliable prediction of MissingAnswers.
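The artifact described in the note can be reproduced on made-up data. The sketch below is a simulation, not the survey: every name and number in it is hypothetical, and it assumes one particular mechanism, namely that blank answers survive as their own factor level, so that `glm()` (which treats the first level as failure and all others as success) ends up modelling "answered vs. left blank".

```r
set.seed(1)
n <- 1000

# Hypothetical respondents: a binary answer that is unrelated, by construction,
# to how many other questions they skipped
skipped_other <- rpois(n, 5)
answer <- sample(c("A", "B"), n, replace = TRUE)

# Hypothetical non-respondents: they leave this question blank and skip more elsewhere
nonresp <- sample(n, 100)
answer[nonresp] <- " "
skipped_other[nonresp] <- skipped_other[nonresp] + rpois(100, 4)

sim <- data.frame(
  Answer = factor(answer, levels = c(" ", "A", "B")),
  # the blank itself also counts as one missing answer
  MissingAnswers = skipped_other + (answer == " ")
)

# Helper: p-value of the MissingAnswers slope in a binomial glm
slope_p <- function(d) {
  fit <- glm(Answer ~ MissingAnswers, data = d, family = "binomial")
  coef(summary(fit))["MissingAnswers", "Pr(>|z|)"]
}

p_with_blanks <- slope_p(sim)                 # blanks left in
clean <- droplevels(sim[sim$Answer != " ", ]) # blanks removed, as in the analysis above
p_clean <- slope_p(clean)

print(c(with_blanks = p_with_blanks, cleaned = p_clean))
```

With the blanks left in, the slope is driven almost entirely by the non-respondents and its p-value is tiny; once they are dropped, only the (here nonexistent) relationship between answer and missing count is being tested.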