Out of curiosity, I looked at what a more appropriate logistic regression would say (using this guide): is there any relationship between the categorical answer to the question and how many survey entries were missing/omitted (as a proxy for time investment)? Strictly speaking, the glm call below runs in the other direction, modeling the answer as a function of the missing-answer count, but it tests the same association. The numbers and method are a little different from a t-test, and the result is a little less statistically significant, but as before there’s no real relationship*:
R> lw <- read.csv("2012.csv")
R> lw$MissingAnswers <- apply(lw, 1, function(x) sum(sapply(x, function(y) is.na(y) || as.character(y)==" ")))
R> lw <- lw[as.character(lw$CFARQuestion1) != " " & !is.na(as.character(lw$CFARQuestion1)),]
R> lw <- data.frame(lw$CFARQuestion1, lw$MissingAnswers)
R> summary(glm(lw.CFARQuestion1 ~ lw.MissingAnswers, data = lw, family = "binomial"))
Deviance Residuals:
   Min      1Q  Median      3Q     Max
 -1.17   -1.12   -1.05    1.23    1.41

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        0.00111    0.12214    0.01     0.99
lw.MissingAnswers -0.00900    0.00607   -1.48     0.14

(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1366.6 on 989 degrees of freedom
Residual deviance: 1364.4 on 988 degrees of freedom
AIC: 1368
Number of Fisher Scoring iterations: 3
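
For anyone who wants to read that coefficient on a more intuitive scale, here is a minimal sketch that just redoes the arithmetic from the printed summary; it does not refit anything. The estimate, standard error, and deviances are copied from the output above. The variable names are mine, and the Wald interval is only the usual normal approximation, so treat it as a rough check. (With a factor response, glm treats the first factor level as "failure", so the odds here are those of the other level of CFARQuestion1; which answer that is depends on the level ordering in the CSV, which I'm not assuming.)

```r
# Values copied from the summary() output above
est <- -0.00900   # coefficient on lw.MissingAnswers (log-odds per missing answer)
se  <- 0.00607    # its standard error

# Multiplicative change in the odds per additional missing answer
odds_ratio <- exp(est)

# Approximate 95% Wald interval on the odds-ratio scale
wald_ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

# Wald z statistic and two-sided p-value, as in the Coefficients table
z      <- est / se
p_wald <- 2 * pnorm(-abs(z))

# Likelihood-ratio test from the null and residual deviances (1 df)
lr_stat <- 1366.6 - 1364.4
p_lr    <- pchisq(lr_stat, df = 989 - 988, lower.tail = FALSE)

print(round(c(odds_ratio = odds_ratio, lower = wald_ci[1], upper = wald_ci[2]), 4))
print(round(c(z = z, p_wald = p_wald, p_lr = p_lr), 3))
```

The odds ratio comes out at roughly 0.99 per missing answer, with an interval that includes 1, and both p-values land near the 0.14 in the table: the same "no real relationship" conclusion.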
* A note to other analysts: it’s really important to remove null answers/NAs before fitting, because otherwise they’ll show relationships all over the place. In this example, if you leave the NAs in for the CFARQuestion1 field, you’ll wind up with a very statistically significant relationship, because every CFARQuestion left NA by definition increases MissingAnswers by 1! And people who didn’t answer that question probably didn’t answer a lot of other questions, so the NA respondents enable a very easy, reliable prediction of MissingAnswers…
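
To make the footnote concrete, here is a small simulation sketch. Everything in it is synthetic and hypothetical: the 20-question survey, the "careless" subgroup, and the skip probabilities are assumptions I made up for illustration, not anything from the 2012 data. The answers are generated independently of how many questions a respondent skips, so any strong relationship can only come from the "didn't answer" category. (In the real CSV the blank answers would show up as their own factor level; the sketch mimics that by modeling "skipped q1" directly.)

```r
set.seed(1)
n   <- 1000   # hypothetical number of respondents
n_q <- 20     # hypothetical number of survey questions

# Assumption: about 20% of respondents are "careless" and skip far more often
careless <- rbinom(n, 1, 0.2) == 1
p_skip   <- ifelse(careless, 0.5, 0.05)

# Binary answers, generated independently of skipping behaviour
answers <- matrix(rbinom(n * n_q, 1, 0.5), nrow = n)

# Each respondent (row) skips each question with their own probability;
# p_skip has length n, so it recycles down every column and lines up with rows
skipped <- matrix(runif(n * n_q) < p_skip, nrow = n)
answers[skipped] <- NA

d <- data.frame(q1 = answers[, 1], missing = rowSums(is.na(answers)))

# (a) Keep the non-answers in as a category: "skipped q1" vs. missing count
d$q1_skipped <- as.integer(is.na(d$q1))
fit_skipped  <- glm(q1_skipped ~ missing, data = d, family = "binomial")
p_skipped    <- coef(summary(fit_skipped))["missing", "Pr(>|z|)"]

# (b) Drop the non-answers first, then model the actual answer
answered  <- d[!is.na(d$q1), ]
fit_clean <- glm(q1 ~ missing, data = answered, family = "binomial")
p_clean   <- coef(summary(fit_clean))["missing", "Pr(>|z|)"]

print(c(p_skipped = p_skipped, p_clean = p_clean))
```

Model (a) should come out overwhelmingly significant purely because skipping q1 is part of the missing count and skippers tend to skip everything else too. Model (b) has no built-in relationship, so its p-value is just whatever chance produces.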