Experiment: a good researcher is hard to find

See previously “A good volunteer is hard to find”

Back in February 2012, lukeprog announced that SIAI was hiring more part-time remote researchers, and you could apply just by demonstrating your chops on a simple test: review the psychology literature on habit formation with an eye towards practical application. What factors strengthen new habits? How long do they take to harden? And so on. I was assigned to read through and rate the submissions, and Luke could then look at them individually to decide whom to hire. We didn’t get as many submissions as we were hoping for, so in April Luke posted again, this time with a quicker, easier application form. (I don’t know how that has been working out.)

But in February, I remembered the linked post above from GiveWell, which mentioned that many would-be volunteers did not even finish the test task. I did, and I didn’t find it that bad; it was actually a kind of interesting exercise in critical thinking & being careful. People suggested that perhaps the attrition was due not to low volunteer quality, but to the feeling that the volunteers were not appreciated and were doing useless makework. (The same reason so many kids hate school…) But how to test this?

Simple! Tell people that their work was not useless and that even if they were not hired, their work would be used! And we could do Science by randomizing which applicants got the encouraging statement. The added paragraph looked like this:

The primary purpose of this project is to evaluate applicants on their ability to do the kind of work we need, but we’ll collate all the results into one good article on the subject, so even if we don’t hire you, you don’t have to feel your time was wasted.

Well, all the reviews have been read & graded as of yesterday, with submissions trickling in over months; I think everyone who was going to submit has done so, and it’s now time for the final step. So many people failed to send in any submission (only ~18 of the ~40 applicants did) that it’s relatively easy to analyze: there’s just not that much data!

So, the first question is: did people who got the extra paragraph do a better job of writing their review, as expressed in my 2-10 rating of it?

Surprisingly, they did seem to, despite my expectation that any result would be noise since the sample is so small. If we code getting no paragraph as 0 and getting a paragraph as 1, add the two sub-scores to get the 2-10 rating, and strip out all personal info, we get this CSV. Load it up in R:

    > mydata <- read.table("2012-feb-researcher-scores.csv", header=TRUE, sep=",")
    > t.test(Good~Extra, data=mydata)

            Welch Two Sample t-test

    data:  Good by Extra
    t = -2.448, df = 14.911, p-value = 0.02723
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -4.028141 -0.277415
    sample estimates:
    mean in group 0 mean in group 1
           4.625000        6.777778
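For anyone reproducing this, a quick sanity check of the loaded data plus a rank-based cross-check of the t-test (my addition, not part of the original analysis; with an n this small the two tests can easily disagree) would look something like:

    # how many graded submissions fell into each arm (0 = no paragraph, 1 = paragraph)
    table(mydata$Extra)
    # nonparametric cross-check of the Welch t-test on the same 2-10 ratings
    wilcox.test(Good ~ Extra, data=mydata)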

The result is not hugely robust: if you set the last score to 10 rather than 6, for example, the p-value rises to 0.16. The effect size looks interesting, though:

    ....
    mean in group 0 mean in group 1
           5.125000        6.777778
    > sd(mydata$Good, TRUE)
    [1] 2.318405
    > (6.7 - 5.125) / 2.32
    [1] 0.6788793

An effect size of ~0.68 isn’t bad.
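The one-score sensitivity check mentioned above can be scripted along these lines; this is only a sketch, since which row counts as “the last score” (and hence the exact p-value and effect size) depends on how the CSV happens to be ordered:

    # perturb one score and re-run the Welch t-test
    perturbed <- mydata
    perturbed$Good[nrow(perturbed)] <- 10   # hypothetically bump the final 6 up to a 10
    t.test(Good ~ Extra, data=perturbed)
    # the effect size is just the difference in group means over the overall SD of the ratings
    diff(tapply(perturbed$Good, perturbed$Extra, mean, na.rm=TRUE)) / sd(perturbed$Good, na.rm=TRUE)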

The next question to me is, did the paragraph influence whether people would send in a submission at all? Re-editing the CSV, we load it up and analyze again:

    > mydata <- read.table("2012-feb-researcher-completion.csv", header=TRUE, sep=",")
    > t.test(Received~Extra, data=mydata)

            Welch Two Sample t-test

    data:  Received by Extra
    t = 0.1445, df = 36.877, p-value = 0.8859
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -0.3085296  0.3558981
    sample estimates:
    mean in group 0 mean in group 1
          0.4736842       0.4500000

Nope. This null result is fairly robust, since we can use everyone who applied: I have to flip something like 6 values before the p-value drops to 0.07.
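Roughly how one could automate that check (a sketch: the original flipping was done by hand, and both the direction and the order of the flips here are arbitrary choices):

    # flip non-submissions in the paragraph group to submissions one at a time,
    # re-running the t-test after each flip to watch the p-value move
    flipped <- mydata
    candidates <- which(flipped$Extra == 1 & flipped$Received == 0)
    for (i in candidates) {
        flipped$Received[i] <- 1
        flips <- sum(flipped$Received != mydata$Received)
        p <- t.test(Received ~ Extra, data=flipped)$p.value
        cat(flips, "flips: p =", round(p, 3), "\n")
    }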

So, lessons learned? It’s probably a good idea to include such a paragraph: it’s cheap, it seems to improve the quality of the submissions, and it apparently doesn’t come at the expense of the submission rate.