Experiment: a good researcher is hard to find

See previously “A good volunteer is hard to find”

Back in February 2012, lukeprog announced that SIAI was hiring more part-time remote researchers, and you could apply just by demonstrating your chops on a simple test: review the psychology literature on habit formation with an eye towards practical application. What factors strengthen new habits? How long do they take to harden? And so on. I was assigned to read through and rate the submissions, and Luke could then look at them individually to decide whom to hire. We didn’t get as many submissions as we were hoping for, so in April Luke posted again, this time with a quicker, easier application form. (I don’t know how that has been working out.)

But in February, I remembered the post linked above from GiveWell, where they mentioned that many would-be volunteers did not even finish the test task. I did, and I didn’t find it that bad; it was actually an interesting exercise in critical thinking & being careful. People suggested that perhaps the attrition was due not to low volunteer quality, but to the feeling that the volunteers were not appreciated and were doing useless makework. (The same reason so many kids hate school…) But how to test this?

Simple! Tell people that their work was not useless, and that even if they were not hired, their work would be used! And we could do Science by randomizing which people got the encouraging statement. The added paragraph looked like this:

The primary purpose of this project is to evaluate applicants on their ability to do the kind of work we need, but we’ll collate all the results into one good article on the subject, so even if we don’t hire you, you don’t have to feel your time was wasted.
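For concreteness, here is a minimal sketch in R of how such a random assignment could be done (hypothetical; the actual assignment mechanism isn’t described above):

# Hypothetical sketch of the randomization, not the actual assignment code:
# each of the ~40 applicants gets the extra paragraph (Extra=1) or not (Extra=0).
set.seed(2012)                              # arbitrary seed, for reproducibility
applicants <- paste0("applicant", 1:40)     # anonymized placeholder IDs
extra <- sample(rep(0:1, length.out=40))    # shuffled balanced assignment
assignments <- data.frame(Applicant=applicants, Extra=extra)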

Well, all the reviews have been read & graded as of yesterday, with submissions having trickled in over months; I think everyone who was going to submit has done so, and it’s now time for the final step. So many people failed to send in any submission (only ~18 of ~40 did) that it’s relatively easy to analyze: there’s just not that much data!

So, the first question is: did people who got the extra paragraph do a better job of writing their review, as expressed in my rating of it from 2-10?

Surprisingly, they did seem to, despite my expectation that any result would be noise given how small the sample is. If we code getting no paragraph as 0 and getting a paragraph as 1, add the two scores to get a 2-10 total, and strip out all personal info, you get this CSV. Load it up in R:

> mydata <- read.table("2012-feb-researcher-scores.csv", header=TRUE, sep=",")
> t.test(Good~Extra, data=mydata)

        Welch Two Sample t-test

data:  Good by Extra
t = -2.448, df = 14.911, p-value = 0.02723
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.028141 -0.277415
sample estimates:
mean in group 0 mean in group 1
       4.625000        6.777778
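With a sample this small and ratings that are really ordinal, a cheap cross-check of the t-test above (not part of the original analysis) is a nonparametric test:

# Wilcoxon rank-sum test as a distribution-free cross-check of the Welch t-test;
# assumes mydata from the scores CSV is still loaded.
wilcox.test(Good ~ Extra, data=mydata)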

The result is not hugely robust: if you set the last score to 10 rather than 6, for example, the p-value rises to 0.16. The effect size looks interesting, though:

....
mean in group 0 mean in group 1
       5.125000        6.777778
> sd(mydata$Good, TRUE)
[1] 2.318405
> (6.7 - 5.125) / 2.32
[1] 0.6788793

A d of ~0.68 isn’t bad.
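The division above uses the SD of the whole sample; the more conventional Cohen’s d pools the two groups’ SDs instead. A sketch in base R, assuming mydata is still loaded:

# Cohen's d with a pooled standard deviation, rather than the whole-sample SD
# used above; na.omit() drops any applicants without a score.
g0 <- na.omit(mydata$Good[mydata$Extra == 0])
g1 <- na.omit(mydata$Good[mydata$Extra == 1])
sp <- sqrt(((length(g0)-1)*var(g0) + (length(g1)-1)*var(g1)) /
           (length(g0) + length(g1) - 2))
(mean(g1) - mean(g0)) / sp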

The next question to me is: did the paragraph influence whether people would send in a submission at all? Re-editing the CSV, we load it up and analyze again:

> mydata <- read.table("2012-feb-researcher-completion.csv", header=TRUE, sep=",")
> t.test(Received~Extra, data=mydata)

        Welch Two Sample t-test

data:  Received by Extra
t = 0.1445, df = 36.877, p-value = 0.8859
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3085296  0.3558981
sample estimates:
mean in group 0 mean in group 1
      0.4736842       0.4500000

Nope. This result is somewhat robust, since we can use everyone who applied rather than just those who submitted; I have to flip something like 6 values before the p-value drops to 0.07.
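Since Received is binary, an exact test on the 2×2 table is a natural alternative to a t-test on 0/1 values; the counts below are the ones implied by the group means above (9/19 without the paragraph vs 9/20 with it):

# Fisher's exact test on the 2x2 submission table; counts are implied by the
# reported group means (9/19 and 9/20), not taken from the raw CSV.
submissions <- matrix(c(9, 10,    # no paragraph: 9 submitted, 10 didn't
                        9, 11),   # paragraph:    9 submitted, 11 didn't
                      nrow=2, byrow=TRUE,
                      dimnames=list(Extra=c("0","1"), Sent=c("yes","no")))
fisher.test(submissions)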

So, lessons learned? It’s probably a good idea to include such a paragraph: it’s cheap, it may improve the quality of what gets submitted, and it apparently doesn’t come at the expense of the submission rate.