Good Samaritans in experiments

Consider 2 people. Both are seminary students taking part in an experiment ostensibly about different types of religiosity. One is asked to prepare a short talk on the Good Samaritan, the other on potential future careers for seminary graduates.

They are both told to go to another room to record their talk. The one who is to give a talk on the Good Samaritan is told that he is late and needs to hurry. The other participant is told that he has time to spare.

If they, separately, come across someone who appears to be in respiratory distress, which do you think is more likely to stop and help?

Does being in a hurry determine whether someone helps?

Does reading the Good Samaritan?

Which is a bigger effect?

I was recently told about an experiment which showed that seminary students who had just prepared a talk about the Good Samaritan were no more likely to help someone in need than those who had been preparing a talk on an unrelated topic.

This seemed unexpected to me – people who had just been reading and thinking about a story told specifically by the leader of their faith to instruct them to help other people were no more likely to help than the control? I know humanity is crazy, but this seemed like a new level of crazy which I wouldn’t have predicted.

So I thought I’d check out the study and – Aaaaaaaaaaaaaaaaaah!!!

I know getting overly upset about bad experiments (especially those from before the replication crisis) is probably bad for my health but still – Aaaaaaaaaaaaaaaaaaaaaaaah!!!

I don’t want to be too harsh on the authors as this probably isn’t the worst culprit you’ll see but – Aaaaaaaaaaaaaaah!!!

The paper has 1811 citations listed on Google Scholar – Aaaaaaaaaaaaaaah!!!

I’m tempted to pretend that this post has some purpose other than as a release of my frustration, but that would be dishonest. Please consider this post a form of therapy for me. The working title for this post was “Screaming into the void” – consider yourself warned.

(If you want a more coherent discussion of common misuses of statistics in research papers, I highly recommend Putanumonit’s Defense Against the Dark Arts series.)

The Experiment

Ok, so the basic premise of the experiment seems sound. We want to know which inputs make people more or less likely to help others:

1. Planning a talk on the Good Samaritan (GS)

2. Being in a hurry

3. Type of religiosity (religion as quest, means or end)

The setup is to give people a questionnaire to determine their type of religiosity, then give them some time to plan a short talk (3-5 mins) on GS or an unrelated topic. They are then asked to go to another room to give the talk (with 3 degrees of urgency – low, medium and high).

Contrary to the example given in the introduction, the level of hurriedness doesn’t depend on which topic the individual has prepared – there are 6 conditions people are put in: GS low, medium and high urgency, and control low, medium and high urgency.

On the way to the other room, you arrange for them to come across someone slumped in a doorway with an apparent respiratory condition.

You monitor the subjects’ responses and analyse the results.

My first question was whether they would adjust their p-value threshold for the 5 variables they were testing, but no – it turns out that p<0.05 was deemed adequate for significance. Ok, could be worse I guess. More on this later.
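For comparison, the standard multiple-testing adjustments are mechanical to compute. A minimal sketch (my own illustration, not from the paper) of the two usual corrections for 5 tests at a familywise alpha of 0.05:

```python
# Per-test significance thresholds when testing m hypotheses
# while keeping the familywise error rate at alpha.
m = 5
alpha = 0.05

bonferroni = alpha / m               # simple and slightly conservative
sidak = 1 - (1 - alpha) ** (1 / m)   # exact if the tests are independent

print(f"Bonferroni threshold: {bonferroni:.4f}")  # 0.0100
print(f"Sidak threshold:      {sidak:.4f}")       # 0.0102
```

Either way, each individual test would need roughly p<0.01 rather than p<0.05 to count as significant.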

The second place where doubts started to creep in was the ranking of responses:

0 = failed to notice the victim as possibly in need at all;

1 = perceived the victim as possibly in need but did not offer aid;

2 = did not stop but helped indirectly (e.g., by telling Steiner’s assistant about the victim);

3 = stopped and asked if victim needed help;

4 = after stopping, insisted on taking the victim inside and then left him;

5 = after stopping, refused to leave the victim (after 3-5 minutes) and/or insisted on taking him somewhere outside the experimental context (e.g., for coffee or to the infirmary).

It seems to me that there are two different things being measured here:

1. How likely they are to notice someone in distress

2. If they notice, how likely they are to help

However, these two factors have been put on the same scale. Surely it would be far more informative to separate them – does being in a hurry make you less likely to help someone you see who is in need, or does it just make you less observant?

But this is relatively minor, and the victim doesn’t really mind why you’re not helping, just whether you help or not – there could still be some useful results here. From the conclusion, it looks like failing to notice the victim at all was rare, so this failure mode was maybe not too big a deal.

Drawing the wrong conclusions

So, on to the conclusions from the report:

1. People who had been preparing GS talks were no more likely to help

2. People who were in a hurry were less likely to help

3. Religiosity type had an effect on the type of help given

It was the first that I was most interested in, so I looked at the actual results.

53% of the people who had been preparing GS talks offered some kind of help (10/19). 29% of the people preparing non-GS talks offered some kind of help (6/21).

Wait, surely that means people who prepared a GS talk were 1.8x as likely to help as those with an alternative topic? Oh no, says the report. The difference was not significant at the p<0.05 level; therefore, there is no effect. This isn’t specifically stated in that way, but “lack of significant effect” is immediately followed by “lack of effect”:

Although the degree to which a person was in a hurry had a clearly significant effect on his likelihood of offering the victim help, whether he was going to give a sermon on the parable or on possible vocational roles of ministers did not. This lack of effect of sermon topic raises certain difficulties for an explanation of helping behavior involving helping norms and their salience.
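We can check what a simple significance test says about those raw counts. A sketch using Fisher’s exact test (which may differ from the analysis the paper itself ran):

```python
from scipy.stats import fisher_exact

# Counts from the paper: 10 of 19 GS-talk participants helped,
# 6 of 21 control-talk participants helped.
table = [[10, 9],    # GS talk:      helped, didn't help
         [6, 15]]    # control talk: helped, didn't help

oddsratio, p = fisher_exact(table)
print(f"odds ratio = {oddsratio:.2f}, p = {p:.3f}")
# p lands above 0.05 – "not significant" despite the large difference in help rates
```

So the non-significance claim checks out; the question is what conclusion that licenses.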

The paper goes some way towards redeeming itself by stating:

The results were in the direction suggested by the norm salience hypothesis, but they were not significant. The most accurate conclusion seems to be that salience of helping norms is a less strong determinant of helping behavior in the present situation than many, including the present authors, would expect.

It then undoes the good work in the next sentence:

Thinking about the Good Samaritan did not increase helping behaviour

Part of me wants to be happy that they at least included a fairly accurate description of the evidence, but the repeated statement of the incorrect conclusion throughout the report can only lead readers to the wrong conclusion.

At one point, the paper seems to go even further and claims that the fact that we can’t reject the null hypothesis is confirmation of the null hypothesis:

The prediction involved in the first hypothesis concerning the message content was based on the parable. The parable itself seemed to suggest that thinking pious thoughts would not increase helping. Another and conflicting prediction might be produced by a norm salience theory. Thinking about the parable should make norms for helping salient and therefore produce more helping. The data, as hypothesized, are more congruent with the prediction drawn from the parable. A person going to speak on the parable of the Good Samaritan is not significantly more likely to stop to help a person by the side of the road than is a person going to talk about possible occupations for seminary graduates. Since both situational hypotheses are confirmed…

Somehow, the paper manages to make the “pious thoughts are ineffective” hypothesis into the null hypothesis and the “norm salience” hypothesis into the alternative hypothesis. Then, when the results are not significant enough to reject the null hypothesis, this is treated as confirmation that the null hypothesis is true. This is the equivalent of accepting anything up to p<0.95 as evidence for the “pious thoughts are ineffective” hypothesis.

(Aside: I’m no theologian, but I’m not really sure that “pious thoughts are ineffective” is really what the parable implies. Jesus often used the religious leaders as the bad guys in his parables, so he may just have been repeating that point.)

Effect Size, Significance and Experimental Power

I think the issue here is confusion between effect size and significance.

The effect size is actually pretty good (an 80% increase in helping). In fact, in the low-urgency condition, GS participants averaged an impressive score of 3.8 (compared to 1.667 for the equivalent non-GS participants).

The fact that this doesn’t rise to significance has little to do with effect size and everything to do with experimental power.

The sample size was 40. There were 6 categories relating to the first 2 hypotheses (3 hurry conditions x 2 message conditions). If for each of the 3 religiosity-type conditions a participant was just rated as “high” or “low”, then this is 8 categories. That makes a total of 48 possible categorisations for each subject to cover the 3 hypotheses. We’ve managed to get more potential categorisations of each subject than we have subjects.

(Actually, this may not be irretrievable in and of itself – it just threw up a big red flag for me. If all the other parts of the experiment were on the money, this could just be efficiently testing as many different hypotheses as possible given the limited data points available. The real problem is that if we adjust for multiple-variable testing then the required p-value for significance goes down, and power goes down with it.)

In addition to sample size, experimental power depends on variation in the dependent variable due to other sources (I’m happy to accept that they had low measurement error). My best guess is that there is significant variation due to other sources, although I don’t have the data to show this. A number of personality traits had been investigated previously (Machiavellianism, authoritarianism, social desirability, alienation, and social responsibility) and found not to correlate significantly with helping behaviour, so my expectation would be that finding a true effect is difficult and unexplained variation in helping is large.

If experimental power is low, then in order to find significant results the effect size must be large.

As the effect size of reading GS was below the effect size required, the result is not statistically significant.

If an effect size of increasing helping by 80% is not significant, you really should have known before the experiment that you didn’t have enough power.
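As a rough illustration, we can simulate the power of this design. This is my own sketch, assuming the true help rates equal the observed proportions (29% and 53%) and an even 20-per-group split, neither of which the paper states:

```python
import random

from scipy.stats import fisher_exact

random.seed(0)

def simulated_power(p_control, p_treatment, n_per_group, alpha=0.05, reps=2000):
    """Fraction of simulated experiments in which the group difference
    reaches p < alpha under Fisher's exact test."""
    hits = 0
    for _ in range(reps):
        helped_t = sum(random.random() < p_treatment for _ in range(n_per_group))
        helped_c = sum(random.random() < p_control for _ in range(n_per_group))
        table = [[helped_t, n_per_group - helped_t],
                 [helped_c, n_per_group - helped_c]]
        _, p = fisher_exact(table)
        hits += p < alpha
    return hits / reps

power = simulated_power(0.29, 0.53, 20)
print(f"estimated power ~ {power:.2f}")  # well below the conventional 0.8
```

Under these assumptions, an experiment of this size would detect a real effect of this magnitude only a minority of the time.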

Further reducing experimental power

If you thought that N=40 was questionable, wait until you see what comes next. The paper goes on to see whether the input variables correlate with the amount of help given when help was given. Only 16 people gave any help, so suddenly N=16.

This seems like bad news for finding significance, but suddenly we do have a significant effect. It turns out that scoring higher on seeing religion as a quest makes you likely to offer less help than if you score lower on this metric. This is contrary to the experimenters’ expectations.

After performing some extra calculations, the experimenters conclude that this is because those who scored lower on this metric were likely to offer over-the-top assistance and score a 5, which skewed the results.

Allow me to offer an alternative explanation.

The paper has so far calculated 18 different p-values (3 from the ANOVA of message x hurry, 10 from linear regression of the full data (5 x help vs no help, 5 x the scoring system) and 5 from linear regression of only the helpful participants). There were actually another 10 p-values calculated in their stepwise multiple regression analysis, but these seem to have been ignored so I’ll gloss over that.

Now, for each p-value you calculate where there is no true effect, you have a 5% chance of finding a spurious result. I’ll take off the 3 p-value calculations which yielded true effects, leaving 15 opportunities to get a spurious p-value.

0.95 ^ 15 = 0.46

At this point, you are more likely than not to have achieved a spurious p-value somewhere among all the calculated p-values. Some of the p-values calculated are related, so that may change the exact value, but the probability of a spurious result is uncomfortably high.
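That arithmetic generalises: the chance of at least one spurious "significant" result grows quickly with the number of tests. A quick check of the figure above, assuming independent tests:

```python
def familywise_error(alpha, n_tests):
    """P(at least one false positive) across independent tests with no true effects."""
    return 1 - (1 - alpha) ** n_tests

# 15 remaining p-values at the paper's 0.05 threshold
print(f"{familywise_error(0.05, 15):.2f}")  # 0.54 – a spurious hit is more likely than not
```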

Remember that an increase in helping of 80% didn’t achieve significance when N=40. The effect size must be truly huge in order to achieve significance with N=16 (the actual effect size isn’t given in the report).

Because the prior for this effect being true is fairly low (it’s huge and in the opposite direction to expectation), it would have been reasonable for the report to say that the p-value is probably spurious, with a note that this might be worth investigating further in the future.

Instead, the report ends up with the weird conclusion that people with low religion-as-quest scores are more likely to offer over-the-top help. The fact that they achieve an additional significant p-value when they introduce a new categorisation system (over-the-top help vs reasonable help) doesn’t add much to the likelihood of their conclusion – it just shows that they are able to look at their data and see a pattern.

Introducing new variables

At this point, another input variable is introduced. The original 3 types of religiosity were made up of scores from 6 different scales, which were weighted to create the 3 types. Suddenly one of the 6 original scales is grabbed out (doctrinal orthodoxy), and this correlates even more strongly with giving over-the-top help (p<0.01).

Introducing a new categorisation (over-the-top help) and a new variable (doctrinal orthodoxy) to try to explain a (probably) spurious p-value from multiple hypothesis testing is NOT a good idea.

We now have 4 different potential categorisations and 11 variables (the original 5 plus the 6 newly introduced scales). This makes 44 different potential p-values to calculate, even before we consider the different types of tests that the authors might try (simple linear regression, ANOVA, stepwise multiple linear regression). I don’t think they calculated all of these 44+ p-values, but rather looked at the data and decided which ones looked promising.

0.99 ^ 44 = 0.64

So now, even in the best case, a p<0.01 would happen in more than a third of similar experiments just by coincidence.
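The same arithmetic as before, applied to the 44 potential tests at the stricter p<0.01 threshold (my own quick check, again assuming independent tests):

```python
# Chance of at least one p < 0.01 across 44 independent tests with no true effects
p_any = 1 - (1 - 0.01) ** 44
print(f"{p_any:.2f}")  # 0.36 – more than a third of such experiments
```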

I don’t think that the effect described is impossible, but I think the failure to adjust for multiple variables is a much more likely explanation.


So, in conclusion, against all expectation, reading and preparing a talk on a parable given by the leader of your religion about helping people in need does, in fact, appear to increase the likelihood that you will, in the next 5 minutes, help someone who is in need.

The fact that being in a hurry has a larger effect is the truly interesting finding here, but I think not a huge surprise.

This is why I asked the question the way I did in the introduction – I didn’t get the chance to guess this blind, and I’m not sure which way I would have voted if I had.

I’m confident that I wouldn’t have predicted quite such a big drop in helping between GS low hurry and GS high hurry, so I’ll have to update accordingly (average score 3.8 down to 1).

One final thing: