Excessive EDA Effortposting

Introduction and motivation

Science! It’s important. It’s also difficult, largely because of how easily people fool themselves. I’ve seen plenty of people (my younger self included!) lose the plot as they attempt their first few investigations, letting subtle methodological problems render their entire chain of reasoning worthless.

So I’ve decided to pick a small, well-behaved dataset nobody’s had a thorough look at yet, and perform an egregiously thorough and paranoid exploration. My aim is both to showcase Good Science for aspiring Data Analysts, and to give others a chance to criticise my approach: it’s entirely possible that I’m still making some blunders, and if I am then I want to know.

Below is a summary of my methodology, with emphasis on the mistakes I managed to dodge. You can see the full R kernel here.

The dataset

The data was collected by renowned gamedev/educator/all-round-smart-person Nicky Case, in an attempt to decide which of six projects (‘Crowds’, ‘Learn/Teach’, ‘Win/Win’, ‘Nonviolence’, ‘Understand’, and ‘Mindfulness’) they should work on next. They described each possible project, and asked fans to rate each from 1 to 5 stars. The poll, with full descriptions of each project, is still up here, though the project names are indicative enough that you shouldn’t need it.

‘Crowds’ won, handily, and Case built it. At time of writing, they are between major projects; my pretext for performing this analysis is helping them decide which of the runners-up (‘Understand’ and ‘Learn/Teach’) to prioritise.


I began by loading and cleaning the data. For most data analysis jobs, handling irregularities in a dataset can take up as much time as actually analysing it: I picked an unusually tidy dataset so I’d be able to skip to the parts where people make interesting mistakes.

The most I had to deal with was a few responses with missing values, caused by people not filling in every part of the poll. I considered handling this by imputation – deciding to interpret blanks as three-star ratings, say – but eventually decided to limit subjectivity and drop all incomplete responses.

[Normally, I’d look into missing values in more detail. But because I knew the causal mechanism, the missing values were a small enough proportion of the dataset, I knew there weren’t going to be outliers (there’s no way a missing value could be hiding a rating of 1005 stars), and these seemed like the least important respondents anyway (that they couldn’t be bothered completing a six-question survey suggests they probably don’t have strong opinions about this topic), dropping them seemed fair.]

I set a seed, and split the dataset in half using random sampling. I christened one half exploreDF, and the other half verifyDF: my plan was to use the former for Exploratory Data Analysis, and then perform any ‘proper’ statistical tests on the latter.

[A classic error of new Data Analysts performing EDA is to explore their entire dataset, and then test hypotheses on the same data that suggested them: this is an issue so common it has its own Wikipedia page. The heart of statistical testing is finding results which are unlikely in ways which contradict the null hypothesis, and a result you decided to test precisely because you noticed it contradicting the null hypothesis isn’t, by that point, very unlikely to contradict it.

There are workarounds – penalties you can apply in an attempt to retroactively purify tests performed on the data that suggested them – but I’m sceptical of these. You can dutifully multiply your p-values by N to accommodate the fact that you picked one of N possible comparisons to test, but there’s no sensible way to accommodate the fact that you looked at that kind of comparison in the first place.

Tl;dr: don’t do anything isomorphic to training on the testing set.]

[Another very common mistake is to not randomise data splits, or to not set a seed when you do. If, for example, I’d just taken the first 50% of the rows as my exploreDF, that might have introduced bias: what if the dataset were ordered chronologically, and those who responded first had consistently different views to those who responded later? The only way to ensure a fair split is to randomly sample the entire dataset.

As for failure to set seeds, that’s more bad housekeeping and bad habits than an actual mistake, but it’s still worth talking about. When you set a seed, you guarantee that the split can be replicated by other people running your code. Enabling others to replicate results is an integral part of the scientific endeavour, and keeps everyone honest. It also lets you replicate your own results, just in case R goes haywire and wipes your workspace.]
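For concreteness, a seeded 50/50 split looks something like this (a minimal sketch: the data frame and column names here are toy stand-ins, not the real poll export):

```r
# Toy stand-in for the cleaned poll data; the real frame has six project columns.
df <- data.frame(Crowds = c(5, 4, 3, 5, 2, 4), Understand = c(2, 5, 4, 3, 5, 1))

set.seed(2019)                                  # any fixed seed makes the split reproducible
exploreIdx <- sample(nrow(df), nrow(df) %/% 2)  # random half of the row indices
exploreDF  <- df[exploreIdx, ]                  # half for exploratory analysis
verifyDF   <- df[-exploreIdx, ]                 # untouched half for the final tests
```

Anyone rerunning this with the same seed gets exactly the same two halves.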

The range of possible scores is 1–5, but the average score for each column is around 3.8 stars. This suggests the possibility of a large number of respondents who limit themselves to a range of 3–5 stars, plus a minority who do not, and who have a disproportionate impact on the average scores. Whether this is a bug or a feature is subjective, but it’s worth looking into.

To investigate this possibility, I derived two extra features: range (distance between the highest and lowest scores given by each respondent), and positivity (average score given by each respondent). Also, I was curious to see what correlated with positivity. Were people who generally awarded higher ratings more interested in some projects than others?
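Both features are one-liners in R. A sketch, with one respondent per row and a hypothetical subset of the six project columns:

```r
# Toy scores: one respondent per row, one project per column (hypothetical names).
scores <- data.frame(Crowds = c(5, 3), Understand = c(2, 3), WinWin = c(4, 3))

scores$range      <- apply(scores, 1, max) - apply(scores, 1, min)            # spread of each respondent's ratings
scores$positivity <- rowMeans(scores[, c("Crowds", "Understand", "WinWin")])  # their average rating
```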

[I got ‘3.8 stars’ from Case’s summary of the entire dataset: technically this violates data purity, but I would have got that impression about the distribution from the next section anyway.]

I began with univariate analysis: checking how each of the six variables behaved on its own before seeing how they interacted.

They were all pretty much like this

Then, I moved on to multivariate analysis. How much do these scores affect each other?

There’s an R function I really like, called ggcorr, which plots the correlation coefficients of every variable against every other variable, and lets you see what connections shake out.
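The call is short; a sketch (ggcorr lives in the GGally package, and the random columns here stand in for the six project scores in exploreDF):

```r
library(GGally)  # provides ggcorr

# Stand-in columns; in the post this would be the six project scores in exploreDF.
df <- data.frame(a = rnorm(40), b = rnorm(40), c = rnorm(40))
p  <- ggcorr(df, label = TRUE)  # triangle of pairwise correlation coefficients
```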

I love this genre of triangle so much


  • Surprisingly, there aren’t obvious ‘cliques’ of strong mutual correlation, like I’ve gotten used to finding when I use ggcorr. The closest thing to a clique I can see here is the mutual affection between ‘Crowds’, ‘Mindfulness’ and ‘Win/Win’, but it’s not particularly strong. Also, this finding has no theoretical backing, since I can’t think of a good post-hoc explanation that groups these three together but leaves out ‘Nonviolence’.

  • There’s a strong general factor of positivity, as demonstrated by the fact that only two of the fifteen mutual correlations between the six original projects are negative.

  • The two negative correlations are between ‘Understand’ and ‘Win/Win’, and between ‘Crowds’ and ‘Learn/Teach’. The former kind of makes sense: people who like the most abstract and academic project dislike the fluffiest and most humanistic project, and vice-versa. The latter, however, astonishes me: what’s the beef between education and network theory?

  • Five of the six least positive correlations are between ‘Understand’ and the other projects: this project seems to be more of a niche than its competitors.

I ran a quick statistical test – using R’s default cor.test function – on the negative relationship between ‘Crowds’ and ‘Learn/Teach’ in exploreDF, just to check it wasn’t a fluke. The test returned p=0.0913: not statistically significant, but close enough to reassure me.
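The shape of that quick check, with toy vectors standing in for the ‘Crowds’ and ‘Learn/Teach’ columns of exploreDF:

```r
# Toy stand-ins for two negatively-related score columns.
crowds <- c(5, 4, 2, 5, 3, 1, 4, 2)
learn  <- c(2, 3, 5, 1, 4, 5, 2, 4)

res <- cor.test(crowds, learn)  # defaults to a Pearson test -- the choice I later regret
res$estimate                    # sample correlation coefficient
res$p.value                     # two-sided p-value
```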

[I know I’ll catch some flak around here for using p-values instead of Bayesian stats, but p-values are what people are familiar with, so they’re what I use to support any results I might want to share with the unenlightened.]

Then, I added range and positivity to the ggcorr plot.

This is also an excellent triangle

Additional inferences:

  • Range and positivity are confirmed to be strongly negatively correlated, suggesting (but not confirming) that my theory of a low-positivity minority having a disproportionate impact is correct.

  • Every project correlates with the average (that’s to be expected, even without the general factor), but some features correlate much more strongly than others.

Thinking of Anscombe’s Quartet, I plotted the scores against each other on 2D graphs (see below). After eyeballing a few, I was confident that the strengths of these relationships could be approximated with linear correlation.

Nothing about this looks blatantly nonlinear; let’s keep going

What I was most interested in, though, wasn’t the ratings or scores, but the preferences they revealed. I subtracted the positivity from each score, leaving behind each respondent’s preferences.

On general principles, I glanced at the univariate graphs for these newly derived values . . .

Oh hey, they actually look kind of Gaussian now

. . . before re-checking correlations.

What a calming colour palette

High positivity is correlated with liking ‘Win/Win’, ‘Nonviolence’, and especially ‘Mindfulness’; low positivity is correlated with liking ‘Crowds’, ‘Learn/Teach’, and especially ‘Understand’.

These correlations looked important, so I statistically tested them. All p-values were below 0.1, and the p-values associated with ‘Mindfulness’ and ‘Understand’ were below 0.002 and 0.001 respectively.

In other words, the three frontrunners were the three whose proponents had been least positive overall. Did this mean the results of the original poll were primarily the result of lower-scoring respondents having a greater impact?

To investigate further, I divided all of the positivity-adjusted data for each respondent by their range (in other words, a set of responses [2, 2, 3, 3, 3, 5], which had become [-1, −1, 0, 0, 0, 2], now became [-1/3, −1/3, 0, 0, 0, 2/3]), to see what things looked like when low-positivity people weren’t having their outsized effect.
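The worked example above, reproduced in R:

```r
# Positivity-adjust one respondent's scores, then divide by their range.
responses  <- c(2, 2, 3, 3, 3, 5)
adjusted   <- responses - mean(responses)                   # -> [-1, -1, 0, 0, 0, 2]
normalised <- adjusted / (max(responses) - min(responses))  # -> [-1/3, -1/3, 0, 0, 0, 2/3]
```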

As always, I checked the univariate graphs for anything interesting . . .

Yup, those are lines alright

. . . before moving on to comparisons. The averages for my normalised interpretation were:

‘Understand’: 0.028

‘Nonviolence’: 0.023

‘Learn/Teach’: 0.042

‘Win-Win’: −0.076

‘Mindfulness’: −0.080

‘Crowds’: 0.063

For comparison, Case’s averages are:

‘Understand’: 3.88

‘Nonviolence’: 3.80

‘Learn/Teach’: 3.87

‘Win-Win’: 3.61

‘Mindfulness’: 3.61

‘Crowds’: 3.94

This is a reassuring null result. The ordinal rankings are more-or-less preserved: ‘Crowds’ > ‘Understand’ & ‘Learn/Teach’ > ‘Nonviolence’ > ‘Win/Win’ & ‘Mindfulness’. The main difference is that ‘Learn/Teach’, in my adjusted version, does significantly better than ‘Understand’.

(Well, I say ‘significantly’: I tried one-sample t-testing the difference between ‘Learn/Teach’ and ‘Understand’, but the p-value was embarrassingly high. Still, a null result can be worth finding.)

I’d found and tested some interesting phenomena; I felt ready to repeat these tests on the holdout set.

It was at this point that I suddenly realised I’d been a complete idiot.

(Those of you who consider yourselves familiar with statistics and Data Science, but didn’t catch my big mistake on the first read-through, are invited to spend five minutes re-reading and trying to work out what I’m referring to before continuing.)

I’d used R’s standard tests for correlation and group differences, but I’d failed to account for the fact that R’s standard tests assume normally-distributed variation. This was despite the fact that I’d had no reason to assume a Gaussian distribution, and that I’d had visibly not-normally-distributed data staring me in the face every time I created a set of univariate graphs, and that I’d been consciously aiming to do an obnoxiously thorough and well-supported analysis. Let this be a lesson to you about how easy it is to accidentally fudge your numbers while using tools developed for social scientists.

On realising my error, I redid every statistical test in the exploration using nonparametric methods. In place of the t-tests, I used a Wilcoxon test, and in place of the Pearson correlation tests, I used Kendall’s tau-b (the most common method for not-necessarily-normal datasets is a Spearman test, but that apparently has some issues when handling discrete data). Fortunately, the main results turned out more or less the same: the biggest change was to the p-value reported for correlation between ‘Crowds’ and ‘Learn/Teach’, which dropped to half its value and started to look testable.
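Both nonparametric replacements are built into base R; a sketch, again with toy vectors (R warns about exact p-values when there are ties, which is expected with discrete star ratings):

```r
# Toy stand-ins for two discrete, non-Gaussian score columns.
x <- c(5, 4, 2, 5, 3, 1, 4, 2)
y <- c(2, 3, 5, 1, 4, 5, 2, 4)

tau  <- cor.test(x, y, method = "kendall")  # Kendall's tau-b; no normality assumption, ties handled
wilc <- wilcox.test(x, y, paired = TRUE)    # signed-rank replacement for the paired t-test
```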

[Among the many benefits of doing an explore/verify split and saving your final tests for last: any ‘oh crap oh crap I completely used the wrong test here’ moments can’t have data purity implications unless they happen at the very end of the project.]


My main relevant findings in exploreDF were:

  1. Votes for ‘Understand’ correlate negatively with positivity.

  2. Interest in ‘Crowds’ is negatively correlated with interest in ‘Learn/Teach’.

  3. The level of correlation between ‘Crowds’ and ‘Learn/Teach’ is unusually low.

  4. ‘Understand’ is an unusually niche topic.

  5. There’s no major change as a result of normalizing by range.

Appropriate statistical tests for verifyDF are:

  1. After adjusting for positivity, run a one-sided Kendall test on ‘Understand’ vs positivity, looking for negative correlation. (α=0.001)

  2. Without adjusting for positivity, run a one-sided Kendall test on ‘Crowds’ vs ‘Learn/Teach’, looking for negative correlation. (α=0.05)

  3. Recreate the first ggcorr plot: if the correlation between ‘Crowds’ and ‘Learn/Teach’ is the lowest correlation out of the 15 available, consider this confirmed. (this would have a 1/15 probability of happening by chance, so that’s p=α=0.0666)

  4. Recreate the first ggcorr plot: if the 5 correlations between ‘Understand’ and other projects are all among the lowest 7 of the 15 available, consider this confirmed. (this would have a 1/143 chance of happening by chance, so that’s p=α=0.007)

  5. N/A; I don’t need a test to report on a change I didn’t find.
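As a sanity check on those two “by chance” figures: #3 is one specific pair out of 15, and #4 asks how many 7-subsets of the 15 correlations contain all 5 ‘Understand’ pairs.

```r
# Probability the 'Crowds'/'Learn/Teach' pair has the lowest of the 15 correlations:
p3 <- 1 / 15

# Probability all 5 'Understand' correlations land in the lowest 7 of 15:
# of the C(15,7) equally-likely "lowest 7" subsets, C(10,2) contain all 5.
p4 <- choose(10, 2) / choose(15, 7)
```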

[Note that I use a Fisherian approach and call the critical p-values for my final tests in advance. I think this is generally good practice if you can get away with it, but unlike with most of my advice I’m not going to be a hardass here: if you prefer to plug your results into statistical formulae and report the p-values they spit out, you are valid, and so are your conclusions.]

And the results?

[Fun fact: I wrote the entire post up to this point before actually running the tests.]

  1. Test passes.

  2. Test fails.

  3. Test fails.

  4. Test passes.

[I’m making a point of publishing my negative results alongside positive ones, to avoid publication bias. Knowing what doesn’t work is just as important as knowing what does; do it for the nullz.]

[I realised in retrospect that these tests interfere in ways that aren’t perfectly scientific. In particular: given #1 passing, #4 passing becomes much more likely, and vice versa. The two positive results I got are fine when considered separately, but I should be careful not to treat these two with the weight I’d assign to two independent results with the same p-values.]


My main results are that votes for ‘Understand’ negatively correlate with positivity, and that ‘Understand’ has unusually low degrees of correlation with all the other projects.

So what does this actually mean? Remember, my pretext for doing this is helping Case choose between ‘Understand’ and ‘Learn/Teach’, given similar average scores.

Well, based on my best understanding of what Case is trying to achieve . . . I’d say that ‘Learn/Teach’ is probably the better option. ‘Understand’ is the preferred project of people who seemed to show the least enthusiasm for Case’s projects in general, and whose preferences were least shared by other respondents. If we’re taking a utilitarian approach – optimising for the greatest satisfaction of the greatest number – it makes sense to prioritise ‘Learn/Teach’.

However, there are many ways to interpret this finding. I called the average score for a given respondent their ‘positivity’, but that was just to avoid it being confused with the average for a given project. Giving lower scores on average could be explained by respondents taking a more bluntly honest approach to questionnaires, or being better at predicting what they wouldn’t enjoy, or any number of other causes. I can run the numbers, but I can’t peer into people’s souls.

Also, even if the pro-‘Understand’ subset were less enthusiastic on average about Case’s work, that wouldn’t imply a specific course of action. “The desires of the many outweigh those of the few.” is a sensible reaction, but “This subset of my fans are the least interested in my work, so they’re the ones I have to focus on keeping.” would also be a completely valid response, as would “This looks like an undersupplied niche, so I could probably get a cult following out of filling it.”

Lessons Learned

In the spirit of ‘telling you what I told you’, here’s a quick summary of every important idea I used in this analysis:

  • Don’t test hypotheses on the data that suggested them. One way to avoid this is to split your dataset at the start of an exploration, and get your final results by testing on the unexplored part.

  • Splits should be done via random sampling. Random sampling should have a seed set beforehand.

  • Take a quick look at univariate graphs before trying to do anything clever.

  • Results which have some kind of theoretical backing are more likely to replicate than results which don’t.

  • If you don’t have good reasons to think your variables are normally distributed, don’t use t-tests or Pearson correlation tests. These tests have nonparametric equivalents: use those instead.

  • Your inferences have limits. Know them, and state them explicitly.

I’ll also throw in a few that I didn’t have cause to use in this analysis:

  • If you’re analysing a dataset with a single response variable (i.e. a specific factor you’re trying to use the other factors to predict), it’s probably worth looking at every possible bivariate plot which contains it, so you can pick up on any nonlinear relations.

  • Some of the most useful inferences are achieved through multivariate analysis, discovering facts like “wine acidity predicts wine quality if and only if price per 100ml is above this level”. I didn’t try this here because the dataset has too few rows, and so any apparent three-factor connection would probably be spurious (too many possible hypotheses, too little data to distinguish between them). Also, I’m lazy.

  • If you’re performing an analysis as part of a job application, create a model, even if it isn’t necessary. Wasted motion is in general a bad thing, but the purpose of these tasks is to give you an opportunity to show off, and one of the things your future boss most wants to know is whether you can use basic Machine Learning techniques. Speaking of which: use Machine Learning, even if the generating distribution is obvious enough that you can derive an optimal solution visually.