Excessive EDA Effortposting

Introduction and motivation

Science! It’s important. It’s also difficult, largely because of how easily people fool themselves. I’ve seen plenty of people (my younger self included!) lose the plot as they attempt their first few investigations, letting subtle methodological problems render their entire chain of reasoning worthless.

So I’ve decided to pick a small, well-behaved dataset nobody’s had a thorough look at yet, and perform an egregiously thorough and paranoid exploration. My aim is both to showcase Good Science for aspiring Data Analysts, and to give others a chance to criticise my approach: it’s entirely possible that I’m still making some blunders, and if I am then I want to know.

Below is a summary of my methodology, with emphasis on the mistakes I managed to dodge. You can see the full R kernel here.

The dataset

The data was collected by renowned gamedev/educator/all-round-smart-person Nicky Case, in an attempt to decide which of six projects (‘Crowds’, ‘Learn/Teach’, ‘Win/Win’, ‘Nonviolence’, ‘Understand’, and ‘Mindfulness’) they should work on next. They described each possible project, and asked fans to rate each one from 1 to 5 stars. The poll, with full descriptions of each project, is still up here, though the project names are indicative enough that you shouldn’t need it.

‘Crowds’ won, handily, and Case built it. At time of writing, they are between major projects; my pretext for performing this analysis is helping them decide which of the runners-up (‘Understand’ and ‘Learn/Teach’) to prioritise.

Exploration

I began by loading and cleaning the data. For most data analysis jobs, handling irregularities in a dataset can take up as much time as actually analysing it: I picked an unusually tidy dataset so I’d be able to skip to the parts where people make interesting mistakes.

The most I had to deal with was a few responses with missing values, caused by people not filling in every part of the poll. I considered handling this by imputation – deciding to interpret blanks as three-star ratings, say – but eventually decided to limit subjectivity and drop all incomplete responses.

[Normally, I’d look into missing values in more detail. But because I knew the causal mechanism, because they were a small enough proportion of the dataset, because I knew there weren’t going to be outliers (there’s no way a missing value could be hiding a rating of 1005 stars), and because these seemed like the least important respondents anyway (that they couldn’t be bothered completing a six-question survey suggests they probably don’t have strong opinions about this topic), dropping them seemed fair.]
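For concreteness, the whole loading-and-cleaning step amounts to something like the sketch below. The file name and column layout are my own assumptions, not the actual kernel’s; the only substantive line is the complete.cases() filter.

```r
# Minimal sketch of the loading/cleaning step (file name is hypothetical).
pollDF <- read.csv("nicky-case-poll.csv")

# Drop every respondent who left one or more projects unrated.
pollDF <- pollDF[complete.cases(pollDF), ]
```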

I set a seed, and split the dataset in half using random sampling. I christened one half exploreDF, and the other half verifyDF: my plan was to use the former for Exploratory Data Analysis, and then perform any ‘proper’ statistical tests on the latter.
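In R the split is only a few lines. A sketch, with an arbitrary seed standing in for whatever value the real kernel uses:

```r
set.seed(2019)  # arbitrary value; what matters is that it's fixed and published

# Randomly assign half of the rows to exploration, the rest to verification.
exploreRows <- sample(nrow(pollDF), size = floor(nrow(pollDF) / 2))
exploreDF   <- pollDF[exploreRows, ]
verifyDF    <- pollDF[-exploreRows, ]
```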

[A classic error of new Data Analysts performing EDA is to explore their entire dataset, and then test hypotheses on the same data that suggested them: this is an issue so common it has its own Wikipedia page. The heart of statistical testing is finding results which are unlikely in ways which contradict the null hypothesis, and a result which you decided to test precisely because you noticed it contradicting the null hypothesis isn’t very unlikely to do so.

There are workarounds – penalties you can apply in an attempt to retroactively purify tests performed on the data that suggested them – but I’m sceptical of these. You can dutifully multiply your p-values by N to accommodate the fact that you picked one of N possible comparisons to test, but there’s no sensible way to accommodate the fact that you looked at that kind of comparison in the first place.

Tl;dr: don’t do anything isomorphic to training on the testing set.]

[Another very common mistake is to not randomise data splits, or to not set a seed when you do. If, for example, I’d just taken the first 50% of the rows as my exploreDF, that might have introduced bias: what if the dataset were ordered chronologically, and those who responded first had consistently different views to those who responded later? The only way to ensure a fair split is to randomly sample the entire dataset.

As for failure to set seeds, that’s more bad housekeeping and bad habits than an actual mistake, but it’s still worth talking about. When you set a seed, you guarantee that the split can be replicated by other people running your code. Enabling others to replicate results is an integral part of the scientific endeavour, and keeps everyone honest. It also lets you replicate your results, just in case R goes haywire and wipes your workspace.]

The range of possible scores is 1-5, but the average score for each column is around 3.8 stars. This suggests the possibility of a large number of respondents who limit themselves to the 3-5 star range, plus a minority who do not, and who have a disproportionate impact on the average scores. Whether this is a bug or a feature is subjective, but it’s worth looking into.

To investigate this possibility, I derived two extra features: range (distance between the highest and lowest scores given by each respondent), and positivity (average score given by each respondent). Also, I was curious to see what correlated with positivity. Were people who generally awarded higher ratings more interested in some projects than others?
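Both features are one-liners. A sketch, assuming the six project columns are named as below (the real kernel’s names may differ):

```r
# Hypothetical column names for the six projects.
projects <- c("Crowds", "LearnTeach", "WinWin",
              "Nonviolence", "Understand", "Mindfulness")

# range: gap between a respondent's highest and lowest score.
exploreDF$range <- apply(exploreDF[, projects], 1, function(x) max(x) - min(x))

# positivity: a respondent's average score across all six projects.
exploreDF$positivity <- rowMeans(exploreDF[, projects])
```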

[I got ‘3.8 stars’ from Case’s summary of the entire dataset: technically this violates data purity, but I would have got that impression about the distribution from the next section anyway.]

I began with univariate analysis: checking how each of the six variables behaved on its own before seeing how they interacted.

[Figure: univariate bar charts of the six project scores – they were all pretty much like this.]

Then, I moved on to multivariate analysis. How much do these scores affect each other?

There’s an R function I really like, called ggcorr, which plots the correlation coefficients of every variable against every other variable, and lets you see what connections shake out.
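ggcorr() lives in the GGally package; something along these lines produces the triangle below (the plotting options are my guesses, not necessarily what the kernel uses):

```r
library(GGally)  # provides ggcorr(), built on top of ggplot2

# Pairwise correlations between the six project scores, drawn as a
# lower-triangle heat-map with the coefficients printed on the tiles.
ggcorr(exploreDF[, projects], label = TRUE)
```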

[Figure: ggcorr plot of the six project scores – I love this genre of triangle so much.]

Inferences:

  • Surprisingly, there aren’t obvious ‘cliques’ of strong mutual correlation, like I’ve gotten used to finding when I use ggcorr. The closest thing to a clique I can see here is the mutual affection between ‘Crowds’, ‘Mindfulness’ and ‘Win/Win’, but it’s not particularly strong. Also, this finding has no theoretical backing, since I can’t think of a good post-hoc explanation that groups these three together but leaves out ‘Nonviolence’.

  • There’s a strong general factor of positivity, as demonstrated by the fact that only two of the fifteen mutual correlations between the six original projects are negative.

  • The two negative correlations are between ‘Understand’ and ‘Win/Win’, and between ‘Crowds’ and ‘Learn/Teach’. The former kind of makes sense: people who like the most abstract and academic project dislike the fluffiest and most humanistic project, and vice-versa. The latter, however, astonishes me: what’s the beef between education and network theory?

  • Five of the six least positive correlations are between ‘Understand’ and the other projects: this project seems to be more of a niche than its competitors.

I ran a quick statistical test – using R’s default cor.test function – on the negative relationship between ‘Crowds’ and ‘Learn/Teach’ in exploreDF, just to check it wasn’t a fluke. The test returned p=0.0913: not statistically significant, but close enough to reassure me.
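That check is a single call; a sketch using the same hypothetical column names as before. Note that cor.test() defaults to a two-sided Pearson test, which becomes relevant later.

```r
# Default cor.test(): Pearson correlation, two-sided.
cor.test(exploreDF$Crowds, exploreDF$LearnTeach)
```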

[I know I’ll catch some flak around here for using p-values instead of Bayesian stats, but p-values are what people are familiar with, so they’re what I use to support any results I might want to share with the unenlightened.]

Then, I added range and positivity to the ggcorr plot.

[Figure: ggcorr plot with range and positivity added – this is also an excellent triangle.]

Additional inferences:

  • Range and positivity are confirmed to be strongly negatively correlated, suggesting (but not confirming) that my theory of a low-positivity minority having a disproportionate impact is correct.

  • Every project correlates with positivity, the per-respondent average (that’s to be expected, even without the general factor), but some correlate much more strongly than others.

Thinking of Anscombe’s Quartet, I plotted the scores against each other on 2D graphs (see below). After eyeballing a few, I was confident that the strengths of these relationships could be approximated with linear correlation.
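Base R’s pairs() is enough for this kind of eyeballing; a sketch, with jitter added because 1-5 star ratings stack on top of each other otherwise:

```r
# Scatterplot matrix of every pair of project scores.
# Jittering spreads out the overlapping integer ratings so density is visible.
pairs(exploreDF[, projects],
      panel = function(x, y, ...) points(jitter(x), jitter(y), ...))
```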

[Figure: pairwise scatterplots of the project scores – nothing about this looks blatantly nonlinear; let’s keep going.]

What I was most interested in, though, wasn’t the ratings or scores, but the preferences they revealed. I subtracted the positivity from each score, leaving behind each respondent’s preferences.
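A sketch of that adjustment, using sweep() to subtract each respondent’s positivity from their row of scores (the matrix and variable names are mine, not the kernel’s):

```r
scoreMat <- as.matrix(exploreDF[, projects])

# Subtract each respondent's average score (positivity) from each of their
# ratings, leaving only their relative preferences.
prefMat <- sweep(scoreMat, 1, exploreDF$positivity, "-")
```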

On general principles, I glanced at the univariate graphs for these newly derived values . . .

[Figure: univariate graphs of the positivity-adjusted scores – oh hey, they actually look kind of Gaussian now.]

. . . before re-checking correlations.

[Figure: ggcorr plot of the positivity-adjusted scores – what a calming colour palette.]

High positivity is correlated with liking ‘Win/Win’, ‘Nonviolence’, and especially ‘Mindfulness’; low positivity is correlated with liking ‘Crowds’, ‘Learn/Teach’, and especially ‘Understand’.

These correlations looked important, so I statistically tested them. All p-values were below 0.1, and the p-values associated with ‘Mindfulness’ and ‘Understand’ were below 0.002 and 0.001 respectively.

In other words, the three frontrunners were the three whose proponents had been least positive overall. Did this mean the results of the original poll were primarily the result of lower-scoring respondents having a greater impact?

To investigate further, I divided all of the positivity-adjusted data for each respondent by the range (in other words, a set of responses [2, 2, 3, 3, 3, 5], which had become [-1, −1, 0, 0, 0, 2], now became [-1/3, −1/3, 0, 0, 0, 2/3]), to see what things looked like when low-positivity people weren’t having their outsized effect.
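Continuing the sketch from above, this is the same sweep() trick with division instead of subtraction. One caveat I’m glossing over: anyone who gave every project the same score has a range of zero, which would produce NaNs and need separate handling.

```r
# Scale each respondent's positivity-adjusted scores by their range, e.g.
# [-1, -1, 0, 0, 0, 2] with range 3 becomes [-1/3, -1/3, 0, 0, 0, 2/3].
normMat <- sweep(prefMat, 1, exploreDF$range, "/")
```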

As always, I checked the univariate graphs for anything interesting . . .

[Figure: univariate graphs of the range-normalised scores – yup, those are lines alright.]

. . . before moving on to comparisons. The averages for my normalised interpretation were, with Case’s raw averages alongside for comparison:

  • ‘Crowds’: 0.063 (Case: 3.94)

  • ‘Learn/Teach’: 0.042 (Case: 3.87)

  • ‘Understand’: 0.028 (Case: 3.88)

  • ‘Nonviolence’: 0.023 (Case: 3.80)

  • ‘Win/Win’: −0.076 (Case: 3.61)

  • ‘Mindfulness’: −0.080 (Case: 3.61)

This is a reassuring null result. The ordinal rankings are more-or-less preserved: ‘Crowds’ > ‘Understand’ & ‘Learn/Teach’ > ‘Nonviolence’ > ‘Win/Win’ & ‘Mindfulness’. The main difference is that ‘Learn/Teach’, in my adjusted version, does significantly better than ‘Understand’.

(Well, I say ‘significantly’: I tried one-sample t-testing the difference between ‘Learn/Teach’ and ‘Understand’, but the p-value was embarrassingly high. Still, a null result can be worth finding.)

I’d found and tested some interesting phenomena; I felt ready to repeat these tests on the holdout set.


It was at this point that I suddenly realised I’d been a complete idiot.

(Those of you who consider yourselves familiar with statistics and Data Science, but didn’t catch my big mistake on the first read-through, are invited to spend five minutes re-reading and trying to work out what I’m referring to before continuing.)

I’d used R’s standard tests for correlation and group differences, but I’d failed to account for the fact that R’s standard tests assume normally-distributed variation. This was despite the fact that I’d had no reason to assume a Gaussian distribution, and that I’d had visibly not-normally-distributed data staring me in the face every time I created a set of univariate graphs, and that I’d been consciously aiming to do an obnoxiously thorough and well-supported analysis. Let this be a lesson to you about how easy it is to accidentally fudge your numbers while using tools developed for social scientists.

On realising my error, I redid every statistical test in the exploration using nonparametric methods. In place of the t-tests, I used a Wilcoxon test, and in place of the Pearson correlation tests, I used Kendall’s tau-b (the most common method for not-necessarily-normal datasets is a Spearman test, but that apparently has some issues when handling discrete data). Fortunately, the main results turned out more or less the same: the biggest change was to the p-value reported for the correlation between ‘Crowds’ and ‘Learn/Teach’, which dropped to half its value and started to look testable.
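A sketch of the swapped-in tests, using the same hypothetical names as earlier (expect R to warn about ties and exact p-values with data this discrete):

```r
# Kendall rank-correlation test in place of the default Pearson test.
cor.test(exploreDF$Crowds, exploreDF$LearnTeach, method = "kendall")

# Paired Wilcoxon signed-rank test in place of the one-sample t-test on the
# 'Learn/Teach' minus 'Understand' differences.
wilcox.test(normMat[, "LearnTeach"], normMat[, "Understand"], paired = TRUE)
```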

[Among the many benefits of doing an explore/verify split and saving your final tests for last: any ‘oh crap oh crap I completely used the wrong test here’ moments can’t have data purity implications unless they happen at the very end of the project.]

Testing

My main relevant findings in exploreDF were:

  1. Votes for ‘Understand’ correlate negatively with positivity.

  2. Interest in ‘Crowds’ is negatively correlated with interest in ‘Learn/Teach’.

  3. The level of correlation between ‘Crowds’ and ‘Learn/Teach’ is unusually low.

  4. ‘Understand’ is an unusually niche topic.

  5. There’s no major change as a result of normalising by range.

Appropriate statistical tests for verifyDF are:

  1. After adjusting for positivity, run a one-sided Kendall test on ‘Understand’ vs positivity, looking for negative correlation. (α=0.001)

  2. Without adjusting for positivity, run a one-sided Kendall test on ‘Crowds’ vs ‘Learn/Teach’, looking for negative correlation. (α=0.05)

  3. Recreate the first ggcorr plot: if the correlation between ‘Crowds’ and ‘Learn/Teach’ is the lowest correlation out of the 15 available, consider this confirmed. (this would have a 1/15 probability of happening by chance, so that’s p=α=0.0666)

  4. Recreate the first ggcorr plot: if the 5 correlations between ‘Understand’ and other projects are all among the lowest 7 of the 15 available, consider this confirmed. (this would have a 1/143 chance of happening by chance, so that’s p=α=0.007)

  5. N/A; I don’t need a test to report on a change I didn’t find.

[Note that I use a Fisherian approach and call the critical p-values for my final tests in advance. I think this is generally good practice if you can get away with it, but unlike with most of my advice I’m not going to be a hardass here: if you prefer to plug your results into statistical formulae and report the p-values they spit out, you are valid, and so are your conclusions.]
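For completeness, here’s roughly what the two pre-registered correlation tests look like against the holdout half, re-deriving positivity on verifyDF exactly as during exploration (still using my hypothetical column names). Tests #3 and #4 are just the first ggcorr plot re-run on verifyDF and read off by eye.

```r
# Re-derive the positivity feature and the adjusted scores on the untouched half.
verifyDF$positivity <- rowMeans(verifyDF[, projects])
verifyPrefs <- sweep(as.matrix(verifyDF[, projects]), 1,
                     verifyDF$positivity, "-")

# Test 1: positivity-adjusted 'Understand' vs positivity,
# one-sided, looking for negative correlation (alpha = 0.001).
cor.test(verifyPrefs[, "Understand"], verifyDF$positivity,
         method = "kendall", alternative = "less")

# Test 2: raw 'Crowds' vs 'Learn/Teach',
# one-sided, looking for negative correlation (alpha = 0.05).
cor.test(verifyDF$Crowds, verifyDF$LearnTeach,
         method = "kendall", alternative = "less")
```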

And the results?

[Fun fact: I wrote the entire post up to this point before actually running the tests.]

  1. Test passes.

  2. Test fails.

  3. Test fails.

  4. Test passes.

[I’m making a point of publishing my negative results alongside positive ones, to avoid publication bias. Knowing what doesn’t work is just as important as knowing what does; do it for the nullz.]

[I realised in retrospect that these tests interfere in ways that aren’t perfectly scientific. In particular: given #1 passing, #4 passing becomes much more likely, and vice versa. The two positive results I got are fine when considered separately, but I should be careful not to treat these two with the weight I’d assign to two independent results with the same p-values.]

Interpretation

My main results are that votes for ‘Understand’ negatively correlate with positivity, and that ‘Understand’ has unusually low degrees of correlation with all the other projects.

So what does this actually mean? Remember, my pretext for doing this is helping Case choose between ‘Understand’ and ‘Learn/Teach’, given similar average scores.

Well, based on my best understanding of what Case is trying to achieve . . . I’d say that ‘Learn/Teach’ is probably the better option. ‘Understand’ is the preferred project of people who seemed to show least enthusiasm for Case’s projects in general, and whose preferences were least shared by other respondents. If we’re taking a utilitarian approach – optimising for the greatest satisfaction of the greatest number – it makes sense to prioritise ‘Learn/Teach’.

However, there are many ways to interpret this finding. I called the average score for a given respondent their ‘positivity’, but that was just to avoid it being confused with the average for a given project. Giving lower scores on average could be explained by them taking a more bluntly honest approach to questionnaires, or being better at predicting what they wouldn’t enjoy, or any number of other causes. I can run the numbers, but I can’t peer into people’s souls.

Also, even if the pro-‘Understand’ subset were less enthusiastic on average about Case’s work, that wouldn’t imply a specific course of action. “The desires of the many outweigh those of the few” is a sensible reaction, but “This subset of my fans are the least interested in my work, so they’re the ones I have to focus on keeping” would also be a completely valid response, as would “This looks like an undersupplied niche, so I could probably get a cult following out of filling it.”

Lessons Learned

In the spirit of ‘telling you what I told you’, here’s a quick summary of every important idea I used in this analysis:

  • Don’t test hypotheses on the data that suggested them. One way to avoid this is to split your dataset at the start of an exploration, and get your final results by testing on the unexplored part.

  • Splits should be done via random sampling. Random sampling should have a seed set beforehand.

  • Take a quick look at univariate graphs before trying to do anything clever.

  • Results which have some kind of theoretical backing are more likely to replicate than results which don’t.

  • If you don’t have good reasons to think your variables are normally distributed, don’t use t-tests or Pearson correlation tests. These tests have nonparametric equivalents: use those instead.

  • Your inferences have limits. Know them, and state them explicitly.

I’ll also throw in a few that I didn’t have cause to use in this analysis:

  • If you’re analysing a dataset with a single response variable (i.e. a specific factor you’re trying to use the other factors to predict), it’s probably worth looking at every possible bivariate plot which contains it, so you can pick up on any nonlinear relations.

  • Some of the most useful inferences are achieved through multivariate analysis, discovering facts like “wine acidity predicts wine quality if and only if price per 100ml is above this level”. I didn’t try this here because the dataset has too few rows, and so any apparent three-factor connection would probably be spurious (too many possible hypotheses, too little data to distinguish between them). Also, I’m lazy.

  • If you’re performing an analysis as part of a job application, create a model, even if it isn’t necessary. Wasted motion is in general a bad thing, but the purpose of these tasks is to give you an opportunity to show off, and one of the things your future boss most wants to know is whether you can use basic Machine Learning techniques. Speaking of which: use Machine Learning, even if the generating distribution is obvious enough that you can derive an optimal solution visually.