# Simpson’s paradox and the tyranny of strata

Simp­son’s para­dox is an ex­am­ple of how the same data can tell differ­ent sto­ries. Most peo­ple think of this as an odd lit­tle cu­ri­os­ity, or per­haps a cau­tion­ary tale about the cor­rect way to use data.

You shouldn’t see Simp­son’s para­dox like that. Rather than some lit­tle quirk, it’s ac­tu­ally just the sim­plest case of a deeper and stranger is­sue. This is less about the “right” way to an­a­lyze data and more about limits to what ques­tions data can an­swer. Simp­son’s para­dox is ac­tu­ally a bit mis­lead­ing, be­cause it has a solu­tion, while the deeper is­sue doesn’t.

This post will illus­trate this us­ing no statis­tics and (ba­si­cally) no math.

# I Zeus

You are a mor­tal. You live near Olym­pus with a flock of sheep and goats. Zeus is a jerk and has taken up shoot­ing your an­i­mals with light­ning bolts.

He doesn’t kill them; it’s just bore­dom. Trans­form­ing into an­i­mals to se­duce love in­ter­ests gets old even­tu­ally.

Any­way, you won­der: Does Zeus have a prefer­ence for shoot­ing sheep or goats? You de­cide to keep records for a year. You have 25 sheep and 25 goats, so you use a 5x5 grid with one cell for each an­i­mal.

At first glance, it seems like Zeus dis­likes goats more than He dis­likes sheep. (If you’re wor­ried about the differ­ence be­ing due to ran­dom chance, feel free to mul­ti­ply the num­ber of an­i­mals by a mil­lion.)

# II Colors

Think­ing more, it oc­curs to you that some an­i­mals have darker fur than oth­ers. You go back to your records and mark each an­i­mal ac­cord­ingly.

You re-do the anal­y­sis, split­ting into dark and light groups.

Over­all, sheep are zapped less of­ten than goats. But dark sheep are zapped more of­ten than dark goats (7⁄11 > 10⁄16) and light sheep are zapped more of­ten than light goats (5⁄14 > 3⁄9). This is the usual para­dox: The con­clu­sion changes when you switch from an­a­lyz­ing ev­ery­one to split­ting into sub­groups.

How does that re­ver­sal hap­pen? It’s sim­ple: For both sheep and goats, dark an­i­mals get zapped more of­ten, and there are more dark goats than dark sheep. Dark sheep are zapped slightly more than dark goats, and similarly for light sheep. But dump­ing all the an­i­mals to­gether changes the con­clu­sion be­cause there are so many more dark goats. That’s all there is to the reg­u­lar Simp­son’s para­dox. Group-level differ­ences can be to­tally differ­ent than sub­group differ­ences when the ra­tio of sub­groups varies.
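The arithmetic behind the reversal is easy to check. Here is a minimal Python sketch that uses only the fractions quoted above; the dictionary layout and the helper name `rate` are illustrative choices, not anything from the original records.

```python
from fractions import Fraction

# (zapped, total) counts taken from the fractions quoted in the text.
counts = {
    "sheep": {"dark": (7, 11), "light": (5, 14)},
    "goats": {"dark": (10, 16), "light": (3, 9)},
}

def rate(pairs):
    """Pooled zap rate over one or more (zapped, total) pairs."""
    pairs = list(pairs)
    zapped = sum(z for z, _ in pairs)
    total = sum(t for _, t in pairs)
    return Fraction(zapped, total)

# Within each color, sheep are zapped more often than goats.
for color in ("dark", "light"):
    sheep = rate([counts["sheep"][color]])
    goats = rate([counts["goats"][color]])
    print(f"{color:5s}  sheep {sheep}  goats {goats}  sheep higher: {sheep > goats}")

# Pooling the two colors flips the comparison.
overall_sheep = rate(counts["sheep"].values())
overall_goats = rate(counts["goats"].values())
print(f"all    sheep {overall_sheep}  goats {overall_goats}  "
      f"sheep higher: {overall_sheep > overall_goats}")
```

Each overall rate is a weighted average of the dark and light rates, and the weights differ: 16 of the 25 goats are dark, but only 11 of the 25 sheep are.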

This prob­a­bly seems like a weird lit­tle edge case so far. But let’s con­tinue.

# III Stripes

Think­ing even more, you no­tice that many of your (ap­par­ently mu­tant) an­i­mals have stripes. You pre­pare the data again, mark­ing each an­i­mal ac­cord­ing to stripes, rather than color.

You won­der, nat­u­rally, what hap­pens if you an­a­lyze these groups.

The re­sults are similar to those with color. Though sheep are zapped less of­ten than goats over­all (12⁄25 < 13⁄25), plain sheep are zapped more of­ten than plain goats (5⁄14 > 3⁄9), and striped sheep are zapped more of­ten than striped goats (7⁄11 > 10⁄16).

# IV Colors and stripes

Of course, rather than just con­sid­er­ing color or stripes, noth­ing stops you from con­sid­er­ing both.

You de­cide to con­sider all four sub­groups sep­a­rately.

Now sheep are zapped less of­ten in each sub­group. (1⁄4 < 2⁄7, 6⁄7 < 8⁄9, etc.)

When you compare everyone, there’s a bias against goats. When you compare by color, there’s a bias against sheep. When you compare by stripes, there’s also a bias against sheep. Yet when you compare by both color and stripes, there’s a bias against goats again.

| Type of animals compared | Who gets zapped more often? |
|---|---|
| All | Goats |
| Light | Sheep |
| Dark | Sheep |
| Plain | Sheep |
| Striped | Sheep |
| Dark Plain | Goats |
| Dark Striped | Goats |
| Light Plain | Goats |
| Light Striped | Goats |
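The right-hand column of the table above can be reproduced from a single set of counts. The sketch below is a reconstruction, not the original records: the four-way counts are an assumption, chosen so that they match every fraction quoted in the text (12⁄25 vs 13⁄25 overall, 7⁄11 vs 10⁄16, 5⁄14 vs 3⁄9, 1⁄4 vs 2⁄7, 6⁄7 vs 8⁄9). Which subgroup each of the last two pairs belongs to is also assumed.

```python
from fractions import Fraction

# (zapped, total) per (color, pattern) cell. ASSUMED counts: reconstructed
# to agree with the fractions quoted in the text, not copied from a figure.
cells = {
    "sheep": {("dark", "plain"): (1, 4), ("dark", "striped"): (6, 7),
              ("light", "plain"): (4, 10), ("light", "striped"): (1, 4)},
    "goats": {("dark", "plain"): (2, 7), ("dark", "striped"): (8, 9),
              ("light", "plain"): (1, 2), ("light", "striped"): (2, 7)},
}

def rate(species, color=None, pattern=None):
    """Pooled zap rate for one species, optionally filtered by color/pattern."""
    zapped = total = 0
    for (c, p), (z, t) in cells[species].items():
        if color in (None, c) and pattern in (None, p):
            zapped += z
            total += t
    return Fraction(zapped, total)

def zapped_more(**filters):
    """Name the species with the higher zap rate under the given filters."""
    return "Sheep" if rate("sheep", **filters) > rate("goats", **filters) else "Goats"

groupings = [
    ("All", {}),
    ("Light", {"color": "light"}),
    ("Dark", {"color": "dark"}),
    ("Plain", {"pattern": "plain"}),
    ("Striped", {"pattern": "striped"}),
    ("Dark Plain", {"color": "dark", "pattern": "plain"}),
    ("Dark Striped", {"color": "dark", "pattern": "striped"}),
    ("Light Plain", {"color": "light", "pattern": "plain"}),
    ("Light Striped", {"color": "light", "pattern": "striped"}),
]

for label, filters in groupings:
    print(f"{label:14s} {zapped_more(**filters)}")
```

The same 50 animals produce all nine rows; only the choice of which cells to pool changes between them.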

How can this hap­pen?

To an­swer that, it’s im­por­tant to re­al­ize that any­thing can hap­pen. There could be ba­si­cally any bi­ases that re­verse (or don’t) in what­ever way when you split into sub­groups. In the table above, es­sen­tially any se­quence of goats /​ sheep is pos­si­ble in the right-hand column.

But how, you ask? How can this happen? I think this is the wrong question. Instead we should ask if there is anything to prevent this from happening. There is a huge variety of possible datasets, with all sorts of different group averages. Unless there is some special structure forcing things to be “orderly”, essentially arbitrary stuff can happen. There is no special force here.

# V Individuals

So far, this all seems like a lesson about the right way to analyze data. In some cases, that’s probably true. Suppose you are surprised to read that Prestige Airways is more often delayed than GreatValue Skybus. Looking closer, you notice that Prestige flies mostly between snowy cities while Skybus mostly flies between warm dry cities. Prestige can easily have a better track record for all individual routes, but a worse track record overall, simply because they fly hard routes more often. In this case, it’s probably correct to say that Prestige is more reliable.

But in other cases, the les­son should be just the op­po­site: There is no “right” way to an­a­lyze data. Often the real world looks like this:

There’s no clear di­vid­ing line be­tween “dark” and “light” an­i­mals. Stripes can be dense or sparse, thick or thin, light or dark. There can be many dark spots or few light spots. This list can go on for­ever. In the real world, in­di­vi­d­u­als of­ten vary in so many ways that there’s no ob­vi­ous defi­ni­tion of sub­groups. In these cases, you don’t beat the para­dox. To get an­swers you have to make ar­bi­trary choices, yet the an­swers de­pend on the choices you make.

Ar­guably this is a philo­soph­i­cal prob­lem as much as a statis­ti­cal one. We usu­ally think about bias in terms of “groups”. If prospects vary for two “oth­er­wise iden­ti­cal” in­di­vi­d­u­als in two groups, we say there is a bias. This made sense for air­lines above: If Pres­tige was more of­ten on time than GreatValue for each route, it’s fair to say Pres­tige is more re­li­able.

But in a world of individuals, this definition of bias breaks down. Suppose Prestige mostly flies in the middle of the day on weekends in winter, while Skybus mostly flies at night during the week in summer. They vary from these patterns, but never enough that they are flying the same route on the same day at the same time at the same time of year. If you want to compare, you can group flights by cities or day or time or season, but not all of them. Different groupings (and sub-groupings) can give different results. There simply is no right answer.

This is the endpoint of Simpson’s paradox: Group-level differences often really are misleading. You can try to solve that by accounting for variability within groups. There are lots of complex ways to try to do that which I haven’t talked about, but none of them solve the fundamental problem of what bias means when every example is unique.

• (If you’re wor­ried about the differ­ence be­ing due to ran­dom chance, feel free to mul­ti­ply the num­ber of an­i­mals by a mil­lion.)

[...]

They vary from these pat­terns, but never enough that they are fly­ing the same route on the same day at the same time at the same time of year. If you want to com­pare, you can group flights by cities or day or time or sea­son, but not all of them.

The problem you’re using Simpson’s paradox to point at does not have this same property of “multiplying the size of the data set by arbitrarily large numbers doesn’t help”. If you can keep taking data until random chance is no issue, then they will end up having sufficient data in all the same subgroups, and you can just read the correct answer off the last million times they both flew in the same city/day/time/season simultaneously.

The problem you’re pointing at fundamentally boils down to not having enough data to force your conclusions, and therefore needing to make a judgement about how important season is compared to time of day so that you can determine when conditioning on more factors will help relevance more than it will hurt by adding more noise.

• So you just need enough data that the number of events involving entities is much greater than the number of parameters.

• You definitely need a number of data points at least exponential in the number of parameters, since the number of “bins” is exponential. (It’s not so simple as to say that exponential is enough because it depends on the distributional overlap. If there are cases where one group never hits a given bin, then even an infinite amount of data doesn’t save you.)

• I see what you’re saying, but I was thinking of a case where there is zero probability of having overlap among all features. While that technically restores the property that you can multiply the dataset by arbitrarily large numbers, it feels a little like “cheating” and I agree with your larger point.

I guess Simpson’s paradox does always have a right answer in “stratify along all features”, it’s just that the amount of data you need increases exponentially in the number of relevant features. So I think that in the real world you can multiply the amount of data by a very, very large number and it won’t solve the problem, even though a large enough number will.

In the real world it’s of­ten also sort of an open ques­tion if the num­ber of “fea­tures” is finite or not.
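The counting argument in the thread above can be made concrete with a rough sketch. It assumes every feature is binary and that records spread perfectly evenly across bins, which is the most optimistic case; the function name and its parameters are hypothetical.

```python
def min_records_for_full_stratification(num_features, per_bin, num_groups=2):
    """Lower bound on the records needed to fill every bin for every group.

    Assumes binary features (so 2 ** num_features bins) and a perfectly
    even spread of records; real data will need more than this.
    """
    num_bins = 2 ** num_features
    return num_groups * num_bins * per_bin

# Each extra feature doubles the requirement.
for k in (2, 10, 20, 40):
    print(k, min_records_for_full_stratification(k, per_bin=100))
```

This is only a floor. If one group never lands in some bin, as in the flight example, no amount of data fills it.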