Simpson’s paradox and the tyranny of strata

Simpson’s paradox is an example of how the same data can tell different stories. Most people think of this as an odd little curiosity, or perhaps a cautionary tale about the correct way to use data.

You shouldn’t see Simpson’s paradox like that. Rather than some little quirk, it’s actually just the simplest case of a deeper and stranger issue. This is less about the “right” way to analyze data and more about limits to what questions data can answer. Simpson’s paradox is actually a bit misleading, because it has a solution, while the deeper issue doesn’t.

This post will illustrate this using no statistics and (basically) no math.

I Zeus

You are a mortal. You live near Olympus with a flock of sheep and goats. Zeus is a jerk and has taken up shooting your animals with lightning bolts.

He doesn’t kill them; it’s just boredom. Transforming into animals to seduce love interests gets old eventually.

Anyway, you wonder: Does Zeus have a preference for shooting sheep or goats? You decide to keep records for a year. You have 25 sheep and 25 goats, so you use a 5x5 grid for each species, with one cell for each animal.

[Figure: sheep v goats 1]

At first glance, it seems like Zeus dislikes goats more than He dislikes sheep. (If you’re worried about the difference being due to random chance, feel free to multiply the number of animals by a million.)

II Colors

Thinking more, it occurs to you that some animals have darker fur than others. You go back to your records and mark each animal accordingly.

[Figure: sheep v goats 2]

You re-do the analysis, splitting into dark and light groups.

[Figure: sheep v goats 3]

Overall, sheep are zapped less often than goats (12/25 < 13/25). But dark sheep are zapped more often than dark goats (7/11 > 10/16) and light sheep are zapped more often than light goats (5/14 > 3/9). This is the usual paradox: The conclusion changes when you switch from analyzing everyone to splitting into subgroups.

How does that reversal happen? It’s simple: For both sheep and goats, dark animals get zapped more often, and there are more dark goats than dark sheep. Dark sheep are zapped slightly more than dark goats, and light sheep slightly more than light goats. But dumping all the animals together changes the conclusion because there are so many more dark goats. That’s all there is to the regular Simpson’s paradox. Group-level differences can be totally different from subgroup differences when the mix of subgroups differs between the groups.
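The arithmetic behind the reversal can be checked directly. The following is a minimal sketch that uses only the counts quoted in the text; the dictionary layout and the `rate` helper are illustrative names, not anything from the original post.

```python
from fractions import Fraction

# (zapped, total) counts taken from the fractions quoted in the text.
counts = {
    ("sheep", "dark"): (7, 11),
    ("sheep", "light"): (5, 14),
    ("goat", "dark"): (10, 16),
    ("goat", "light"): (3, 9),
}

def rate(species, color=None):
    """Zap rate for a species, optionally restricted to one color."""
    cells = [v for (s, c), v in counts.items()
             if s == species and (color is None or c == color)]
    zapped = sum(z for z, _ in cells)
    total = sum(n for _, n in cells)
    return Fraction(zapped, total)

# Compare sheep and goats overall, then within each color group.
for color in (None, "dark", "light"):
    label = color or "all"
    s, g = rate("sheep", color), rate("goat", color)
    more = "sheep" if s > g else "goats"
    print(f"{label:>5}: sheep {float(s):.3f}, goats {float(g):.3f} -> {more} zapped more")
```

Each overall rate is a weighted average of the dark and light rates, and the weights differ by species (11 of 25 sheep are dark, versus 16 of 25 goats), which is what lets the overall comparison point the other way.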

This probably seems like a weird little edge case so far. But let’s continue.

III Stripes

Thinking even more, you notice that many of your (apparently mutant) animals have stripes. You prepare the data again, marking each animal according to stripes, rather than color.

[Figure: sheep v goats 4]

You wonder, naturally, what happens if you analyze these groups.

[Figure: sheep v goats 5]

The results are similar to those with color. Though sheep are zapped less often than goats overall (12/25 < 13/25), plain sheep are zapped more often than plain goats (5/14 > 3/9), and striped sheep are zapped more often than striped goats (7/11 > 10/16).

IV Colors and stripes

Of course, rather than just considering color or stripes, nothing stops you from considering both.

[Figure: sheep v goats 6]

You decide to consider all four subgroups separately.

[Figure: sheep v goats 7]

Now sheep are zapped less often in each subgroup. (1/4 < 2/7, 6/7 < 8/9, etc.)

When you compare everyone, there’s a bias against goats. When you compare by color, there’s a bias against sheep. When you compare by stripes, there’s also a bias against sheep. Yet when you compare by both color and stripes, there’s a bias against goats again.

| Type of animals compared | Who gets zapped more often? |
| --- | --- |
| All | Goats |
| Light | Sheep |
| Dark | Sheep |
| Plain | Sheep |
| Striped | Sheep |
| Dark Plain | Goats |
| Dark Striped | Goats |
| Light Plain | Goats |
| Light Striped | Goats |
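The whole table can be reproduced from one set of eight counts. The text only quotes some of the four-way fractions (1/4, 2/7, 6/7, 8/9), so the sketch below makes two assumptions: which subgroup each quoted fraction belongs to, and the remaining cells, both inferred by subtraction from the color and stripe totals given earlier. Treat the cell values as a reconstruction rather than the author's own figures.

```python
from fractions import Fraction
from itertools import product

# (zapped, total) per (species, color, pattern).
# Reconstructed so the margins match the text: e.g. dark sheep = 1/4 + 6/7 -> 7/11.
cells = {
    ("sheep", "dark", "plain"): (1, 4),
    ("sheep", "dark", "striped"): (6, 7),
    ("sheep", "light", "plain"): (4, 10),    # inferred, not quoted in the text
    ("sheep", "light", "striped"): (1, 4),
    ("goat", "dark", "plain"): (2, 7),
    ("goat", "dark", "striped"): (8, 9),
    ("goat", "light", "plain"): (1, 2),      # inferred, not quoted in the text
    ("goat", "light", "striped"): (2, 7),
}

def rate(species, color=None, pattern=None):
    """Zap rate for a species; None means 'do not split on this attribute'."""
    picked = [v for (s, c, p), v in cells.items()
              if s == species and color in (None, c) and pattern in (None, p)]
    return Fraction(sum(z for z, _ in picked), sum(n for _, n in picked))

def zapped_more(color=None, pattern=None):
    """Which species has the higher zap rate within the chosen subgroup."""
    return "sheep" if rate("sheep", color, pattern) > rate("goat", color, pattern) else "goats"

# Same row order as the table: everyone, by color, by stripes, then by both.
groupings = [(None, None), ("light", None), ("dark", None), (None, "plain"), (None, "striped")]
groupings += list(product(("dark", "light"), ("plain", "striped")))

for color, pattern in groupings:
    label = " ".join(x for x in (color, pattern) if x) or "all"
    print(f"{label:<14} {zapped_more(color, pattern)}")
```

Nothing about the data changes between rows; only the choice of which cells to pool changes, and that alone is enough to flip the answer twice.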

How can this happen?

To answer that, it’s important to realize that anything can happen. There could be basically any biases that reverse (or don’t) in whatever way when you split into subgroups. In the table above, essentially any sequence of goats / sheep is possible in the right-hand column.

But how, you ask? How can this happen? I think this is the wrong question. Instead we should ask if there is anything to prevent this from happening. There are a huge variety of possible datasets, with all sorts of different group averages. Unless there is some special structure forcing things to be “orderly”, essentially arbitrary stuff can happen. There is no special force here.

V Individuals

So far, this all seems like a lesson about the right way to analyze data. In some cases, that’s probably true. Suppose you are surprised to read that Prestige Airways is more often delayed than GreatValue Skybus. Looking closer, you notice that Prestige flies mostly between snowy cities while Skybus mostly flies between warm dry cities. Prestige can easily have a better track record for all individual routes, but a worse track record overall, simply because they fly hard routes more often. In this case, it’s probably correct to say that Prestige is more reliable.
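To make the airline case concrete, here is a small sketch with invented numbers. The flight counts, the two route types, and the 50/50 `common_mix` are all hypothetical; the post gives no figures for either airline. The second function applies each airline's per-route delay rates to a shared route mix, which is one common way to compare like with like.

```python
# Hypothetical (delayed, flights) counts, invented purely for illustration.
flights = {
    "Prestige": {"snowy": (270, 900), "warm": (5, 100)},
    "Skybus": {"snowy": (40, 100), "warm": (90, 900)},
}

def overall_rate(airline):
    """Raw delay rate, pooling every route the airline actually flies."""
    delayed = sum(d for d, _ in flights[airline].values())
    total = sum(n for _, n in flights[airline].values())
    return delayed / total

def standardized_rate(airline, mix):
    """Delay rate if the airline flew the given route mix (weights sum to 1)."""
    return sum(w * flights[airline][route][0] / flights[airline][route][1]
               for route, w in mix.items())

common_mix = {"snowy": 0.5, "warm": 0.5}  # assumed shared route mix
for airline in flights:
    print(airline,
          f"overall={overall_rate(airline):.3f}",
          f"standardized={standardized_rate(airline, common_mix):.3f}")
```

With these made-up counts Prestige looks worse on the raw rate and better once both airlines are held to the same route mix. That fix works here only because "route type" is an agreed-upon way to group flights.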

But in other cases, the lesson should be just the opposite: There is no “right” way to analyze data. Often the real world looks like this:

[Figure: sheep v goats 8]

There’s no clear dividing line between “dark” and “light” animals. Stripes can be dense or sparse, thick or thin, light or dark. There can be many dark spots or few light spots. This list can go on forever. In the real world, individuals often vary in so many ways that there’s no obvious definition of subgroups. In these cases, you don’t beat the paradox. To get answers you have to make arbitrary choices, yet the answers depend on the choices you make.

Arguably this is a philosophical problem as much as a statistical one. We usually think about bias in terms of “groups”. If prospects vary for two “otherwise identical” individuals in two groups, we say there is a bias. This made sense for airlines above: If Prestige was more often on time than GreatValue for each route, it’s fair to say Prestige is more reliable.

But in a world of individuals, this definition of bias breaks down. Suppose Prestige mostly flies in the middle of the day on weekends in winter, while Skybus mostly flies at night during the week in summer. They vary from these patterns, but never enough that they are flying the same route on the same day of the week, at the same time of day, in the same season. If you want to compare, you can group flights by cities or day or time or season, but not all of them. Different groupings (and sub-groupings) can give different results. There simply is no right answer.

This is the endpoint of Simpson’s paradox: Group-level differences often really are misleading. You can try to solve that by accounting for variability within groups. There are lots of complex ways to try to do that which I haven’t talked about, but none of them solve the fundamental problem of what bias means when every example is unique.