Why the tails come apart

[I’m unsure how much this rehashes things ‘everyone knows already’ - if old hat, feel free to downvote into oblivion. My other motivation for the cross-post is the hope it might catch the interest of someone with a stronger mathematical background who could make this line of argument more robust]

[Edit 2014/11/14: mainly adjustments and rewording in light of the many helpful comments below (thanks!). I’ve also added a geometric explanation.]

Many outcomes of interest have pretty good predictors. It seems that height correlates with performance in basketball (the average height in the NBA is around 6′7″). Faster serves in tennis improve one’s likelihood of winning. IQ scores are known to predict a slew of outcomes, from income, to chance of being imprisoned, to lifespan.

What’s interesting is what happens to these relationships ‘out on the tail’: extreme outliers on a given predictor are seldom similarly extreme outliers on the outcome it predicts, and vice versa. Although 6′7″ is very tall, it lies within a couple of standard deviations of the median US adult male height: there are many thousands of US men taller than the average NBA player who are not in the NBA. Although elite tennis players have very fast serves, the players who have hit the fastest serves ever recorded are not the very best players of their time. It is harder to look at the IQ case due to test ceilings, but again there seems to be some divergence near the top: the very highest earners tend to be very smart, but their intelligence is not in step with their income (their cognitive ability is around +3 to +4 SD above the mean, yet their wealth is far further out into the tail of its distribution) (1).

The trend seems to be that even when two factors are correlated, their tails diverge: the fastest servers are good tennis players, but not the very best (and the very best players serve fast, but not the very fastest); the very richest tend to be smart, but not the very smartest (and vice versa). Why?

Too much of a good thing?

One candidate explanation would be that more isn’t always better, and the correlations one gets looking at the whole population don’t capture a reversal at the right tail. Maybe being taller is good for basketball up to a point, but being really tall leads to greater costs in terms of things like agility. Maybe, although having a faster serve is better all else being equal, focusing too heavily on one’s serve counterproductively neglects other areas of one’s game. Maybe a high IQ is good for earning money, but a stratospherically high IQ brings an increased risk of productivity-reducing mental illness. Or something along those lines.

I would guess that these sorts of ‘hidden trade-offs’ are common. But the ‘divergence of tails’ seems pretty ubiquitous (the tallest aren’t the heaviest, the smartest parents don’t have the smartest children, the fastest runners aren’t the best footballers, etc.), and it would be weird if there were always a ‘too much of a good thing’ story to be told for all of these associations. I think there is a more general explanation.

The simple graphical explanation

[Inspired by this essay from Grady Towers]

Suppose you make a scatter plot of two correlated variables. Here’s one I grabbed off Google, comparing the speed of a ball out of a baseball pitcher’s hand with its speed crossing the plate:

It is unsurprising to see these are correlated (I’d guess the R-square is > 0.8). But if one looks at the extreme end of the graph, the very fastest balls out of the hand aren’t the very fastest balls crossing the plate, and vice versa. This feature is general. Look at this data (again convenience-sampled from googling ‘scatter plot’):

Or this:

Or this:

Given a correlation, the envelope of the distribution should form some sort of ellipse, narrower as the correlation gets stronger, and more circular as it gets weaker: (2)

The thing is, as one approaches the far corners of this ellipse, we see ‘divergence of the tails’: as the ellipse doesn’t sharpen to a point, there are bulges where the maximum x and y values lie with sub-maximal y and x values respectively:

So this offers an explanation of why divergence at the tails is ubiquitous. Provided the sample size is largish, and the correlation not too tight (the tighter the correlation, the larger the sample size required), one will observe ellipses with these bulging sides of the distribution. (3)
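To see this numerically, here is a minimal simulation sketch (in Python with NumPy; the bivariate normal population and the particular correlations and sample sizes are illustrative assumptions, not taken from any of the data sets above). It checks how often the individual with the maximal x value is also the one with the maximal y value:

```python
# Minimal sketch: how often is the top scorer on X also the top scorer on Y?
# Assumes a bivariate normal population; rho values and sample sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def top_x_is_top_y(rho, n, trials=1000):
    """Fraction of trials in which argmax(x) == argmax(y) for a sample of size n."""
    cov = [[1.0, rho], [rho, 1.0]]
    hits = 0
    for _ in range(trials):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        hits += int(np.argmax(x) == np.argmax(y))
    return hits / trials

for rho in (0.5, 0.8, 0.95):
    for n in (100, 10_000):
        print(f"rho={rho}, n={n:>6}: top-x is also top-y in "
              f"{top_x_is_top_y(rho, n):.0%} of samples")
```

In runs like this, the best-looking point on one axis is also the best on the other only a minority of the time, with the share shrinking as the sample grows and rising as the correlation tightens, matching the ‘largish sample, not-too-tight correlation’ condition above.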

Hence the very best basketball players aren’t the very tallest (and vice versa), the very wealthiest aren’t the very smartest, and so on and so forth for any correlated X and Y. If X and Y are “Estimated effect size” and “Actual effect size”, or “Performance at T” and “Performance at T+n”, then you have a graphical display of the winner’s curse and regression to the mean.

An intuitive explanation of the graphical explanation

It would be nice to have an intuitive handle on why this happens, even if we can be convinced that it happens. Here’s my offer towards an explanation:

The fact that a correlation is less than 1 implies that other things matter to an outcome of interest. Although being tall matters for being good at basketball, strength, agility, and hand-eye coordination matter as well (to name but a few). The same applies to other outcomes where multiple factors play a role: being smart helps in getting rich, but so does being hard-working, being lucky, and so on.

For a toy model, pretend that wealth is wholly explained by two factors: intelligence and conscientiousness. Let’s also say these are equally important to the outcome, independent of one another, and normally distributed. (4) So, ceteris paribus, being more intelligent will make one richer, and the toy model stipulates there aren’t ‘hidden trade-offs’: there’s no negative correlation between intelligence and conscientiousness, even at the extremes. Yet the graphical explanation suggests we should still see divergence of the tails: the very smartest shouldn’t be the very richest.

The intuitive explanation would go like this: start at the extreme tail - people +4SD above the mean in intelligence, say. Although this gives them a massive boost to their wealth, we’d expect them to be average with respect to conscientiousness (we’ve stipulated the two are independent). Further, as this ultra-smart population is small, we’d expect its members to fall close to the average on this other, independent factor: with 10 people at +4SD in intelligence, you wouldn’t expect any of them to be +2SD in conscientiousness.

Move down the tail to less extremely smart people - +3SD, say. These people don’t get such a boost to their wealth from their intelligence, but there should be a lot more of them (if 10 at +4SD, around 500 at +3SD), which means one should expect more variation in conscientiousness: it is much less surprising to find someone +3SD in intelligence and also +2SD in conscientiousness, and in a world where these things are equally important, they would ‘beat’ someone +4SD in intelligence but average in conscientiousness. Although a +4SD intelligence person will likely be wealthier than a given +3SD intelligence person (the mean conscientiousness in both populations is 0SD, so the average wealth of the +4SD intelligence population is correspondingly higher than that of the +3SD population), the wealthiest of the +4SDs will not do as well as the best of the much larger number of +3SDs. The same sort of story emerges when we look at larger numbers of factors, and in cases where the factors contribute unequally to the outcome of interest.
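Here is a rough simulation sketch of that toy model (in Python with NumPy; the equal weighting, the population size, and the variable names are my illustrative assumptions): wealth is the standardized sum of two independent standard normal factors.

```python
# Toy model sketch: wealth = equally weighted sum of two independent factors.
# Population size and the exact weighting are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000                                   # size of the toy population

intelligence = rng.standard_normal(n)           # SD units
conscientiousness = rng.standard_normal(n)      # SD units, independent of intelligence
wealth = (intelligence + conscientiousness) / np.sqrt(2)   # SD units again

richest = np.argmax(wealth)
smartest = np.argmax(intelligence)

print(f"Richest person:  intelligence {intelligence[richest]:+.2f} SD, "
      f"wealth {wealth[richest]:+.2f} SD")
print(f"Smartest person: intelligence {intelligence[smartest]:+.2f} SD, "
      f"wealth {wealth[smartest]:+.2f} SD")
print("Richest person is also the smartest:", bool(richest == smartest))
```

In a typical run the richest individual is a few SD above the mean on both factors but well short of the maximum on either, while the single smartest individual, being roughly average in conscientiousness, falls short of the top of the wealth distribution.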

When looking at a factor known to be predictive of an outcome, the largest outcome values will occur with sub-maximal factor values, as the much larger population at sub-maximal factor values increases the chances of ‘getting lucky’ with the other factors:

So that’s why the tails diverge.

A parallel geometric explanation

There’s also a geometric explanation. The correlation between two sets of data is the cosine of the angle between them when presented as (mean-centered) vectors in N-dimensional space (explanations, derivations, and elaborations here, here, and here). (5) So here’s another intuitive handle for tail divergence:

Grant a factor correlated with an outcome, which we represent as two vectors at an angle theta, with cos(theta) equal to the correlation. ‘Reading off’ the expected outcome given a factor score is just moving along the factor vector and multiplying by cos(theta) to get the distance along the outcome vector. As cos(theta) is never greater than 1, we see regression to the mean. The geometrical analogue to the tails coming apart is that the absolute difference between the length along the factor and the corresponding length along the outcome scales with the length along the factor; the gap between extreme values of a factor and the (less extreme) expected values of the outcome grows linearly as the factor value gets more extreme. For concreteness (and granting normality), a correlation of 0.5 (corresponding to an angle of sixty degrees) means that +4SD (~1/30,000) on a factor will be expected to be ‘merely’ +2SD (~1/40) on the outcome; and a correlation of 0.5 is remarkably strong in the social sciences, even though it accounts for only a quarter of the variance (an R-square of 0.25).(6) The reverse (extreme outliers on the outcome are not expected to be such extreme outliers on a given contributing factor) follows by symmetry.
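As a quick check on those figures, here is a short worked example (in Python with SciPy; not part of the original argument). For standardized bivariate normals the conditional expectation is E[Y | X = x] = r·x, so a +4SD factor score with r = 0.5 regresses to an expected +2SD outcome:

```python
# Worked check of the r = 0.5, +4SD example: expected outcome and tail rarities.
from math import acos, degrees
from scipy.stats import norm

r = 0.5                       # correlation between factor and outcome
x = 4.0                       # factor score, in SD units

theta = degrees(acos(r))      # angle between the factor and outcome vectors
expected_y = r * x            # E[Y | X = x] for standardized bivariate normals

print(f"angle: {theta:.0f} degrees, R-square: {r**2:.2f}")
print(f"factor rarity:    about 1 in {1 / norm.sf(x):,.0f}")
print(f"expected outcome: {expected_y:+.1f} SD, about 1 in {1 / norm.sf(expected_y):,.0f}")
```

The printed tail frequencies correspond to the rough 1/30,000 and 1/40 figures quoted above.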

Endnote: EA relevance

I think this is interesting in and of itself, but it also has relevance to Effective Altruism, given that EA generally focuses on the right tail of various things (What are the most effective charities? What is the best career? etc.). It generally vindicates worries about regression to the mean or the winner’s curse, and suggests that these will be pretty insoluble whenever the populations are large: even if you have really good means of assessing the best charities or the best careers, so that your assessments correlate really strongly with which ones actually are the best, the very best ones you identify are unlikely to be actually the very best, as the tails will diverge.

This probably has limited practical relevance. Although you might expect that one of the ‘not estimated as the very best’ charities is in fact better than your estimated-to-be-best charity, you don’t know which one, and your best bet remains your estimate (in the same way that, at least in the toy model above, you should bet a 6′11″ person is better at basketball than someone who is 6′4″).
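For a sense of what this looks like, here is a small sketch of the winner’s-curse point (in Python with NumPy; the number of charities and the noise level are illustrative assumptions): true quality and estimation error are independent standard normals, and we always ‘fund’ the charity with the highest estimate.

```python
# Sketch of the winner's curse: the chosen option's estimate tends to overshoot
# its true quality, and it is often not the genuinely best option.
# Number of charities, noise level, and trial count are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_charities, trials = 100, 5_000
noise_sd = 1.0                     # estimation error on the same scale as true quality

overshoot, still_best = [], 0
for _ in range(trials):
    quality = rng.standard_normal(n_charities)
    estimate = quality + noise_sd * rng.standard_normal(n_charities)
    pick = np.argmax(estimate)                        # fund the best-looking charity
    overshoot.append(estimate[pick] - quality[pick])  # estimate minus true quality
    still_best += int(pick == np.argmax(quality))

print(f"average overshoot of the chosen charity's estimate: {np.mean(overshoot):+.2f} SD")
print(f"chosen charity is genuinely the best in {still_best / trials:.0%} of trials")
```

In runs like this the chosen charity’s estimate overshoots its true quality by a sizeable margin, and it is the genuinely best charity only a fraction of the time; even so, as noted above, picking by the estimate remains the best available bet.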

There may be spread betting or portfolio scenarios where this factor comes into play: perhaps instead of funding AMF up to the point of diminishing returns where its marginal effectiveness dips below charity #2, we should be willing to spread funds sooner.(7) Mainly, though, it should lead us to be less self-confident.


1. Given income isn’t normally distributed, using SDs might be misleading. But a non-parametric ranking gives a similar picture: if Bill Gates is ~+4SD in intelligence, then despite being the richest man in America, he is ‘merely’ in the smartest tens of thousands. Looking the other way, one might point to the generally modest achievements of people in high-IQ societies, but there are worries about adverse selection.

2. As nshepperd notes below, this depends on something like a multivariate CLT. I’m pretty sure this can be weakened: all that is needed, by the lights of my graphical intuition, is that the envelope be concave. It is also worth clarifying that the ‘envelope’ is only meant to illustrate the shape of the distribution, rather than being some boundary that contains the entire probability density: as suggested by homunq, it is a ‘pdf isobar’ where probability density is higher inside the line than outside it.

3. One needs a large enough sample to ‘fill in’ the elliptical population density envelope, and the tighter the correlation, the larger the sample needed to fill in the sub-maximal bulges. The Old Faithful case is an example where you actually do get a ‘point’, although it is likely an outlier.

4. It’s clear that this model is fairly easy to extend to cases with more than two factors, but it is worth noting that where the factors are positively correlated, one would need to take whatever components of the factors are independent of one another.

5. My intuition is that in Cartesian coordinates the R-square between correlated X and Y is actually also the cosine of the angle between the regression lines of X on Y and Y on X. But I can’t see an obvious derivation, and I’m too lazy to demonstrate it myself. Sorry!

6. Another intuitive dividend is that this makes it clear why you can multiply by the correlation coefficient to move between z-scores of correlated normal variables, which wasn’t straightforwardly obvious to me.

7. I’d intuit, but again can’t demonstrate, that the case for this becomes stronger with highly skewed interventions where almost all the impact is focused in relatively low-probability channels, like averting a very specific existential risk.