# Why the tails come apart

[I’m unsure how much this rehashes things ‘everyone knows already’ - if old hat, feel free to downvote into oblivion. My other motivation for the cross-post is the hope it might catch the interest of someone with a stronger mathematical background who could make this line of argument more robust]

[Edit 2014/11/14: mainly adjustments and rewording in light of the many helpful comments below (thanks!). I’ve also added a geometric explanation.]

Many outcomes of interest have pretty good predictors. It seems that height correlates with performance in basketball (the average height in the NBA is around 6′7″). Faster serves in tennis improve one’s likelihood of winning. IQ scores are known to predict a slew of factors, from income, to chance of being imprisoned, to lifespan.

What’s interesting is what happens to these relationships ‘out on the tail’: extreme outliers on a given predictor are seldom similarly extreme outliers on the outcome it predicts, and vice versa. Although 6′7″ is very tall, it lies within a couple of standard deviations of the median US adult male height—there are many thousands of US men taller than the average NBA player who are not in the NBA. Although elite tennis players have very fast serves, if you look at the players who have hit the fastest serves ever recorded, they aren’t the very best players of their time. It is harder to look at the IQ case due to test ceilings, but again there seems to be some divergence near the top: the very highest earners tend to be very smart, but their intelligence is not in step with their income (their cognitive ability is around +3 to +4 SD above the mean, yet their wealth is much higher than this) (1).

The trend seems to be that even when two factors are correlated, their tails diverge: the fastest servers are good tennis players, but not the very best (and the very best players serve fast, but not the very fastest); the very richest tend to be smart, but not the very smartest (and vice versa). Why?

# Too much of a good thing?

One candidate explanation would be that more isn’t always better, and the correlations one gets looking at the whole population don’t capture a reversal at the right tail. Maybe being taller is good for basketball up to a point, but being really tall leads to greater costs in terms of things like agility. Maybe a faster serve is better all else being equal, but focusing too heavily on one’s serve counterproductively neglects other areas of one’s game. Maybe a high IQ is good for earning money, but a stratospherically high IQ carries an increased risk of productivity-reducing mental illness. Or something along those lines.

I would guess that these sorts of ‘hidden trade-offs’ are common. But the ‘divergence of tails’ seems pretty ubiquitous (the tallest aren’t the heaviest, the smartest parents don’t have the smartest children, the fastest runners aren’t the best footballers, etc.), and it would be weird if there were always a ‘too much of a good thing’ story to be told for all of these associations. I think there is a more general explanation.

# The simple graphical explanation

[Inspired by this essay from Grady Towers]

Suppose you make a scatter plot of two correlated variables. Here’s one I grabbed off Google, comparing the speed of a baseball out of the pitcher’s hand with its speed crossing the plate:

It is unsurprising to see these are correlated (I’d guess the R-square is > 0.8). But if one looks at the extreme end of the graph, the very fastest balls out of the hand aren’t the very fastest balls crossing the plate, and vice versa. This feature is general. Look at this data (again convenience-sampled by googling ‘scatter plot’):

Or this:

Or this:

Given a correlation, the envelope of the distribution should form some sort of ellipse, narrower as the correlation grows stronger, and more circular as it gets weaker: (2)

The thing is, as one approaches the far corners of this ellipse, we see ‘divergence of the tails’: as the ellipse doesn’t sharpen to a point, there are bulges where the maximum x and y values lie with sub-maximal y and x values respectively:

So this offers an explanation of why divergence at the tails is ubiquitous. Provided the sample size is largish, and the correlation not too tight (the tighter the correlation, the larger the sample size required), one will observe ellipses with the bulging sides of the distribution. (3)

Hence the very best basketball players aren’t the very tallest (and vice versa), the very wealthiest not the very smartest, and so on and so forth for any correlated X and Y. If X and Y are “Estimated effect size” and “Actual effect size”, or “Performance at T” and “Performance at T+n”, then you have a graphical display of winner’s curse and regression to the mean.
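This can be checked numerically. Below is a minimal sketch (my own, not from the post; the function name and parameters are illustrative) that samples correlated bivariate normal points and measures how often the point with the largest x also has the largest y:

```python
import random

def tails_agree(n, r, trials=200, seed=0):
    """Fraction of trials in which the max-x point is also the max-y point,
    for n points drawn from a standard bivariate normal with correlation r."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pts = []
        for _ in range(n):
            x = rng.gauss(0, 1)
            # y shares correlation r with x; the rest is independent noise
            y = r * x + (1 - r ** 2) ** 0.5 * rng.gauss(0, 1)
            pts.append((x, y))
        best_x = max(pts, key=lambda p: p[0])
        best_y = max(pts, key=lambda p: p[1])
        hits += best_x is best_y
    return hits / trials
```

With a weak correlation the two maxima almost never coincide; only as r gets very close to 1 do they usually agree, matching the ellipse picture above.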

# An intuitive explanation of the graphical explanation

It would be nice to have an intuitive handle on why this happens, even if we can be convinced that it happens. Here’s my offer towards an explanation:

The fact that a correlation is less than 1 implies that other things matter to an outcome of interest. Although being tall matters for being good at basketball, strength, agility, and hand-eye coordination matter as well (to name but a few). The same applies to other outcomes where multiple factors play a role: being smart helps in getting rich, but so does being hard-working, being lucky, and so on.

For a toy model, pretend that wealth is wholly explained by two factors: intelligence and conscientiousness. Let’s also say these are equally important to the outcome, independent of one another, and normally distributed. (4) So, ceteris paribus, being more intelligent will make one richer, and the toy model stipulates there aren’t ‘hidden trade-offs’: there’s no negative correlation between intelligence and conscientiousness, even at the extremes. Yet the graphical explanation suggests we should still see divergence of the tails: the very smartest shouldn’t be the very richest.

The intuitive explanation would go like this: start at the extreme tail: +4SD above the mean for intelligence, say. Although this gives these people a massive boost to their wealth, we’d expect them to be average with respect to conscientiousness (we’ve stipulated the two are independent). Further, as this ultra-smart population is small, we’d expect its members to fall close to the average on this other independent factor: with 10 people at +4SD, you wouldn’t expect any of them to be +2SD in conscientiousness.

Move down the tail to less extremely smart people: +3SD, say. These people don’t get such a boost to their wealth from their intelligence, but there should be a lot more of them (if 10 at +4SD, around 500 at +3SD); this means one should expect more variation in conscientiousness—it is much less surprising to find someone +3SD in intelligence and also +2SD in conscientiousness, and in a world where these things are equally important, they would ‘beat’ someone +4SD in intelligence but average in conscientiousness. Although a +4SD-intelligence person will likely be better off than a given +3SD-intelligence person (the mean conscientiousness in both populations is 0SD, and so the average wealth of the +4SD-intelligence population is 1SD higher than that of the +3SD-intelligence people), the wealthiest of the +4SDs will not be as wealthy as the best of the much larger number of +3SDs. The same sort of story emerges when we look at larger numbers of factors, and in cases where the factors contribute unequally to the outcome of interest.

When looking at a factor known to be predictive of an outcome, the largest outcome values will occur with sub-maximal factor values, as the larger population increases the chances of ‘getting lucky’ with the other factors:

So that’s why the tails diverge.
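The toy model is easy to simulate directly. The sketch below (mine; the names are illustrative) draws a large population with wealth equal to intelligence plus conscientiousness, both independent standard normals, and compares the smartest person with the richest:

```python
import random

def simulate(n=100_000, seed=1):
    """Return (smartest, richest) as (intelligence, conscientiousness) pairs,
    where wealth is the sum of the two independent standard-normal factors."""
    rng = random.Random(seed)
    people = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
    smartest = max(people, key=lambda p: p[0])
    richest = max(people, key=lambda p: p[0] + p[1])  # wealth = sum of factors
    return smartest, richest
```

Typically the richest person is well above average on both factors but is not the single smartest person, even though there are no hidden trade-offs in the model.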

# A parallel geometric explanation

There’s also a geometric explanation. The correlation coefficient (Pearson’s r) between two sets of data is the same as the cosine of the angle between them when they are mean-centered and presented as vectors in N-dimensional space (explanations, derivations, and elaborations here, here, and here). (5) So here’s another intuitive handle for tail divergence:

Grant a factor correlated with an outcome, which we represent with two vectors at an angle theta whose cosine equals r. ‘Reading off’ the expected outcome given a factor score is just moving along the factor vector and multiplying by cos theta to get the distance along the outcome vector. As cos theta is never greater than 1, we see regression to the mean. The geometrical analogue to the tails coming apart is that the absolute difference between length along the factor and length along the outcome-given-factor scales with the length along the factor; the gap between extreme values of a factor and the less extreme values of the outcome grows linearly as the factor value gets more extreme. For concreteness (and granting normality), an r of 0.5 (corresponding to an angle of sixty degrees) means that +4SD (~1/15000) on a factor will be expected to be ‘merely’ +2SD (~1/40) on the outcome—and an r of 0.5 is remarkably strong in the social sciences, implying the factor accounts for a quarter of the variance. (6) The reverse—that extreme outliers on the outcome are not expected to be so extreme as outliers on a given contributing factor—follows by symmetry.
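The arithmetic here can be stated as a one-liner (a sketch under the bivariate-normality assumption in the text; the function name is illustrative):

```python
import math

def expected_outcome_z(r, factor_z):
    """E[outcome z-score | factor z-score] = r * factor_z for standardized
    bivariate normal variables; since |r| <= 1, this is regression to the mean."""
    return r * factor_z

# An angle of 60 degrees between the vectors corresponds to r = cos(60°) = 0.5
r = math.cos(math.radians(60))
```

So `expected_outcome_z(r, 4.0)` recovers the figure in the text: +4SD on the factor gives an expected +2SD on the outcome.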

Endnote: EA relevance

I think this is interesting in and of itself, but it has relevance to Effective Altruism, given that EA generally focuses on the right tail of various things (What are the most effective charities? What is the best career? etc.). It generally vindicates worries about regression to the mean or winner’s curse, and suggests that these will be pretty insoluble in all cases where the populations are large: even if you have really good means of assessing the best charities or the best careers, so that your assessments correlate really strongly with which ones actually are the best, the very best ones you identify are unlikely to be actually the very best, as the tails will diverge.

This probably has limited practical relevance. Although you might expect that one of the ‘not estimated as the very best’ charities is in fact better than your estimated-to-be-best charity, you don’t know which one, and your best bet remains your estimate (in the same way—at least in the toy model above—you should bet a 6′11″ person is better at basketball than someone who is 6′4″).

There may be spread-betting or portfolio scenarios where this factor comes into play—perhaps instead of funding AMF to diminishing returns until its marginal effectiveness dips below charity #2, we should be willing to spread funds sooner. (7) Mainly, though, it should lead us to be less self-confident.

1. Given income isn’t normally distributed, using SDs might be misleading. But non-parametric ranking gives a similar picture: if Bill Gates is ~+4SD in intelligence, despite being the richest man in America, he is ‘merely’ in the smartest tens of thousands. Looking the other way, one might point to the generally modest achievements of people in high-IQ societies, but there are worries about adverse selection.

2. As nshepperd notes below, this depends on something like the multivariate CLT. I’m pretty sure this can be weakened: all that is needed, by the lights of my graphical intuition, is that the envelope be concave. It is also worth clarifying that the ‘envelope’ is only meant to illustrate the shape of the distribution, rather than being some boundary that contains the entire probability density: as suggested by homunq, it is a ‘pdf isobar’ where probability density is higher inside the line than outside it.

3. One needs a large enough sample to ‘fill in’ the elliptical population density envelope, and the tighter the correlation, the larger the sample needed to fill in the sub-maximal bulges. The Old Faithful case is an example where you do actually get a ‘point’, although it is likely an outlier.

4. It’s clear that this model is fairly easy to extend to >2-factor cases, but it is worth noting that in cases where the factors are positively correlated, one would need to take whatever components of the factors are independent of one another.

5. My intuition is that in cartesian coordinates the R-square between correlated X and Y is actually also the cosine of the angle between the regression lines of X on Y and Y on X. But I can’t see an obvious derivation, and I’m too lazy to demonstrate it myself. Sorry!

6. Another intuitive dividend is that this makes it clear why you can multiply by r to move between z-scores of correlated normal variables, which wasn’t straightforwardly obvious to me.

7. I’d intuit, but again can’t demonstrate, that the case for this becomes stronger with highly skewed interventions where almost all the impact is focused in relatively low-probability channels, like averting a very specific existential risk.

• It’s not just that the tails stop being correlated, it’s that there can be a spurious negative correlation. In any of your scatterplots, you could slice off the top right corner (with a diagonal line running downwards to the right), and what was left above the line would look like a negative correlation. This is sometimes known as Berkson’s paradox.

• There’s also a related problem in that population substructures can give you multiple negatively correlated associations stacked beside each other in a positively correlated way (think of it like several diagonal lines going downwards to the right, parallel to each other), giving an ‘ecological fallacy’ when you switch between levels of analysis.

(A real-world case of this is religiosity and health. Internationally, countries which are less religious tend to be healthier, but often within first-world countries, religion confers a survival benefit.)

• Another example I’ve heard is SAT scores. At any given school, the math and verbal scores are negatively correlated, because schools tend to select people who have around the same total score. But overall, math and verbal scores are positively correlated.

• Looks like you can get this if you cut the corner off in a box shape too, which may be more surprising.

• IMO this should be in main

• The upvoters have spoken. Moving to Main and promoting.

• Comments:

1. The idea that IQ predicts income, life expectancy, criminal justice record, etc. depends on what you mean by ‘predicts’ (e.g. conjunction fallacy). I and many others suggest these are correlations, and many argue instead that things like income (of parents), social environment, etc. predict IQ, crime, health, etc. (of children, via a kind of Markov process). (Also, if you look at income/IQ correlations, I wouldn’t be surprised if they are quite different for different kinds of income—those who made money via IT or genomics, versus those who made it via Walmart, or sports. One may actually have a mixture distribution which only appears ‘normal’ because of sufficiently large size.)

2. The scatter plots are interesting, and remind me of S J Gould’s (widely criticized) discussion of attempts to define g, a measure of general intelligence, using factor analyses.
I think the general conclusion before the analyses is the right one—there are multiple factors. I would say many of the ‘smartest’ people (as measured by, say, IQ) end up in academic fields in math/science/technology rather than in business with the aim of making money. There are so many factors. Some academics later do go into business, either working in finance or genomics industries, but many don’t. One reason academic economics is criticized is because it follows the pattern of this post—it starts with general observations, comes up with tentative conclusions, and then goes into highly detailed, mathematical analyses which don’t really add much more insight, though it’s an interesting exercise.

• So in other words, it’s not that the strongest can’t also be the tallest (etc.), but that someone getting that lucky twice more or less never happens. And if you need multiple factors to be good at something, getting pretty lucky on several factors is more likely than getting extremely lucky on one and pretty lucky on the rest.

I enjoyed this post—very clear.

• ^^ but not, alas, as clear as your one-paragraph summary! Thanks!

• Should the first “pretty” there be “very”, or am I misunderstanding the point?

• To put it more simply, there’s no causal reason why the tallest shouldn’t also be the strongest—it’s just unlikely in practice for anyone to be both at the same time, because both traits (super-height and super-strength) are rare and (sufficiently) independent.

• One angle for thinking about why the tails come apart (which seems worth highlighting even more than it was highlighted in the OP) is that the farther out you go in the tail on some variable, the smaller the set of people you’re dealing with.

Which is better, the best basketball team that you can put together from people born in Pennsylvania or the best basketball team that you can put together from people born in Delaware? Probably the Pennsylvania team, since there are about 13x as many people in that state so you get to draw from a larger pool. If there were no other relevant differences between the states then you’d expect 13 of the best 14 players to be Pennsylvanians, and probably the two neighboring states are similar enough that Delaware can’t overcome that population gap.

Now, imagine you’re picking the best 10 basketball players from the 1,000 tallest basketball-aged Americans (20-34 year-olds), and you’re putting together another group consisting of the best 10 basketball players from the next 100,000 tallest basketball-aged Americans. Which is a better group of basketball players? In this case it’s not obvious—getting to pick from a pool of 100x as many people is an obvious advantage, but that height advantage could matter a lot too. That’s the tails coming apart—the very tallest don’t necessarily give you the very best basketball players, because “the very tallest” is a much smaller set than the “also really tall but not quite as tall”.

(I ran some numbers and estimate that the two teams are pretty similar in basketball ability. Which is a remarkable sign of how important height is for basketball—one pool has about a 4 inch height advantage on average, the other pool has 100x as many people, and those factors roughly balance out. If you want the example to more definitively show the tails coming apart, you can expand the larger pool by another factor of 30x and then they’ll clearly be better.)

Similarly, who has higher arm strength: the one person in our sample who has the highest grip strength, or the most arm-strong person out of the next ten people who rank 2-11 in grip strength? Grip strength is closely related to arm strength, but you get to pick the best from a 10x larger pool if you give up a little bit of grip strength. In the graph in the OP, the person who was 6th (or maybe 5th) in grip strength had the highest arm strength, so getting to pick from a pool of 10 was more important. (The average arm strength of the people ranked 2-11 in grip strength was lower than the arm strength of the #1 gripper, but we get to pick out the strongest arm of the ten rather than averaging them.)

So: the tails come apart because most of the people aren’t way out on the tail. And you usually won’t find the very best person at something if you’re looking in a tiny pool, even if that’s a pretty well selected pool.

Thrasymachus’s intuitive explanation covered this—having a smaller pool to pick from hurts because there are other variables that matter, and the smaller the pool the less you get to select for people who do well on those other variables. But his explanation highlighted the “other variables matter” part of this more than the pool-size part of it, and both of these points of emphasis seem helpful for getting an intuitive grasp of the statistics in these types of situations, so I figured I’d add this comment.
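The pool-size trade-off in this comment can be sketched with a small simulation (mine, not the commenter’s actual calculation; the pool sizes and the size of the per-person advantage are illustrative, not his estimates):

```python
import random

def best_of_pool(n, mean, rng):
    """Best performer from a pool of n people with ability ~ Normal(mean, 1)."""
    return max(rng.gauss(mean, 1) for _ in range(n))

def small_pool_win_rate(advantage, small=10, big=1000, trials=500, seed=2):
    """How often the small pool's best beats the big pool's best, when each
    member of the small pool has an ability advantage (in SD units)."""
    rng = random.Random(seed)
    wins = sum(
        best_of_pool(small, advantage, rng) > best_of_pool(big, 0.0, rng)
        for _ in range(trials)
    )
    return wins / trials
```

With no advantage the big pool nearly always wins; a large enough per-person advantage tips it back, mirroring the height-versus-pool-size balance described above.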

• This is all correct so far as I can tell. Yay! (Posting because of the don’t-only-post-criticism discipline.)

• I ran some simulations in Python, and (if I did this correctly), it seems that if r > 0.95, you should expect the most extreme data-point of one variable to be the same in the other variable over 50% of the time (even more if sample size <= 100)

http://nbviewer.jupyter.org/github/ricardoV94/stats/blob/master/correlation_simulations.ipynb

• Just as markets are anti-inductive, it turns out that markets reverse the “tails come apart” phenomenon found elsewhere. When times are “ordinary”, performance in different sectors is largely uncorrelated, but when things go to shit, they go to shit all together, a phenomenon termed “tail dependence”.

• Interesting: is there a story as to why that is the case? One guess that springs to mind is that market performance in sectors is always correlated, but you don’t see it in well-functioning markets due to range-restriction/tails-come-apart reasons, whereas you do see it when things go badly wrong, as that reveals more of the range.

• market performance in sectors is always correlated, but you don’t see it

The problem is the word “always”. If I interpret it to mean “over all possible time scales” then the claim is basically false; if I interpret it to mean “over the longest time scales” then the claim is true, but trivially so, given that sector performances are sometimes correlated.

We won’t get to an explanation by just thinking about probability measures on stochastic processes. What’s needed here is a causal graph. The basic causal graph has the financial sector internally highly connected, with the vast majority of the connections between lenders/investors and debtors/investees passing through it. That, I think, is sufficient to explain the stylized fact in the grandparent (although of course financial researchers can and do find more to say).

• Great article overall. Regression to the mean is a key fact of statistics, and far too few people incorporate it into their intuition.

But there’s a key misunderstanding in the second-to-last graph (the one with the drawn-in blue and red “outcome” and “factor”). The black line, indicating a correlation of 1, corresponds to nothing in reality. The true correlation is the line from the vertical tangent point at the right (marked) to the vertical tangent point at the left (unmarked). If causality indeed runs from “factor” (height) to “outcome” (skill), that’s how much extra skill an extra helping of height will give you. Thus, the diagonal red line should follow this direction, not be parallel to the 45-degree black line. If you draw this line, you’ll notice that each point on it has equal vertical distance to the top and bottom of the elliptical “envelope” (which is, of course, not a true envelope for all the probability mass, just an indication that probability density is higher for any point inside than any point outside).

Things are a little more complex if the correlation is due to a mutual cause, “reverse” causation (from “outcome” to “factor”), or if “factor” is imperfectly measured. In that case, the line connecting the vertical tangents may not correspond to anything in reality, though it’s still what you should follow to get the “right” (minimum expected squared error) answer.

This may seem to be a nitpick, but to me, this kind of precision is key to getting your intuition right.

• Thanks for this important spot—I don’t think it is a nitpick at all. I’m switching jobs at the moment, but I’ll revise the post (and diagrams) in light of this. It might be a week though, sorry!

• Bump.

(I realize you’re busy, this is just a friendly reminder.)

Also, I added one clause to my comment above: the bit about “imperfectly measured”, which is of course usually the case in the real world.

• Belatedly updated. Thanks for your helpful comments!

• This looks cool. My biggest caution would be that this effect may be tied to the specific class of data-generating processes you’re looking at.

Your framing seems to be that you look at the world as being filled with entities whose features under any conceivable measurements are distributed as independent multivariate normals. The predictive factor is a feature and so is the outcome. Then using extreme order statistics of the predictive factor to make inferences about the extreme order statistics of the outcome is informative but unreliable, as you illustrated. Playing around in R, reliability seems better for thin-tailed distributions (e.g., uniform) and worse for heavy-tailed distributions (e.g., Cauchy). Fixing the distributions and letting the number of observations vary, I agree with you that the probability of picking exactly the greatest outcome goes to zero. But I’d conjecture that the probability that the observation with the greatest factor is in some fixed percentile of the greatest outcomes will go to one, at least in the thin-tailed case and maybe in the normal case.

• But consider another data-generating process. If you carry out the following little experiment in R

fac <- rcauchy(1000)
out <- fac + rnorm(1000)
plot(rank(fac), rank(out))
rank(out)[which.max(fac)]


it looks like extreme factors are great predictors of extreme outcomes, even though the factors are only unreliable predictors of outcomes overall. I wouldn’t be surprised if the probability of the greatest factor picking the greatest outcome goes to one as the number of observations grows.

Informally (and too evocatively) stated, what seems to be happening is that as long as new observations are expanding the space of factors seen, extreme factors pick out extreme outcomes. When new observations mostly duplicate already-observed factors, all of the duplicates would predict the most extreme outcome, and only one of them can be right.

• Thanks for doing what I should have done and actually running some data!

I ran your code in R. I think what is going on in the Cauchy case is that the variance of fac is way higher than that of the normal noise being added (I think the SD is set to 1 by default, whilst the Cauchy is ranging over some orders of magnitude). If you plot(fac, out), you get a virtually straight line, which might explain the lack of divergence between top-ranked fac and out.

I don’t have any analytic results to offer, but playing with R suggests that in the normal case the probability of the greatest factor score picking out the greatest outcome goes down as N increases—to see this for yourself, replace rcauchy with runif or rnorm, and increase the N to 10000 or 100000. In the normal case, it is still unlikely that max(fac) picks out max(out) with random noise, but this probability seems to be sample-size invariant—the rank of the maximum factor remains in the same sort of percentile as you increase the sample size.

I can intuit why this is the case: in the bivariate normal case, the distribution should be elliptical, and so the limit case with N → infinity will be a steadily reducing density of observations moving out from the ellipse. So as N increases, you are more likely to ‘fill in’ the bulges on the ellipse at the right tail that give you the divergence; if N is smaller, this is less likely. (I find the uniform result more confusing—the ‘N to infinity’ case should be a parallelogram, so you should just be picking out the top right corner, so I’d guess the probability of picking out the max factor might be invariant to sample size… not sure.)

• Another issue is that real-life processes are, generally speaking, not stationary (in the statistical sense)—outside of physics, that is.

When you see an extreme event in reality it might be that the underlying process has heavier tails than you thought it does, or it might be that the whole underlying distribution switched and all your old estimates just went out of the window...

• Good point. When I introduced that toy example with Cauchy factors, it was the easiest way to get factors that, informally, don’t fill in their observed support. Letting the distribution of the factors drift would be a more realistic way to achieve this.

the whole underlying distribution switched and all your old estimates just went out of the window...

I like to hope (and should probably endeavor to ensure) that I don’t find myself in situations like that. A system that evolves generatively over time (in what the joint distribution of factor X and outcome Y looks like) might still be discriminatively stationary (in what the conditional distribution of Y looks like given X). Even if we have to throw out our information about what new X’s will look like, we may be able to keep saying useful things about Y once we see the corresponding new X.

• I like to hope (and should probably endeavor to ensure) that I don’t find myself in situations like that.

It comes with certain territories. For example, any time you see the financial press talk about a six-sigma event you can be pretty sure the underlying distribution ain’t what it used to be :-/

• Upvoted. I really like the explanation.

In the spirit of Don’t Explain Falsehoods, it would be nice to test the ubiquity of this phenomenon by specifying a measure of it (e.g. correlation) on some representative randomly-chosen pairs. But I don’t mean to suggest that you should have done that before posting this.

• I was a little too lazy to knock this up in R. Sorry! I am planning some follow-ups when I’ve levelled up more in mathematics and programming, although my thought would be that quant finance etc. would have a large literature on this, as I’d intuit these sorts of effects are pretty important when picking stocks etc.

• Good post.

If the ellipse is very narrow, things are indeed well-modeled by a linear relationship, and the point with the biggest Y coordinate is likely to also have close to the biggest X coordinate.

If the ellipse is not narrow, that could be for two reasons. Either the underlying truth is indeed linear, but your data is very noisy. Or the underlying truth is not linear, and you should not use a linear model. (Or both, naturally.)

If the underlying truth is linear, but your data is very noisy, then what happens to the X coordinate of points with given Y values is mostly determined by the noise.

If the underlying truth is not linear, why should we expect sensible answers from a linear model?

• If the un­der­ly­ing truth is not lin­ear, why should we ex­pect sen­si­ble an­swers from a lin­ear model?

Be­cause in many fields, lin­ear mod­els (even poor ones) are the best we’re go­ing to get, with more com­plex mod­els los­ing to overfit­ting.

• I don’t fol­low you. Overfit­ting hap­pens when your model has too many pa­ram­e­ters, rel­a­tive to the amount of data you have. It is true that lin­ear mod­els may have few pa­ram­e­ters com­pared to some non-lin­ear mod­els (for ex­am­ple lin­ear re­gres­sion mod­els vs re­gres­sion mod­els with ex­tra in­ter­ac­tion pa­ram­e­ters). But surely, we can have sparsely pa­ram­e­ter­ized non-lin­ear mod­els as well.

All I am say­ing is that if things are sur­pris­ing it is ei­ther due to “noise” (var­i­ance) or “get­ting the truth wrong” (bias). Or both.

I agree that “mod­els we can quickly and eas­ily use while un­der pub­lish-or-per­ish pres­sure” is an im­por­tant class of mod­els in prac­tice :). More­over, lin­ear mod­els are of­ten in this class, while a ton of very in­ter­est­ing non-lin­ear mod­els in stats are not, and thus are rarely used. It is a pity.

• A tech­ni­cal difficulty with say­ing that overfit­ting hap­pens when there are “too many pa­ram­e­ters” is that the pa­ram­e­ters may do ar­bi­trar­ily com­pli­cated things. For ex­am­ple they may en­code C func­tions, in which case a model with a sin­gle (in­finite-pre­ci­sion) real pa­ram­e­ter can fit any­thing very well! Func­tions that are lin­ear in their pa­ram­e­ters and in­puts do not suffer from this prob­lem; the num­ber of pa­ram­e­ters sum­ma­rizes their overfit­ting ca­pac­ity well. The same is not true of some non­lin­ear func­tions.

To avoid con­fu­sion it may be helpful to define overfit­ting more pre­cisely. The gist of any rea­son­able defi­ni­tion of overfit­ting is: If I ran­domly per­turb the de­sired out­puts of my func­tion, how well can I find new pa­ram­e­ters to fit the new out­puts? I can’t do a good job of giv­ing more de­tail than that in a short com­ment, but if you feel con­fused about overfit­ting, here’s a good (and fa­mous) ar­ti­cle about fre­quen­tist learn­ing the­ory by Vladimir Vap­nik that may be use­ful:

http://web.mit.edu/6.962/www/www_spring_2001/emin/slt.pdf

• This is about “rea­son­able en­cod­ing” not “lin­ear­ity,” though. That is, lin­ear func­tions of pa­ram­e­ters en­code rea­son­ably, but not all rea­son­able en­cod­ings are lin­ear. We can define a pa­ram­e­ter to be pre­cisely one bit of in­for­ma­tion, and then ask for the min­i­mum of bits needed.

I don’t un­der­stand why peo­ple are so hung up on lin­ear­ity.

• I don’t fol­low you. Overfit­ting hap­pens when your model has too many pa­ram­e­ters, rel­a­tive to the amount of data you have. It is true that lin­ear mod­els may have few pa­ram­e­ters com­pared to some non-lin­ear mod­els (for ex­am­ple lin­ear re­gres­sion mod­els vs re­gres­sion mod­els with ex­tra in­ter­ac­tion pa­ram­e­ters). But surely, we can have sparsely pa­ram­e­ter­ized non-lin­ear mod­els as well.

Sure, tech­ni­cally if Alice fits a small noisy data set as y(x) = a*x+b and Bob fits it as y(x) = c*Ai(d*x) (where Ai is the Airy func­tion) they’ve used the same num­ber of pa­ram­e­ters, but that won’t stop me from rol­ling my eyes at the lat­ter un­less he has a good first-prin­ci­ple rea­son to priv­ilege the hy­poth­e­sis.

• The problem is more practical than theoretical (I don’t have the links to hand, but you can find some in my silos of expertise post). Statisticians do not adjust properly for extra degrees of freedom, so among some category of published models, the linear ones will be best. Also, it seems that linear models are very good for modelling human expertise; we might think we’re complex, but we behave pretty linearly.

• “Statis­ti­ci­ans” is a pretty large set.

I still don’t un­der­stand your origi­nal “be­cause.” I am talk­ing about mod­el­ing the truth, not mod­el­ing what hu­mans do. If the truth is not lin­ear and hu­mans use a lin­ear mod­el­ing al­gorithm, well then they aren’t a very good role model are they?

[ edit: did not down­vote. ]

• Be­cause hu­man flaws creep in in the pro­cess of mod­el­ling as well. Tak­ing non lin­ear re­la­tion­ships into ac­count (un­less there is a causal rea­son to do so) is ask­ing for statis­ti­cal trou­ble un­less you very care­fully ac­count for how many mod­els you have tested and tried (which al­most no­body does).

• Tak­ing non lin­ear re­la­tion­ships into ac­count (un­less there is a causal rea­son to do so) is ask­ing for statis­ti­cal trou­ble un­less you very care­fully ac­count for how many mod­els you have tested and tried (which al­most no­body does).

First, the struc­ture of your model should be driven by the struc­ture you’re ob­serv­ing in your data. If you are ob­serv­ing non­lin­ear­i­ties, you’d bet­ter model non­lin­ear­i­ties.

Se­cond, I don’t buy that go­ing be­yond lin­ear mod­els is ask­ing for statis­ti­cal trou­ble. It just ain’t so. Peo­ple who overfit can (and ac­tu­ally do, all the time) stuff a ton of vari­ables into a lin­ear model and suc­cess­fully overfit this way.

• And the number of terms explodes when you add nonlinearities.

5 in­de­pen­dent vari­ables with quadratic terms give you 21 val­ues to play with (1 con­stant + 5 lin­ear + 15 quadratic); it’s much eas­ier to jus­tify con­cep­tu­ally “lets look at quadratic terms” than “lets add in 15 ex­tra vari­ables” even though the effect on de­grees of free­dom is the same.
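The term count above checks out combinatorially: the number of monomials of total degree at most d in k variables is C(k + d, d). A quick sanity check (standard-library Python; the function name is mine):

```python
from math import comb

def n_poly_terms(n_vars, degree):
    """Number of monomials of total degree <= `degree` in `n_vars`
    variables, including the constant term: C(n_vars + degree, degree)."""
    return comb(n_vars + degree, degree)

# 5 variables, quadratic model: 1 constant + 5 linear + 15 quadratic = 21
print(n_poly_terms(5, 2))  # 21
```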

• And the number of terms explodes when you add nonlinearities

No, they don’t. You con­trol the num­ber of de­grees of free­dom in your mod­els. If you don’t, lin­ear mod­els won’t help you much, and if you do lin­ear­ity does not mat­ter.

5 in­de­pen­dent vari­ables with quadratic terms give you 21 val­ues to play with

I think you’re con­fus­ing quadratic terms and in­ter­ac­tion terms. It also seems that you’re think­ing of lin­ear mod­els solely as lin­ear re­gres­sions. Do you con­sider, e.g. GLMs to be “lin­ear” mod­els? What about trans­for­ma­tions of in­put vari­ables, are they dis­al­lowed in your un­der­stand­ing of lin­ear mod­els?

• I’m talking about practice, not theory. And most of the practical results that I’ve seen are that regression models are full of overfitting if they aren’t linear. Even beyond human error, it seems that in many social science areas the data quality is poor enough that adding non-linearities can be seen, a priori, to be a bad thing to do.

Ex­cept of course if there is a firm rea­son to add a par­tic­u­lar non-lin­ear­ity to the prob­lem.

I’m not fa­mil­iar with the whole spec­trum of mod­els (re­gres­sion mod­els, beta dis­tri­bu­tions, some con­ju­gate prior dis­tri­bu­tions, and some ma­chine learn­ing tech­niques is about all I know), so I can’t con­fi­dently speak about the gen­eral case. But, ex­trap­o­lat­ing from what I’ve seen and known bi­ases and in­cen­tives, I’m quite con­fi­dent in pre­dict­ing that generic mod­els are much more likely to be overfit­ted than to have too few de­grees of free­dom.

• I’m quite con­fi­dent in pre­dict­ing that generic mod­els are much more likely to be overfit­ted than to have too few de­grees of free­dom.

Oh, I agree com­pletely with that. How­ever there are a bunch of forces which make it so start­ing with the pub­li­ca­tion bias. Restrict­ing the al­lowed classes of mod­els isn’t go­ing to fix the prob­lem.

It’s like ob­serv­ing that teenagers overuse makeup and de­cid­ing that a good way to deal with that would be to sell lip­stick only in three col­ors—black, brown, and red. Not only it’s not a solu­tion, it’s not even wrong :-/​

the data qual­ity is poor enough that adding non-lin­ear­i­ties can be seen, a pri­ori, to be a bad thing to do.

Why do you be­lieve that a straight-line fit should be the a pri­ori de­fault in­stead of e.g. a log or a power-law line fit?

• Restrict­ing the al­lowed classes of model isn’t go­ing to fix the prob­lem.

I dis­agree; it would help at the very least. I would re­quire lin­ear mod­els only, un­less a) there is a jus­tifi­ca­tion for non-lin­ear terms or b) there is enough data that the re­sult is still sig­nifi­cant even if we in­serted all the de­grees of free­dom that the de­gree of non-lin­ear­i­ties would al­low.

Why do you be­lieve that a straight-line fit should be the a pri­ori de­fault in­stead of e.g. a log or a power-law line fit?

In most cases I’ve seen in the so­cial sci­ence, the di­rec­tion of the effect is of paramount im­por­tance, the other fac­tor less so. It would prob­a­bly be perfectly fine to re­strict to only lin­ear, only log, or only power-law; it’s the mix­ing of differ­ent ap­proaches that ex­plodes the de­grees of free­dom. And in prac­tice let­ting peo­ple have one or the other just al­lows them to test all three be­fore re­port­ing the best fit. So I’d say pick one class and stick with it.

• there is enough data that the re­sult is still sig­nifi­cant even if we in­serted all the de­grees of free­dom that the de­gree of non-lin­ear­i­ties would al­low.

I think this translates to “Calculate the significance correctly”, which I’m all for, linear models included :-)

Other­wise, I still think you’re con­fused be­tween the model class and the model com­plex­ity (= de­grees of free­dom), but we’ve set out our po­si­tions and it’s fine that we con­tinue to dis­agree.

• I’m quite con­fi­dent in pre­dict­ing that generic mod­els are much more likely to be overfit­ted than to have too few de­grees of free­dom.

It’s easy to reg­u­larize es­ti­ma­tion in a model class that’s too rich for your data. You can’t “un­reg­u­larize” a model class that’s re­stric­tive enough not to con­tain an ad­e­quate ap­prox­i­ma­tion to the truth of what you’re mod­el­ing.

• How do I ac­count for how many mod­els I’ve tested? No, re­ally, I don’t know what that’d even be called in the statis­tics liter­a­ture, and it seems like if a gen­eral tech­nique for do­ing this were known the big data peo­ple would be all over it.

• What we’re doing at the FHI is treating it like a machine learning problem: splitting the data into a training and a testing set, checking as much as we want on the training set, formulating the hypotheses, then testing them on the testing set.
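A minimal sketch of that workflow (standard-library Python; the function name is mine, and the point is only that the split happens once, before any model-hunting):

```python
import random

def split_once(rows, test_frac=0.5, seed=0):
    """Partition the data a single time, up front. All exploratory
    model-hunting happens on the training half; the held-out half is
    used exactly once, to test the hypotheses formulated on the other."""
    rng = random.Random(seed)
    shuffled = rows[:]          # don't mutate the caller's list
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

train, test = split_once(list(range(100)))
print(len(train), len(test))  # 50 50
```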

• The Bayesian approach with multiple models seems to be exactly what we need, e.g. http://www.stat.washington.edu/raftery/Research/PDF/socmeth1995.pdf

• Another approach seems to be stepwise regression: http://en.wikipedia.org/wiki/Stepwise_regression

• I see a lot of stepwise regression being used by non-statisticians, but I think statisticians themselves consider it something of a joke. If you have more predictors than you can fit coefficients for, and want an understandable linear model, you are better off with something like LASSO.

Edit: Don’t just take my word for it; google found this blog post for me: http://andrewgelman.com/2014/06/02/hate-stepwise-regression/

• I con­cur. Step­wise re­gres­sion is a very crude tech­nique.

I find it use­ful as an ini­tial filter if I have to dig through a LOT of po­ten­tial pre­dic­tors, but you can’t rely on it to pro­duce a de­cent model.

• So it wasn’t as clear with the pre­vi­ous link, but it seems to me that the nth step of this method doesn’t con­di­tion on the fact that the last n-1 steps failed.

• If you arrayed the full might of statistics/machine learning/knowledge representation in AI/math/signal processing, and took the very best, I am very sure they could beat a linear model for a non-linear ground truth very easily. If so, maybe the right thing to do here is to emulate those people when doing data analysis, and not use the model we know to be wrong.

• Proper Bayesi­anism will triumph! But not in the hands of ev­ery­one.

• Be­cause in many fields, lin­ear mod­els (even poor ones) are the best we’re go­ing to get, with more com­plex mod­els los­ing to overfit­ting.

That’s priv­ileg­ing a par­tic­u­lar class of mod­els just be­cause they his­tor­i­cally were easy to calcu­late.

If you’re con­cerned about overfit­ting you need to be care­ful with how many pa­ram­e­ters are you us­ing, but that does not trans­late into an au­to­matic ad­van­tage of a lin­ear model over, say, a log one.

The ar­ti­cle you linked to goes to pre-(per­sonal)com­puter times when deal­ing with non-lin­ear mod­els was of­ten just im­prac­ti­cal.

• Be­cause in many fields, lin­ear mod­els (even poor ones) are the best we’re go­ing to get, with more com­plex mod­els los­ing to overfit­ting.

I don’t think that’s true. What fields show op­ti­mal perfor­mance from lin­ear mod­els where bet­ter pre­dic­tions can’t be got­ten from other tech­niques like de­ci­sion trees or neu­ral nets or en­sem­bles of tech­niques?

Show­ing that crude lin­ear mod­els, with no form of reg­u­lariza­tion or pri­ors, beats hu­man clini­cal judge­ment, doesn’t show your pre­vi­ous claim.

• Model­ling hu­man clini­cal judge­ment is best done with lin­ear mod­els, for in­stance.

• Best done? Bet­ter than, say, de­ci­sion trees or ex­pert sys­tems or Bayesian be­lief net­works? Ci­ta­tion needed.

• Gold­berg, Lewis R. “Sim­ple mod­els or sim­ple pro­cesses? Some re­search on clini­cal judg­ments.” Amer­i­can Psy­chol­o­gist 23.7 (1968): 483.

• 1968? Se­ri­ously?

• Well there’s Gold­berg, Lewis R. “Five mod­els of clini­cal judg­ment: An em­piri­cal com­par­i­son be­tween lin­ear and non­lin­ear rep­re­sen­ta­tions of the hu­man in­fer­ence pro­cess.” Or­ga­ni­za­tional Be­hav­ior and Hu­man Perfor­mance 6.4 (1971): 458-479.

The main thing is that these old pa­pers seem to still be con­sid­ered valid, see eg Shanteau, James. “How much in­for­ma­tion does an ex­pert use? Is it rele­vant?.” Acta Psy­cholog­ica 81.1 (1992): 75-86.

• (It would be nice if you would link ful­l­text in­stead of pro­vid­ing cita­tions; if you don’t have ac­cess to the ful­l­text, it’s a bad idea to cite it, and if you do, you should provide it for other peo­ple who are try­ing to eval­u­ate your claims and whether the pa­per is rele­vant or wrong.)

I’ve put up the first paper at https://dl.dropboxusercontent.com/u/85192141/1971-goldberg.pdf / https://pdf.yt/d/Ux7RZXbo0n374dUU . I don’t think this is particularly relevant: it only shows that 2 very specific equations (pg4, #3 & #4) did not outperform the linear model on a particular dataset. Too bad for Einhorn 1971.

Your sec­ond pa­per doesn’t sup­port the claims:

A third pos­si­bil­ity is that in­cor­rect meth­ods were used to mea­sure the amount of in­for­ma­tion in ex­perts’ judg­ments; use of the “cor­rect” mea­sure­ment method might sup­port the In­for­ma­tion-Use Hy­poth­e­sis. In the stud­ies re­ported here, four tech­niques were used to mea­sure in­for­ma­tion use: pro­to­col anal­y­sis, mul­ti­ple re­gres­sion anal­y­sis, anal­y­sis of var­i­ance, and self-rat­ings by judges. De­spite differ­ences in mea­sure­ment meth­ods, com­pa­rable re­sults were re­ported. Other method­olog­i­cal is­sues might be raised, but the stud­ies seem varied enough to rule out any ar­ti­fac­tual ex­pla­na­tion.

Th­ese aren’t very good meth­ods for ex­tract­ing the full mea­sure of in­for­ma­tion.

So to sum­ma­rize: re­al­ity isn’t en­tirely lin­ear, so non­lin­ear meth­ods fre­quently ex­cel with mod­ern de­vel­op­ments to reg­u­larize and avoid overfit­ting (we can see this in the low prevalence of lin­ear meth­ods in de­mand­ing AI tasks like image recog­ni­tion, or more gen­er­ally, com­pe­ti­tions like Kag­gle on all sorts of do­mains); to the ex­tent that hu­mans are good pre­dic­tors and clas­sifiers too of re­al­ity, their pre­dic­tions/​clas­sifi­ca­tions will be bet­ter mimicked by non­lin­ear meth­ods; re­search show­ing the con­trary typ­i­cally does not com­pare very good meth­ods and much more re­cent re­search may do much bet­ter (for ex­am­ple, pa­role/​re­ci­di­vism pre­dic­tions by pa­role boards may be bad and eas­ily im­proved on by lin­ear mod­els, but does that mean al­gorithms can’t do even bet­ter?), and to the ex­tent lin­ear meth­ods suc­ceed, it may re­flect the lack of rele­vant data or in­her­ent ran­dom­ness of re­sults for a par­tic­u­lar cher­ryp­icked task.

To show your origi­nal claim (“in many fields, lin­ear mod­els (even poor ones) are the best we’re go­ing to get, with more com­plex mod­els los­ing to overfit­ting”), I would want to see lin­ear mod­els steadily beat all com­ers, from ran­dom forests to deep neu­ral net­works to en­sem­bles of all of the above, on a wide va­ri­ety of large datasets. I don’t think you can show that.

• I tend to agree with you about mod­els, once overfit­ting is sorted.

to the ex­tent that hu­mans are good pre­dic­tors and clas­sifiers too of re­al­ity, their pre­dic­tions/​clas­sifi­ca­tions will be bet­ter mimicked by non­lin­ear methods

This I’ve still seen no ev­i­dence for.

• I’d say that this is re­gres­sion to the mean. If two vari­ables are cor­re­lated with |r| < 1, then ex­treme val­ues on one vari­able will be as­so­ci­ated with some­what less ex­treme val­ues on the other vari­able. So peo­ple who are +4 SD in height will tend to be less than +4 SD in bas­ket­ball abil­ity, and peo­ple who are +4 SD in bas­ket­ball abil­ity will tend to be less than +4 SD in height.
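This is easy to check by simulation. A sketch (standard-library Python, assuming a bivariate normal with r = 0.7): among the most extreme values of one variable, the other is on average only about r times as extreme.

```python
import random
from statistics import fmean

rng = random.Random(42)
r = 0.7          # assumed correlation between the two traits
n = 200_000

pairs = []
for _ in range(n):
    x = rng.gauss(0, 1)
    # y shares correlation r with x; the rest is independent noise
    y = r * x + (1 - r**2) ** 0.5 * rng.gauss(0, 1)
    pairs.append((x, y))

# condition on the extreme right tail of x
tail = [(x, y) for x, y in pairs if x > 2.5]
mean_x = fmean(x for x, _ in tail)
mean_y = fmean(y for _, y in tail)

# mean_y comes out close to r * mean_x: outliers on x are, on
# average, markedly less extreme on y
print(f"mean x: {mean_x:.2f}, mean y: {mean_y:.2f}")
```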

• In­ter­est­ing read! That makes sense.

One lit­tle side note, though.

So, cer­i­tus paribus,

Did you mean ce­teris paribus?

(Ha, fi­nally a chance for me as a lan­guage geek to con­tribute some­thing to all the math talk. :P )

• Thank you for pointing out that the high-IQ problem is probably a statistical effect rather than a “too much of a good thing” effect. That was very interesting.

Let me at­tempt the prob­lem from a sim­ple math­e­mat­i­cal point of view.

Let basketball-playing ability, Z, be just the sum of height, X, and agility, Y. Both X and Y are Gaussian distributed with mean 0 and variance 1. Assume X and Y are independent.

So, if we know that Z>4, what is the most prob­a­ble com­bi­na­tion of X and Y?

The prob­a­bil­ity of X>2 and Y>2 is: P(X>2)P(Y>2)=5.2e-4

The prob­a­bil­ity of X>3 and Y>1 is: P(X>3)P(Y>1)=2.1e-4

So it is more than twice as likely for both components to be at +2 SD than for one of them to be at +3 SD and the other at +1 SD.

I think it can be shown rigor­ously that the most prob­a­ble com­bi­na­tion is Z/​N for each com­po­nent if there are N in­de­pen­dent iden­ti­cally dis­tributed com­po­nents of an abil­ity.
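The two probabilities above can be reproduced with the standard library’s normal CDF:

```python
from statistics import NormalDist

Phi = NormalDist().cdf          # standard normal CDF
def tail(z):                    # P(X > z)
    return 1 - Phi(z)

p_balanced = tail(2) * tail(2)  # both components at +2 SD
p_lopsided = tail(3) * tail(1)  # one at +3 SD, the other at +1 SD

print(f"{p_balanced:.1e}")      # prints 5.2e-04
print(f"{p_lopsided:.1e}")      # prints 2.1e-04
```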

• Given a cor­re­la­tion, the en­velope of the dis­tri­bu­tion should form some sort of ellipse

That isn’t an ex­pla­na­tion, but a stronger claim. Why should it form an el­lipse?

A model of an independent factor or noise is an explanation of the ellipse, and thus of the main point. But people may find this middle section, with its assertion that we should expect ellipses, a stumbling block. Also, regression to the mean and the tails coming apart are much more general than ellipses, but ellipses are pretty common.

It gen­er­ally vin­di­cates wor­ries about re­gres­sion to the mean

It is re­gres­sion to the mean, as you your­self say el­se­where. I’m not sure what you are try­ing to say here; maybe that peo­ple’s vague wor­ries about re­gres­sion to the mean are us­ing the tech­ni­cal con­cept cor­rectly?

• Why should it form an el­lipse?

Mul­ti­vari­ate CLT per­haps? The pre­con­di­tion seems like it might be a bit less com­mon than the reg­u­lar cen­tral limit the­o­rem, but still plau­si­ble, if you as­sume x and y are cor­re­lated by be­ing af­fected by a third fac­tor, z, which con­trols the terms that sum to­gether to make x and y.

Once you have a mul­ti­vari­ate nor­mal dis­tri­bu­tion, you’re good, since they always have (hy­per-)el­lip­ti­cal en­velopes.
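For completeness, the reason the contours come out elliptical: the multivariate normal density depends on x only through the quadratic form in its exponent, so it is constant exactly where that quadratic form is constant,

```latex
f(\mathbf{x}) \propto \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
\quad\Longrightarrow\quad
f(\mathbf{x}) = \text{const} \iff (\mathbf{x}-\boldsymbol{\mu})^{\mathsf{T}}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu}) = c,
```

and since Σ⁻¹ is positive definite, that level set is an ellipse in two dimensions (an ellipsoid in more).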

• This post has been a core part of how I think about Good­hart’s Law. How­ever, when I went to search for it just now, I couldn’t find it, be­cause I was us­ing Good­hart’s Law as a search term, but it doesn’t ap­pear any­where in the text or in the com­ments.

So, I thought I’d men­tion the con­nec­tion, to make this post eas­ier for my fu­ture self and oth­ers to find. Also, other forms of this in­clude:

Maybe it would be use­ful to map out as many of the forms of Good­hart’s Law as pos­si­ble, Turchin style.

• Also: this is very reminiscent of St. Rev’s old post about economic inequality. http://st-rev.livejournal.com/383957.html

• Isn’t the far sim­pler and more likely sce­nario that you never have just one vari­able ac­count­ing for all of an out­come? If other vari­ables are not perfectly cor­re­lated with the vari­able you are graph­ing you will get noise. Why is it sur­pris­ing that that noise also ex­ists in the most ex­treme points?

EDIT: mi­s­un­der­stood last few para­graphs.

• Statis­ti­cal point: the var­i­ance of fore­cast er­ror for cor­rectly speci­fied sim­ple re­gres­sion prob­lems is equal to:

sigma^2 * (1 + 1/N + (x_0 - x_mean)^2 / sum((x_i - x_mean)^2))

So fore­cast er­ror in­creases as x_o moves away from x_mean, es­pe­cially when the var­i­ance of x is low by com­par­i­son.

Edit: Sub no­ta­tion was ap­par­ently in­dent­ing things. I’m go­ing to take a pic­ture from my stats book tonight. Should be more read­able.

Edit: Here’s a more readable link. http://i.imgur.com/pu8lg0Wh.jpg
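The formula is easy to play with numerically. A small standard-library Python sketch (the design points are made up), showing the forecast variance growing as x₀ moves away from the mean:

```python
from statistics import fmean

def forecast_variance(x_new, xs, sigma2=1.0):
    """Variance of the forecast error at x_new for a correctly specified
    simple linear regression fitted to points with x-values `xs`:
        sigma^2 * (1 + 1/N + (x_new - x_bar)^2 / sum((x_i - x_bar)^2))
    """
    n = len(xs)
    x_bar = fmean(xs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sigma2 * (1 + 1 / n + (x_new - x_bar) ** 2 / sxx)

xs = list(range(-10, 11))   # made-up design points, mean 0
for x0 in (0, 5, 10):       # variance grows with distance from the mean
    print(x0, round(forecast_variance(x0, xs), 3))
```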

• (1) Imgur offers edit­ing ca­pa­bil­ities. (2) LW al­lows images.

• To get the image
$\textit{var}(f) = \sigma ^ 2 \left ( 1+\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum (x_i-\bar{x})^2} \right )$
use the fol­low­ing code in your com­ment:

![](http://latex.codecogs.com/png.latex?%5Ctextit%7Bvar%7D%28f%29%20%3D%20%5Csigma%20%5E%202%20%5Cleft%20%28%201+%5Cfrac%7B1%7D%7BN%7D%20+%20%5Cfrac%7B%28x_0%20-%20%5Cbar%7Bx%7D%29%5E2%7D%7B%5Csum%20%28x_i-%5Cbar%7Bx%7D%29%5E2%7D%20%5Cright%20%29)


See Com­ment for­mat­ting/​Us­ing LaTeX to ren­der math­e­mat­ics on the wiki for more de­tails. I’ve used codecogs ed­i­tor and fixed an is­sue in the URL man­u­ally; there are other op­tions listed on the wiki. The LaTeX code for the codecogs ed­i­tor is this:

\textit{var}(f) = \sigma ^ 2 \left (  1+\frac{1}{N} + \frac{(x_0 - \bar{x})^2}{\sum (x_i-\bar{x})^2} \right )

• Fol­low­ing on your Toy Model con­cept, let’s say the im­por­tant fac­tors in be­ing (for ex­am­ple) a suc­cess­ful en­trepreneur are Per­son­al­ity, In­tel­li­gence, Phys­i­cal Health, and Luck.

If a given per­son has ex­cel­lent (+3SD) in all but one of the cat­e­gories, but only av­er­age or poor in the fi­nal cat­e­gory, they’re prob­a­bly not go­ing to suc­ceed. Poor health, or bad luck, or bad peo­ple skills, or lack of in­tel­li­gence can keep an en­trepreneur at medi­ocrity for their pro­duc­tive ca­reer.

Really any com­pet­i­tive venue can be sub­ject to this anal­y­sis. What are the im­por­tant skills? Does it make sense to treat them as semi-in­de­pen­dent, and semi-mul­ti­plica­tive in ar­riv­ing at the fi­nal score?

• It might give a use­ful heuris­tic in fields where suc­cess is strongly mul­ti­fac­to­rial—if you aren’t at least do­ing well at each sub-fac­tor, don’t bother en­ter­ing. It might not work so well when there’s a case that suc­cess al­most wholly loads on one fac­tor and there might be more ‘thresh­olds’ for oth­ers (e.g. to do the­o­ret­i­cal physics, you ba­si­cally need to be ex­tremely clever, but also suffi­ciently men­tally healthy and able to com­mu­ni­cate with oth­ers).

I’m in­ter­ested in the dis­tri­bu­tion of hu­man abil­ity into the ex­treme range, and I plan to write more on it. My cur­rent (very ten­ta­tive) model is that the fac­tors are com­monly ad­di­tive, not mul­ti­plica­tive. A proof for this is alas too long for this com­box to con­tain, etc. etc. ;)

• For busi­ness in par­tic­u­lar I think net­work size and effects are the rea­son that the very top end of earn­ers are much more de­viant in earn­ings than in in­tel­lect. The fact that you can cap­ture en­tire billions of dol­lars mar­kets be­cause mod­ern so­ciety al­lows a sin­gle product to be dis­tributed wor­ld­wide will mul­ti­ply the value of the “top” product by a lot more than its qual­ity might jus­tify.

• In­ter­est­ing post. Well thought out, with an origi­nal an­gle.

In the di­rec­tion of con­struc­tive feed­back, con­sider that the con­cept of sam­ple size—while it seems to help with the heuris­tic ex­pla­na­tion—likely just mud­dies the wa­ter. (We’d still have the effect even if there were plenty of points at all val­ues.)

For example, suppose there were so many people with extreme height that some of them also had extreme agility (with infinite sample size, we’d even reliably have that the best players were also the tallest). So: some of the tallest people are also the best basketball players. However, as you argued, most of the tallest won’t be the most agile also, so most of the tallest are not the best (contrary to what would be predicted by their height alone).

In con­trast, if av­er­age height cor­re­lates with av­er­age bas­ket­ball abil­ity, the other nec­es­sary con­di­tion for a bas­ket­ball player with av­er­age height to have av­er­age abil­ity is to have av­er­age ag­ility—but this is easy to satisfy. So most peo­ple with av­er­age height fit the pre­dic­tion of av­er­age abil­ity.

Like­wise, the short­est peo­ple aren’t likely to have the low­est ag­ility, so the cor­re­la­tion pre­dic­tion fails at that tail too.

Some of the ‘math’ is that it is easy to be average in all variables (say, (.65)^n where n is the number of variables) but hard to be standard deviations extreme in all variables (say, (.05)^n to be in the top 5 percent of each). Other math can be used to find the theoretic shape for these assumptions (e.g., is it an ellipse?).

• We’d still have the effect even if there were plenty of points at all val­ues.

Are you talk­ing about rel­a­tive sam­ple sizes, or ab­solute? The effect re­quires that as you go from +4sd to +3sd to +2sd, your pop­u­la­tion in­creases suffi­ciently fast. As long as that holds, it doesn’t go away if the to­tal pop­u­la­tion grows. (But that’s be­cause if you get lots of points at +4sd, then you have a smaller num­ber at +5sd. So you don’t have “plenty of points at all val­ues”.)

If you have equal num­bers at +4 and +3 and +2, then most of the +4 still may not be the best, but the best is likely to be +4.

(Warn­ing: I did not ac­tu­ally do the math.)
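For what it’s worth, a quick simulation (standard-library Python, toy numbers) supports that intuition: with equal numbers at +2, +3 and +4 on one component and independent noise on the other, the overall best usually comes from the +4 group.

```python
import random

rng = random.Random(0)
bands = [2.0, 3.0, 4.0]   # equal numbers of people at +2, +3 and +4 SD height
per_band = 100
trials = 2000

wins = {b: 0 for b in bands}
for _ in range(trials):
    best_band, best_score = None, float("-inf")
    for band in bands:
        for _ in range(per_band):
            # ability = height + independent agility
            score = band + rng.gauss(0, 1)
            if score > best_score:
                best_band, best_score = band, score
    wins[best_band] += 1

print(wins)   # the +4 band supplies the best player in most trials
```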

• I don’t be­lieve we dis­agree on any­thing. For ex­am­ple, I agree with this:

If you have equal num­bers at +4 and +3 and +2, then most of the +4 still may not be the best, but the best is likely to be +4.

Are you talk­ing about rel­a­tive sam­ple sizes, or ab­solute?

By ‘plenty of points’… I was imag­in­ing that we are tak­ing a finite sam­ple from a the­o­ret­i­cally in­finite pop­u­la­tion. A per­son de­cides on a den­sity that rep­re­sents ‘plenty of points’ and then keeps adding to the sam­ple un­til they have that den­sity up to a cer­tain speci­fied sd.

• Fantastic, I wish I’d had this back when almost everyone in LW/EA circles I met was reading the biography of everyone in the Fortune 400 and trying to spot the common factors. A surprisingly common strategy that’s likely not to work for exactly these reasons.

• My guess is that there are sev­eral vari­ables that are in­deed pos­i­tively cor­re­lated through­out the en­tire range, but are par­tic­u­larly highly cor­re­lated at the very top. Why not? I’m pretty sure we can come up with a list.

• What is in­ter­est­ing is the strength of these re­la­tion­ships ap­pear to de­te­ri­o­rate as you ad­vance far along the right tail.

I read that claim as say­ing that if you sam­ple the 45% to 55% per­centile you will get a stronger cor­re­la­tion than if you sam­ple the 90% to 100% per­centile. Is that what you are ar­gu­ing?

• This was badly written, especially as it invites confusion with range restriction. Sorry! I should just have said “what is interesting is that extreme values of the predictors seldom pick out the most extreme outcomes”.

• If you think you know how to write it better, feel free to edit.

• 45% to 55% of what mea­sure? Part of the point of this is that how you cut your sam­ple will change these things.

If you take it as 45% to 55% of one of the other con­tribut­ing fac­tors, then the cor­re­la­tion should be much stronger!

• Does he ar­gue it for any mea­sure? Height for the bas­ket­ball play­ers?

• I don’t think there’s any­thing spe­cial about the tails.

Take a sheet of paper, and cover up the left 9/10 of the high-correlation graph. That leaves the right tail of the X variable. The remaining datapoints have a much less linear shape.

But: take two sheets of paper, and cover up (say) the left 4/10, and the right 5/10. You get the same shape left over! It has nothing to do with the tail; it just has to do with compressing the range of X values.

The cor­re­la­tion, roughly speak­ing, tells you what per­centage of the vari­a­tion is not caused by ran­dom er­ror. When you com­press the X, you com­press the “real” vari­a­tion, but leave the “er­ror” vari­a­tion as is. So the cor­re­la­tion drops.

• I agree that range re­stric­tion is im­por­tant, and I think a range-re­stric­tion story can be­come ba­si­cally iso­mor­phic to my post (e.g. “even if some­thing is re­ally strongly cor­re­lated, range re­strict­ing to the top 1% of this dis­tri­bu­tion, this cor­re­la­tion is lost in the noise, so it should not sur­prise us that the biggest X isn’t the biggest Y.”)

My post might be slightly bet­ter for peo­ple who tend to vi­su­al­ize things, and I sup­pose it might have a slight ad­van­tage as it might provide an ex­pla­na­tion why you are more likely to see this as the num­ber of ob­ser­va­tions in­creases, which isn’t so ob­vi­ous when talk­ing about a loss of cor­re­la­tion.

• “At the ex­tremes, other fac­tors may weigh more.”

Noth­ing that hasn’t been said be­fore, and in my opinion bet­ter.

I don’t par­tic­u­larly like your “el­lipse” gen­er­al­iza­tion, ei­ther, be­cause it’s just wrong. We already know a perfect cor­re­la­tion would be lin­ear. We already know a lesser cor­re­la­tion is “fat­ter”. Bring­ing el­lipses into the is­sue is just an in­tu­itive, illus­tra­tive fic­tion, which I re­ally don’t ap­pre­ci­ate very much be­cause it’s not par­tic­u­larly in­for­ma­tive and it isn’t sci­en­tifi­cally sound at all.

Please don’t mi­s­un­der­stand me: I do think it is illus­tra­tive, and I do think it has its place. In the newby sec­tion maybe.

Un­der­stand, I am aware that may come across as overly harsh, but it isn’t meant that way. I’m not try­ing to be im­po­lite. It’s just my opinion and I hon­estly don’t know a bet­ter way to ex­press it right now with­out be­ing dishon­est.

• I don’t par­tic­u­larly like your “el­lipse” gen­er­al­iza­tion, ei­ther, be­cause it’s just wrong. … Bring­ing el­lipses into the is­sue is just an in­tu­itive, illus­tra­tive fic­tion, which I re­ally don’t ap­pre­ci­ate very much be­cause it’s not par­tic­u­larly in­for­ma­tive and it isn’t sci­en­tifi­cally sound at all.

I think you’re mis­taken about that. An el­lipse is the shape of a mul­ti­vari­ate nor­mal dis­tri­bu­tion, for ex­am­ple. In fact, there is the en­tire fam­ily of el­lip­ti­cal dis­tri­bu­tions which are, to quote Wikipe­dia, “a broad fam­ily of prob­a­bil­ity dis­tri­bu­tions that gen­er­al­ize the mul­ti­vari­ate nor­mal dis­tri­bu­tion. In­tu­itively, in the sim­plified two and three di­men­sional case, the joint dis­tri­bu­tion forms an el­lipse and an el­lip­soid, re­spec­tively, in iso-den­sity plots.”

a perfect correlation would be linear

That’s a meaningless phrase: correlation is linear by definition. Moreover, it’s a particular measure of dependency which can be misleading.
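To make that last point concrete, here is a small sketch in plain Python (the helper name and data are my own, purely illustrative): Pearson correlation only measures linear dependence, so y = x² over a symmetric range is completely determined by x and yet has correlation exactly zero.

```python
def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]  # perfectly dependent on x...
r = pearson(xs, ys)       # ...yet the linear correlation is zero
```

Here `r` comes out as 0, even though knowing x tells you y exactly — which is the sense in which correlation, as a measure of dependency, can mislead.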

• It’s just my opinion and I honestly don’t know a better way to express it right now without being dishonest.

A better way would be to make the criticisms more concrete. What does “not particularly informative and it isn’t scientifically sound at all” mean? You might, for example, have said something to the effect that the ellipses are contours of the bivariate normal distribution with the same correlation, and pointed out that not all bivariate distributions are normal. But on the other hand the scatterplots presented aren’t so far away from normal that the ellipses are misleading. The ellipses are indeed intuitive and illustrative; but calling them “just fiction” is another way of expressing criticism too vague to respond to. The point masses and frictionless pulleys of school physics problems are also fictions, but none the worse for that.

This is also vague:

Nothing that hasn’t been said before, and in my opinion better.

(Where, and what did they say? We cannot know what better resources you know of unless you tell us.)

And this:

I do think it is illustrative, and I do think it has its place. In the newbie section maybe.

There is no “newbie section” on LessWrong.

Besides, you’re talking there about something you previously called “just wrong”. First it’s “just wrong”, then it’s “not particularly informative”, then it’s “illustrative”, then “it has its place in the newbie section”. It reminds me of the old adage about the stages of truth, with the entire sequence here compressed into a single comment.

• A better way would be to make the criticisms more concrete.

What isn’t “concrete” about it? I think the whole article is an exercise in stating the obvious, to those who have had a basic education in statistics. Stricter correlations tend to be more linear. A broader spectrum of data points is pretty much by definition “fatter”. I don’t see how this is actually very instructive. And to be honest, I don’t see how I could be much more specific.

Where, and what did they say? We cannot know what better resources you know of unless you tell us.

You mean you’ve never had a statistics class? Honestly? I’m not trying to be snide, just asking.

Extreme data points are often called “outliers” for a reason. Since (again, almost, but not quite, by definition; it depends on circumstances) they do not generally show as strong a correlation, “other factors may weigh more”. This is not a revelation. I don’t disagree with it; I’m simply saying it’s rather elementary logic.

Which brings us back to the main point I was making: I did not feel this was particularly instructive.

Besides, you’re talking there about something you previously called “just wrong”.

Wrong in the sense that I don’t see any actual demonstrated relationship between his ellipses and the data, except for simple, rather intuitive observation. It’s merely an illustrative tool. More specifically:

So this offers an explanation why divergence at the tails is ubiquitous. Providing the sample size is largeish, and the correlation not too tight (the tighter the correlation, the larger the sample size required), one will observe the ellipses with the bulging sides of the distribution (2).

This is an incorrect statement. What he is offering is a way to describe how data at the extreme ends may vary from correlation. Not “why”. There is nothing here establishing causation.
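Whether one reads it as “why” or merely “how”, the quoted effect itself is easy to reproduce by simulation. Here is a sketch in plain Python (the correlation of 0.7, the sample size, and all names are arbitrary choices of mine): among points in the top 1% on x, only a minority are also in the top 1% on y, even though x and y are substantially correlated.

```python
import math
import random

random.seed(0)
RHO = 0.7        # assumed correlation, for illustration only
N = 100_000      # sample size

# Sample a standard bivariate normal: draw x, then y = rho*x + sqrt(1-rho^2)*noise.
pairs = []
for _ in range(N):
    x = random.gauss(0.0, 1.0)
    y = RHO * x + math.sqrt(1.0 - RHO * RHO) * random.gauss(0.0, 1.0)
    pairs.append((x, y))

# 99th-percentile cutoffs on each axis.
xs = sorted(p[0] for p in pairs)
ys = sorted(p[1] for p in pairs)
x_cut = xs[int(0.99 * N)]
y_cut = ys[int(0.99 * N)]

top_x = [p for p in pairs if p[0] > x_cut]
both = sum(1 for p in top_x if p[1] > y_cut)
share = both / len(top_x)  # fraction of x-outliers that are also y-outliers
```

With these settings `share` lands well below one half: most of the extreme outliers on the predictor are not extreme outliers on the outcome, which is the divergence at the tails the article describes.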

If we are to be “less wrong”, then we should endeavor to not make confused comments like that.