# Named Distributions as Artifacts

## On con­fus­ing pri­ors with models

Be­ing Abra­ham de Moivre and be­ing born in the 17th cen­tury must have been a re­ally sad state of af­fairs. For one, you have the bubonic plague thing go­ing on, but even worse for de Moivre, you don’t have com­put­ers and sen­sors for au­to­mated data col­lec­tion.

As someone interested in complex real-world processes in the 17th century, you must collect all of your observations manually, or waste your tiny fortune paying unreliable workers to collect it for you. This is made even more complicated by the whole "dangerous to go outside because of the bubonic plague" thing.

As someone interested in chance and randomness in the 17th century, you've got very few models of "random" to go by. Sampling a uniform random variable is done by literally rolling dice or tossing a coin. Dice which are imperfectly crafted, so as not to be quite uniform. A die (or coin) which limits the possibilities of your random variable to between 2 and 20 outcomes (most likely 12, but I'm not discounting the existence of medieval d20s). You can order custom dice, you can roll multiple dice and combine the results into a number (e.g. 4d10 are enough to sample uniformly from between 1 and 10,000), but it becomes increasingly tedious to generate larger numbers.

Even more hor­ribly, once you te­diously gather all this data, your anal­y­sis of it is limited to pen and pa­per, you are limited to 2 plot­ting di­men­sions, maybe 3 if you are a par­tic­u­larly good artist. Every time you want to vi­su­al­ize 100 sam­ples a differ­ent way you have to care­fully draw a co­or­di­nate sys­tem and te­diously draw 100 points. Every time you want to check if an equa­tion fits your 100 sam­ples, you must calcu­late the equa­tion 100 times… and then you must com­pute the differ­ence be­tween the equa­tion re­sults and your sam­ples, and then man­u­ally use those to com­pute your er­ror func­tion.

I invite you to pick two random dice and throw them until you have a vague experimental proof that the CLT holds, and then think about the fact that this must have been de Moivre's day-to-day life studying proto-probability.

So it’s by no ac­ci­dent, mis­for­tune, or ill-think­ing that de Moivre came up with the nor­mal dis­tri­bu­tion [trade­marked by Gauss] and the cen­tral limit the­o­rem (CLT).

A primitive precursor of the CLT is the idea that when two or more independent random variables are uniform on the same interval, their sum will be approximately described by a normal distribution (aka Gaussian distribution, bell curve), and increasingly well as more variables are summed.
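Today that experimental proof takes milliseconds rather than months of dice-throwing. A minimal sketch in Python (standard library only; the number of summands and samples are arbitrary choices of mine):

```python
import random
import statistics

random.seed(0)

# Sum 12 independent Uniform(0, 1) draws, repeated 100,000 times.
n_vars, n_samples = 12, 100_000
sums = [sum(random.random() for _ in range(n_vars)) for _ in range(n_samples)]

# CLT prediction: the sum is approximately Normal(mean=6, std=1),
# since each uniform has mean 1/2 and variance 1/12.
sample_mean = statistics.fmean(sums)
sample_std = statistics.stdev(sums)
print(sample_mean, sample_std)
```

Plot a histogram of `sums` and it is already hard to tell apart from a bell curve, even with only 12 summands.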

This is really nice because a normal distribution is a relatively simple function of two parameters (mean and variance); it gives de Moivre a very malleable mathematical abstraction.

How­ever, de Moivre’s defi­ni­tion of the CLT was that, when in­de­pen­dent ran­dom vari­ables (which needn’t be uniform) are nor­mal­ized to the same in­ter­val and then summed, this sum will tend to­wards the nor­mal dis­tri­bu­tion. That is to say, some given nor­mal dis­tri­bu­tion will be able to fit their sum fairly well.

The ob­vi­ous is­sue in this defi­ni­tion is the “tend to­wards” and the “ran­dom” and the “more than two”. There’s some re­la­tion be­tween these, in that, the fewer ways there are for some­thing to be “ran­dom” and the more of these vari­ables one has, the “closer” the sum will be to a nor­mal dis­tri­bu­tion.

This can be for­mal­ized to fur­ther nar­row down the types of ran­dom vari­ables from which one can sam­ple in or­der for the CLT to hold. Once the na­ture of the ran­dom gen­er­a­tors is known, the num­ber of ran­dom vari­ables and sam­ples needed to get within a cer­tain er­ror range of the clos­est pos­si­ble nor­mal dis­tri­bu­tion can also be ap­prox­i­mated fairly well.

But that is beside the point; the purpose of the CLT is not to wax lyrical about idealized random processes, it's to be able to study the real world.

De Moivre's father, a proud patriot surgeon, wants to prove the sparkling wines of Champagne have a fortifying effect upon the liver. However, doing autopsies is hard, because people don't like having their family corpses "desecrated" and because there's a plague. Still, the good surgeon gets a sampling of livers from normal Frenchmen and from Champagne-drinking Frenchmen and measures their weight.

Now, nor­mally, he could sum up these weights and com­pare the differ­ence, or look at the small­est and largest weight from each set to figure out the differ­ence be­tween the ex­tremes.

But the samples are very small, and might not be fully representative. Maybe, by chance, the heaviest Champagne drinker liver he saw is 2.5kg, but the heaviest possible Champagne drinker liver is 4kg. Maybe the Champagne drinkers' livers are overall heavier, but there's one extremely heavy liver in the control group.

Well, armed with the Nor­mal Distri­bu­tion and the CLT we say:

Liver weight is prob­a­bly a func­tion of var­i­ous un­der­ly­ing pro­cesses and we can as­sume they are fairly ran­dom. Since we can think of them as sum­ming up to yield the liver’s weight, we can as­sume these weights will fall onto a nor­mal dis­tri­bu­tion.

Thus, we can take a few ob­ser­va­tions about the weights:

Nor­mal French­men: 1.5, 1.7, 2.8, 1.6, 1.8, 1.2, 1.1, 1, 2, 1.3, 1.8, 1.5, 1.3, 0.9, 1, 1.1, 0.9, 1.2

Cham­pagne drink­ing French­men: 1.7, 2.5, 2.5, 1.2, 2.4, 1.9, 2.2, 1.7

And we can compute the mean and the standard deviation and then we can plot the normal distribution of liver weight (champagne drinkers in orange, normal in blue).
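That computation is trivial today; a sketch with Python's standard library, using the weights listed above:

```python
import statistics

normal = [1.5, 1.7, 2.8, 1.6, 1.8, 1.2, 1.1, 1, 2, 1.3, 1.8,
          1.5, 1.3, 0.9, 1, 1.1, 0.9, 1.2]
champagne = [1.7, 2.5, 2.5, 1.2, 2.4, 1.9, 2.2, 1.7]

# Fitting the normal curve for each group is just computing mean and
# (sample) standard deviation.
fits = {}
for name, weights in [("normal", normal), ("champagne", champagne)]:
    mu, sigma = statistics.fmean(weights), statistics.stdev(weights)
    fits[name] = (mu, sigma)
    print(f"{name}: mean={mu:.2f}kg, std={sigma:.2f}kg")
```

The champagne group comes out with a mean around 2.0kg against roughly 1.4kg for the control group.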

So, what’s the max­i­mum weight of a Cham­pagne drinker’s liver? 3.3kg

Nor­mal liver? 2.8

Min­i­mum weight? 0.7kg cham­pagne, 0.05kg nor­mal (maybe con­founded by ba­bies that are un­able to drink Cham­pagne, in­ves­ti­gate fur­ther)

But we can also answer questions like:

For normal people, say 9 out of 10 people, the ones that aren't at the extremes, what's the range of champagne-drinking vs normal liver weights?

Or ques­tions like:

Given liver weight x, how cer­tain are we this per­son drank Cham­pagne?

Maybe ques­tions like:

Given that I suspect my kid has a tiny liver, say 0.7kg, will drinking Champagne give him a 4 in 5 chance of fortifying it to 1.1kg?
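With the two fitted normal curves, the "how certain are we this person drank Champagne?" question becomes a density comparison. A sketch, assuming equal base rates of drinkers and non-drinkers (that prior is my addition, not something given in the text):

```python
import math
import statistics

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The same liver weights as above; we refit the two normals here.
normal = [1.5, 1.7, 2.8, 1.6, 1.8, 1.2, 1.1, 1, 2, 1.3, 1.8,
          1.5, 1.3, 0.9, 1, 1.1, 0.9, 1.2]
champagne = [1.7, 2.5, 2.5, 1.2, 2.4, 1.9, 2.2, 1.7]
mu_n, sd_n = statistics.fmean(normal), statistics.stdev(normal)
mu_c, sd_c = statistics.fmean(champagne), statistics.stdev(champagne)

def p_champagne(x, prior=0.5):
    """P(champagne drinker | liver weight x), assuming equal base rates."""
    pc = normal_pdf(x, mu_c, sd_c) * prior
    pn = normal_pdf(x, mu_n, sd_n) * (1 - prior)
    return pc / (pc + pn)
```

For example, a 2.4kg liver comes out strongly in favor of Champagne, while a 1.0kg liver comes out strongly against.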

So the CLT seems amaz­ing, at least it seems amaz­ing un­til you take a look at the real world, where:

a) We can’t de­ter­mine the in­ter­val for which most pro­cesses will yield val­ues. You can study the stock mar­ket from 2010 to 2019, see that the DOW daily change is be­tween −3% and +4% and fairly ran­dom, as­sume the DOW price change is gen­er­ated by a ran­dom func­tion with val­ues be­tween −3% and 4%… and then March 2020 rolls around and sud­denly that func­tion yields −22% and you’re ****.

b) Most processes don't "accumulate" but rather contribute in weird ways to yield their result. Growth rate of strawberries is f(lumens, water), but if you assume you can approximate f as lumens*a + water*b you'll get some really weird situation where your strawberries die in a very damp cellar or wither away in a desert.

c) Most processes in the real world, especially processes that contribute to the same outcome, are interconnected. Ever wonder why nutritionists are seldom able to make any consistent or impactful claims about an ideal diet? Well, maybe it'll make more sense if you look at this diagram of metabolic pathways. In the real world, most processes that lead to a shared outcome are so interconnected it's a herculean task to even rescue the possibility of thinking causally.

On a sur­face level, my in­tu­itions tell me it’s wrong to make such as­sump­tions, the epistem­i­cally cor­rect way to think about mak­ing in­fer­ences from data ought to go along the lines of:

• You can use the data at your dis­posal, no more

• If you want more data, you can either collect more, or find a causal mechanism, use it to extrapolate, prove that it holds even at edge cases by making truthful predictions, and use that

OR

• If you want more data, you can ex­trap­o­late given that you know of a very similar dataset to back up your ex­trap­o­la­tion. E.g. if we know the dis­tri­bu­tion of English­men liver sizes, we can as­sume the dis­tri­bu­tion of French­men liver sizes might fit that shape. But this ex­trap­o­la­tion is just a bet­ter-than-ran­dom hy­poth­e­sis, to be scru­ti­nized fur­ther based on the pre­dic­tions stem­ming from it.

Granted, your mileage may vary de­pend­ing on your area of study. But no mat­ter how laxly you in­ter­pret the above, there’s a huge leap to be made from those rules to as­sum­ing the CLT holds on real-world pro­cesses.

But, again, this is the 17th cen­tury, there are no com­put­ers, no fancy data col­lec­tion ap­para­tus, there’s a plague go­ing on and to be fair, loads of things, from the height of peo­ple to grape yield of a vine, seem to fit the nor­mal dis­tri­bu­tion quite well.

De Moivre didn’t have even a minute frac­tion of the op­tions we have now, all things con­sid­ered, he was rather in­ge­nious.

But lest we for­get, the nor­mal dis­tri­bu­tion came about as a tool to:

• Re­duce com­pu­ta­tion time, in a world where peo­ple could only use pen and pa­per and their own mind.

• Make up for lack of data, in a world where “data” meant what one could see around him plus a few dozens of books with some ta­bles at­tached to them.

• Allow tech­niques that op­er­ate on func­tions, to be ap­plied to sam­ples (by ab­stract­ing them as a nor­mal dis­tri­bu­tion). In other words, al­low for con­tin­u­ous math to be ap­plied to a dis­crete world.

## Ar­ti­facts that be­came idols

So, the CLT yields a type of dis­tri­bu­tion that we in­tuit is good for mod­el­ing a cer­tain type of pro­cess and has a bunch of use­ful math­e­mat­i­cal prop­er­ties (in a pre-com­puter world).

We can take this idea, either by tweaking the CLT or by throwing it away entirely and starting from scratch, to create a bunch of similarly useful distributions… a really, really large bunch.

For the pur­pose of this ar­ti­cle I’ll re­fer to these as “Named Distri­bu­tions”. They are dis­tri­bu­tions in­vented by var­i­ous stu­dents of math­e­mat­ics and sci­ence through­out his­tory, to tackle some prob­lem they were con­fronting, which proved to be use­ful enough that they got adopted into the reper­toire of com­monly used tools.

All of these dis­tri­bu­tions are se­lected based on two crite­ria:

a) Their mathematical properties, things that allow statisticians to more easily manipulate them, both in terms of complexity-of-thinking required to formulate the manipulations and in terms of computation time. Again, remember, these distributions came about in the age of pen and paper (and maybe some analog computers, which were very expensive and slow to use).

b) Their abil­ity to model pro­cesses or char­ac­ter­is­tics of pro­cesses in the real world.

Based on these two criteria we get distributions like Student's t, the chi-squared, the F distribution, the Bernoulli distribution, and so on.

However, when you split the criteria like so, it's fairly easy to see the inherent problem of using Named Distributions: they are a confounded product of:

a) Mathematical models selected for their ease of use

b) Priors about the underlying reality those mathematics are applied to

So, take something like Pythagoras's theorem about a right triangle: h^2 = l1^2 + l2^2.

But assume we are working with imperfect real-world objects; our "triangles" are never ideal. A statistician might look at 100 triangular shapes with a roughly right angle and say "huh, they seem to vary by at most this much from an ideal 90-degree angle", and based on that, he could come up with a formula like:

h^2 = l1^2 + l2^2 +/- 0.05*max(l1^2, l2^2)

And this might indeed be a useful formula, at least if the observations of the statistician generalize to all right angles that appear in the real world. But a THEOREM or a RULE it is not, not under any mathematical system from Euclid to Russell.

If the hypotenuse of a real-world right triangle proves to be usually within the bound predicted by this formula, and the formula gives us a margin of error that doesn't inconvenience further calculations we might plug it into, then it may well be useful.

But it doesn't change the fact that to make a statement about a "kind of right" triangle, we need to measure 3 of its parameters correctly; the formula is not a substitute for reality, it's just a simplification of reality when we have imperfect data.

Nor should we lose the origi­nal the­o­rem in fa­vor of this one, be­cause the origi­nal is ac­tu­ally mak­ing a true state­ment about a shared math­e­mat­i­cal world, even if it’s more limited in its ap­pli­ca­bil­ity.

We should remember that 0.05*max(l1^2, l2^2) is an artifact stemming from the statistician's original measurements; it shouldn't become a holy number, never to change for any reason.

If we agree this applied version of the Pythagorean theorem is indeed useful, then we should frequently measure right angles in the real world, and figure out if the "magic error" has to be changed to something else.

When we teach people the formula we should draw attention to the difference: "The left-hand side is Pythagoras's formula, the right-hand side is this artifact which is kind of useful, but there's no property of our mathematics that exactly defines what a slightly-off right-angled triangle is or tells us it should fit this rule".

The problem with Named Distributions is that one can't really point out the difference between mathematical truths and the priors within them. You can take the F distribution and say "This exists because it's easy to plug it into a bunch of calculations AND because it provides some useful priors for how real-world processes usually behave".

You can’t re­ally ask the ques­tion “How would the equa­tion defin­ing the F-dis­tri­bu­tion change, pro­vided that those pri­ors about the real world would change”. To be more pre­cise, we can ask the ques­tion, but the re­sult would be fairly com­pli­cated and the re­sult­ing F’-dis­tri­bu­tion might hold none of the use­ful math­e­mat­i­cal prop­er­ties of the F-dis­tri­bu­tion.

## Are Named Distri­bu­tions harm­ful?

Hopefully I've managed to get across the intuition for why I don't really like Named Distributions: they are very bad at separating model from prior.

But… on the other hand, ANY math­e­mat­i­cal model we use to ab­stract the world will have a built-in prior.

Want to use a re­gres­sion? Well, that as­sumes the phe­nomenon fits a straight line.

Want to use a 2,000,000 parameter neural network? Well, that assumes the phenomenon fits a 2,000,000 parameter equation using some combination of the finite set of operations provided by the network (e.g. +, -, >, ==, max, min, avg).

From that per­spec­tive, the pri­ors of any given Named Distri­bu­tions are no differ­ent than the pri­ors of any par­tic­u­lar neu­ral net­work or de­ci­sion tree or SVM.

How­ever, they come to be harm­ful be­cause peo­ple make as­sump­tions about their pri­ors be­ing some­what cor­rect, whilst as­sum­ing the pri­ors of other statis­ti­cal mod­els are some­what in­fe­rior.

Say, for example, I randomly define a distribution G, and this distribution happens to be amazing at modeling people's IQs. I take 10 homogenous groups of people and I give them IQ tests; I then fit G on 50% of the results for each group and then I see if this correctly models the other 50%… it works.

I go further, I validate G using k-fold cross-validation and it gets tremendous results no matter what folds I end up testing it on. Point is, I run a series of tests that, to the best of our current understanding of event modeling, show that G is better at modeling IQ than the normal distribution.

Well, one would as­sume I could then go ahead and in­form the neu­rol­ogy/​psy­chol­ogy/​eco­nomics/​so­ciol­ogy com­mu­nity:

Hey, guys, good news, you know how you've been using the normal distribution for models of IQ all this time… well, I got this new distribution G that works much better for this specific data type. Here is the data I validated it on; we should run some more tests and, if they turn up positive, start using it instead of the normal distribution.

In semi-ideal­ized re­al­ity, this might well re­sult in a co­or­di­na­tion prob­lem. That is to say, even as­sum­ing I was talk­ing with math­e­mat­i­cally apt peo­ple that un­der­stood the nor­mal dis­tri­bu­tion not to be magic, they might say:

Look, your G is indeed x% better than the normal distribution at modeling IQ, but I don't really think x% is enough to actually affect any calculations, and it would mean scrapping or re-writing a lot of old papers, wasting our colleagues' time to spread this information, trying to explain this thing to journalists that at least kind of understand what a bell curve is… etc.

In ac­tual re­al­ity, I’m afraid the an­swer I’d get is some­thing like:

What? This can't be true, you are lying and/or misunderstanding statistics. All my textbooks say that the standard distribution models IQ and is what should be used for that; there's even a thing called the Central Limit Theorem proving it, it's right there in the name "T-H-E-O-R-E-M", that means it's true.

Maybe the actual reaction lies somewhere between my two versions of the real world; it's hard to tell. If I were a sociologist I'd suspect there'd be some standard set of reactions μ, varying by some standard reaction variation coefficient σ: 68.2% of reactions are within the μ +/- σ range, 95.4% are within the μ +/- 2σ range, 99.7% are within the μ +/- 3σ range, and the rest are not relevant for publishing in journals which would benefit my tenure application, so they don't exist. Then I would go ahead and find some way to define μ and σ such that this would hold true.

Other than a pos­si­bly mis­guided as­sump­tion about the cor­rect­ness of their built-in pri­ors, is there any­thing else that might be harm­ful when us­ing Named Distri­bu­tions?

I think they can also be dan­ger­ous be­cause of their sim­plic­ity, that is, the lack of pa­ram­e­ters that can be tuned when fit­ting them.

Some people seem to think it's inherently bad to use complex models instead of simple ones when avoidable. I can't help but think that the people saying this are the same as those that say you shouldn't quote Wikipedia. I haven't seen any real arguments for this point that don't straw-man complex models or the people using them. So I'm going to disregard it for now; if anyone knows of a good article/paper/book arguing for why simple models are inherently better, please tell me and I'll link to it here.

Edit: After skimming it, it seems like this + this provides a potentially good (from my perspective) defense of the idea that model complexity is inherently bad, even assuming you fit properly (i.e. you don't fit directly onto the whole dataset). Leaving this here for now; will give more detail when (if) I get the time to look at it more.

On the other hand, simple models can often lead to a poor fit for the data; ultimately this will lead to shabby evaluations of the underlying process (e.g. if you are evaluating a resulting, poorly fit PDF, rather than the data itself). Even worse, it'll lead to making predictions using the poorly fit model and missing out on free information that your data has but your model is ignoring.

Even worse, simple models might cause discarding of valid results. All our significance tests are based on very simplistic PDFs, simply because their integrals had to be solved and computed at each point by hand; thus complex models were impossible to use. Does this lead to discarding data that doesn't fit a simplistic PDF, since the standard significance tests might give weird results, due to no fault of the underlying process, but rather because the generated data is unusual? To be honest, I don't know. I want to say that the answer might be "yes", or at least "No, but that's simply because one can massage the phrasing of the problem in such a way as to stop this from happening".

I’ve pre­vi­ously dis­cussed the topic of treat­ing neu­ral net­works as a math­e­mat­i­cal ab­strac­tion for ap­prox­i­mat­ing any func­tion. So they, or some other uni­ver­sal func­tion es­ti­ma­tors, could be used to gen­er­ate dis­tri­bu­tions that perfectly fit the data they are sup­posed to model.

If the data is best mod­eled by a nor­mal dis­tri­bu­tion, the func­tion es­ti­ma­tor will gen­er­ate some­thing that be­haves like a nor­mal dis­tri­bu­tion. If we lack data but have some good pri­ors, we can build those pri­ors into the struc­ture and the loss of our func­tion es­ti­ma­tor.
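As a toy stand-in for a universal function estimator, a Gaussian kernel density estimate shows the same behavior: a flexible, assumption-light model recovers the normal shape whenever the data really is normal. A sketch (the bandwidth and sample size are arbitrary choices of mine):

```python
import math
import random

random.seed(0)

# 2,000 samples that really do come from a standard normal.
data = [random.gauss(0, 1) for _ in range(2000)]

def kde(x, data, bandwidth=0.3):
    """Gaussian kernel density estimate: a flexible PDF built purely
    from the data, with no named distribution assumed."""
    k = lambda u: math.exp(-u * u / 2) / math.sqrt(2 * math.pi)
    return sum(k((x - d) / bandwidth) for d in data) / (len(data) * bandwidth)

# Because the data is normal, the estimator lands close to the
# standard normal density (which is ~0.399 at x = 0).
print(kde(0.0, data))
```

Had the data been bimodal or skewed, the same estimator would have recovered that instead; the "prior" here is only smoothness, not a particular named shape.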

Thus, rely­ing on these kinds of com­plex mod­els seems like a net pos­i­tive.

That’s not to say you should start with a 1,000M pa­ram­e­ter net­work and go from there, but rather, you should start iter­a­tively from the sim­plest pos­si­ble model and keep adding pa­ram­e­ters un­til you get a satis­fac­tory fit.

But that's mainly for ease of training, easy replicability and efficiency; I don't think there's any particularly strong argument that a 1,000M parameter model, if trained properly, will be in any way worse than a 2 parameter model, assuming similar performance on the training & testing data.

I’ve writ­ten a bit more about this in If Van der Waals was a neu­ral net­work, so I won’t be re­peat­ing that whole view­point here.

## In conclusion

I have a speculative impression that statistical thinking is hampered by worshiping various named distributions and building mathematical edifices around them.

I’m at least half-sure that some of those ed­ifices ac­tu­ally help in situ­a­tions where we lack the com­put­ing power to solve the prob­lems they model.

I’m also fairly con­vinced that, if you just scrapped the idea of us­ing Named Distri­bu­tions as a ba­sis for mod­el­ing the world, you’d just get rid of most statis­ti­cal the­ory al­to­gether.

I’m fairly con­vinced that dis­tri­bu­tion-fo­cused think­ing pi­geon­holes our mod­els of the world and stops us from look­ing at in­ter­est­ing data and ask­ing rele­vant ques­tions.

But I’m un­sure those data and ques­tions would be any more en­light­en­ing than what­ever we are do­ing now.

I think my com­plete po­si­tion here is as fol­lows:

1. Named Distributions can be dangerous via the side effect of people not understanding the strong epistemic priors embedded in them.

2. It doesn’t seem like us­ing Named Distri­bu­tions to fit a cer­tain pro­cess is worse com­pared to us­ing any other ar­bi­trary model with equal ac­cu­racy on the val­i­da­tion data.

3. How­ever most Named Distri­bu­tions are very sim­plis­tic mod­els and prob­a­bly can’t ac­cu­rately fit most real-world data (or deriva­tive in­for­ma­tion from that data) very well. We have mod­els that can ac­count for more com­plex­ity, with­out any clear dis­ad­van­tages, so it’s un­clear to me why we wouldn’t use those.

4. Named Distri­bu­tions are use­ful be­cause they in­clude cer­tain use­ful pri­ors about the un­der­ly­ing re­al­ity, how­ever, they lack a way to ad­just those pri­ors, which seems like a bad enough thing to over­ride this pos­i­tive.

5. Named Distri­bu­tions might pi­geon­hole us into cer­tain pat­terns of think­ing and bias us to­wards giv­ing more im­por­tance to the data they can fit. It’s un­clear to me how this af­fects us, my de­fault as­sump­tion would be that it’s nega­tive, as with most other ar­bi­trary con­straints that you can’t es­cape.

6. Named Distri­bu­tions al­low for the us­age of cer­tain math­e­mat­i­cal ab­strac­tions with ease, but it’s un­clear to me whether or not this sim­plifi­ca­tion of com­pu­ta­tions is needed un­der any cir­cum­stance. For ex­am­ple, it takes a few mil­lisec­onds to com­pute the p-value for any point on any ar­bi­trary PDF us­ing python, so we don’t re­ally need to have p-value ta­bles for spe­cific sig­nifi­cance tests and their as­so­ci­ated null hy­poth­e­sis dis­tri­bu­tions.
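For instance, a sketch of that computation for an arbitrary PDF, here using brute-force trapezoid integration of the upper tail (the quadrature method is my choice; anything similar works):

```python
import math

def tail_probability(pdf, x0, upper=50.0, n=200_000):
    """One-sided p-value: integrate pdf from x0 to `upper` with the
    trapezoid rule. Works for any PDF you can evaluate pointwise."""
    h = (upper - x0) / n
    total = 0.5 * (pdf(x0) + pdf(upper))
    for i in range(1, n):
        total += pdf(x0 + i * h)
    return total * h

# Sanity check against a value the old tables do cover:
std_normal = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
p = tail_probability(std_normal, 1.96)
print(p)  # P(Z > 1.96) ≈ 0.025
```

Swap `std_normal` for any homegrown density (normalized, nonnegative) and the same few milliseconds of computation replace an entire printed table.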

## Epilogue

The clock tower strikes 12, moon­light bursts into a room, find­ing Abra­ham de Moivre sit­ting at his desk. He’s com­put­ing the mean and stan­dard de­vi­a­tion for the heights of the 7 mem­bers of his house­hold, won­der­ing if this can be used as the ba­sis for the nor­mal dis­tri­bu­tion of all hu­man height.

The on­go­ing thun­der­storm con­sti­tutes an om­i­nous enough sce­nario, that, when light­ning strikes, a man in an alu­minum retro suit from a 70s space-pop mu­sic video ap­pears.

"Stop whatever you are doing!" says the time traveler. …

• Look, this Gaus­sian dis­tri­bu­tion stuff you’re do­ing, it seems great, but it’s go­ing to lead many sci­ences astray in the fu­ture.

• Gaus­sian?

• Look, it doesn't matter. The thing is that a lot of areas of modern science have put a few distributions, including your standard distribution, on a bit of a pedestal. Researchers in medicine, psychology, sociology, and even some areas of chemistry and physics must assume that one out of a small set of distributions can model the PDF of anything, or their papers will never get published.

… Not sure I’m fol­low­ing.

• I know you have good rea­son to use this dis­tri­bu­tion now, but in the fu­ture, we’ll have com­put­ers that can do calcu­la­tions and gather data billions of times faster than a hu­man, so all of this ab­strac­tion, by then built into the very fiber of math­e­mat­ics, will just be con­found­ing to us.

• So then, why not stop us­ing it?

• Look, be­hav­ioral eco­nomics hasn’t been in­vented yet, so I don’t have time to ex­plain. It has to do with the fact that once a stan­dard is en­trenched enough, even if it’s bad, ev­ery­one must con­form to it be­cause defec­tors will be seen as hav­ing some­thing to hide.

• Ok, but, what ex...

The door swings open and Leib­niz pops his head in

• You men­tioned in this fu­ture there are ma­chines that can do a near-in­finite num­ber of com­pu­ta­tions.

• Yeah, you could say that.

• Was our mathematics not involved in creating those machines? Say, the CLT de Moivre is working on now?

• I guess so, the advent of molecular physics would not have been possible without the normal distribution to model the behavior of various particles.

• So then, if the CLT had not existed, the knowledge of matter needed to construct your computation machines wouldn't have existed, and presumably you wouldn't have been here warning us about the CLT in the first place.

• Hmph.

• You see, the in­ven­tion of the CLT, whilst bring­ing with it some evil, is still the best pos­si­ble turn of events one could hope for, as we are liv­ing in the best of all pos­si­ble wor­lds.

• I just learned about ex­po­nen­tial fam­i­lies which seem to en­com­pass most of the dis­tri­bu­tions we have named. It seems like the dis­tri­bu­tions we name, then, do sort of form this nat­u­ral group with a lot of nice prop­er­ties.

• This post has a lot of mis­con­cep­tions in it.

Let's start with the application of the central limit theorem to champagne drinkers. First, there's the distinction between "liver weights are normally distributed" and "the mean of a sample of liver weights is normally distributed". The latter is much better-justified, since we compute the mean by adding a bunch of (presumably independent) random variables together. And the latter is usually what we actually use in basic analysis of experimental data—e.g. to decide whether there's a significant difference between the champagne-drinking group and the non-champagne-drinking group. That does not require that liver weights themselves be normally distributed.

That said, the CLT does provide rea­son to be­lieve that some­thing like liver weight would be nor­mally dis­tributed, but the OP omits a key piece of that ar­gu­ment: lin­ear ap­prox­i­ma­tion. You do men­tion this briefly:

Most pro­cesses don’t “ac­cu­mu­late” but rather con­tribute in weird ways to yield their re­sult. Growth rate of straw­ber­ries is f(lumens, water) but if you as­sume you can ap­prox­i­mate f as lumens*a + water*b you’ll get some re­ally weird situ­a­tion where your straw­ber­ries die in a very damp cel­lar or wither away in a desert.

… but that’s not quite the whole ar­gu­ment, so let’s go through it prop­erly. The ar­gu­ment for nor­mal­ity is that f is ap­prox­i­mately lin­ear over the range of typ­i­cal vari­a­tion of its in­puts. So, if (in some strange units) lu­mens vary be­tween 2 and 4, and wa­ter varies be­tween 0.3 and 0.5, then we’re in­ter­ested in whether f is ap­prox­i­mately lin­ear within that range. Ex­tend this ar­gu­ment to more vari­ables, ap­ply CLT (sub­ject to con­di­tions), and we get a nor­mal dis­tri­bu­tion. What hap­pens in damp cel­lar or desert is not rele­vant un­less those situ­a­tions are within the nor­mal range of vari­a­tion of our in­puts (e.g. within some par­tic­u­lar dataset).

(The OP also com­plains that “We can’t de­ter­mine the in­ter­val for which most pro­cesses will yield val­ues”. This is not nec­es­sar­ily a prob­lem; there’s like a gazillion ver­sions of the CLT, and not all of them de­pend on bound­ing pos­si­ble val­ues. CLT for e.g. the Cauchy dis­tri­bu­tion even works for in­finite var­i­ance.)

Now, a bet­ter ar­gu­ment against the CLT is this one:

Most pro­cesses in the real world, es­pe­cially pro­cesses that con­tribute to the same out­come, are in­ter­con­nected.

Even here, we can apply a linearity → normality argument as long as the errors are small relative to curvature. We model something like a metabolic network as a steady state with some noise ε: x = f(x) + ε. For small ε, we linearize and find that the deviation from the steady state satisfies δx ≈ (I − ∂f/∂x)^(−1) ε, where I is an identity matrix and ∂f/∂x is a matrix of partial derivatives. Note that this whole thing is linear in ε, so just like before, we can apply the CLT (subject to conditions), and find that the distributions of each component δx_i are roughly normal.

Take­away: in prac­tice, the nor­mal ap­prox­i­ma­tion via CLT is re­ally about noise be­ing small rel­a­tive to func­tion cur­va­ture. It’s mainly a lin­ear ap­prox­i­ma­tion over the typ­i­cal range of the noise.
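That takeaway is easy to check numerically with a made-up nonlinear process (the exp function and the input ranges below are purely illustrative choices of mine): push many random inputs through a curved function, and the output looks normal only while the noise stays in a range where the function is locally linear.

```python
import math
import random

random.seed(0)

def skewness(xs):
    """Third standardized moment: ~0 for a normal distribution."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / n
    return sum((x - m) ** 3 for x in xs) / (n * var ** 1.5)

# A nonlinear "outcome": y = exp(sum of 12 random inputs).
def draw(lo, hi, n=100_000):
    return [math.exp(sum(random.uniform(lo, hi) for _ in range(12)))
            for _ in range(n)]

narrow = draw(0.0, 0.01)  # noise small vs curvature: exp is locally ~linear
wide = draw(0.0, 1.0)     # noise large vs curvature: heavy lognormal skew
print(skewness(narrow), skewness(wide))
```

With tiny input variation the output's skewness is near zero (normal-ish); widen the inputs and the same function produces a heavily skewed distribution that no normal fit will capture.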

Next up, tri­an­gles.

"The left-hand side is Pythagoras's formula, the right-hand side is this artifact which is kind of useful, but there's no property of our mathematics that exactly defines what a slightly-off right-angled triangle is or tells us it should fit this rule."

There ab­solutely is a prop­erty of math­e­mat­ics that tells us what a slightly-off right-an­gled tri­an­gle is: it’s a tri­an­gle which satis­fies Pythago­ras’ for­mula, to within some un­cer­tainty. This is not tau­tolog­i­cal; it makes falsifi­able pre­dic­tions about the real world when two tri­an­gles share the same right-ish cor­ner. For in­stance, I could grab a piece of printer pa­per and draw two differ­ent di­ag­o­nal lines be­tween the left edge and the bot­tom edge, defin­ing two al­most-right tri­an­gles which share their cor­ner (the cor­ner of the pa­per). Now I mea­sure the sides of one of those two tri­an­gles very pre­cisely, and find that they satisfy Pythago­ras’ rule to within high pre­ci­sion—there­fore the cor­ner is very close to a right an­gle. Based on that, I pre­dict that lower-pre­ci­sion mea­sure­ments of the sides of the other tri­an­gle will also be within un­cer­tainty of satis­fy­ing Pythago­ras’ rule.

On to the next sec­tion...

I think [Named Distri­bu­tions] can also be dan­ger­ous be­cause of their sim­plic­ity, that is, the lack of pa­ram­e­ters that can be tuned when fit­ting them.
Some people seem to think it's inherently bad to use complex models instead of simple ones when avoidable; I can't help but think that the people saying this are the same ones who say you shouldn't quote Wikipedia.

I fully sup­port quot­ing Wikipe­dia, and it is in­her­ently bad to use com­plex mod­els in­stead of sim­ple ones when avoid­able. The rele­vant ideas are in chap­ter 20 of Jaynes’ Prob­a­bil­ity The­ory: The Logic of Science, or you can read about Bayesian model com­par­i­son.

In­tu­itively, it’s the same idea as con­ser­va­tion of ex­pected ev­i­dence: if one model pre­dicts “it will definitely be sunny to­mor­row” and an­other model pre­dicts “it might be sunny or it might rain”, and it turns out to be sunny, then we must up­date in fa­vor of the first model. In gen­eral, when a com­plex model is con­sis­tent with more pos­si­ble datasets than a sim­ple model, if we see a dataset which is con­sis­tent with the sim­ple model, then we must up­date in fa­vor of the sim­ple model. It’s that sim­ple. Bayesian model com­par­i­son quan­tifies that idea, and gives a more pre­cise trade­off be­tween qual­ity-of-fit and model com­plex­ity.
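The sunny/rain example can be made quantitative in a few lines (equal prior odds assumed purely for illustration):

```python
# Likelihood of observing "sunny" under each model.
p_sunny_given_A = 1.0   # model A: "it will definitely be sunny"
p_sunny_given_B = 0.5   # model B: "it might be sunny or it might rain"

# Equal priors before seeing the weather.
prior_A = prior_B = 0.5

# Bayes' rule after observing "sunny".
evidence = prior_A * p_sunny_given_A + prior_B * p_sunny_given_B
post_A = prior_A * p_sunny_given_A / evidence
post_B = prior_B * p_sunny_given_B / evidence

print(post_A, post_B)  # 2/3 vs 1/3: the model that risked more gains more
```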

• And the latter is usually what we actually use in basic analysis of experimental data—e.g. to decide whether there's a significant difference between the champagne-drinking group and the non-champagne-drinking group

I never brought up null-hypothesis testing in the liver weight example, and it was not meant to illustrate that; hence why I never brought up the idea of significance.

Mind you, I disagree that significance testing is done correctly, but this is not the argument against it, nor is it related to it.

(The OP also complains that “We can't determine the interval for which most processes will yield values”. This is not necessarily a problem; there are a gazillion versions of the CLT, and not all of them depend on bounding possible values. A generalized CLT even covers infinite-variance cases such as the Cauchy distribution, with convergence to a stable distribution rather than the normal.)
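To illustrate (a hypothetical simulation): sample means of Cauchy draws do not concentrate the way finite-variance means do; they stay Cauchy-distributed at the same scale, which is exactly the stable-law convergence the generalized CLT describes.

```python
import math
import random
import statistics

random.seed(0)

def cauchy():
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (random.random() - 0.5))

def iqr(xs):
    """Interquartile range, a spread measure that tolerates heavy tails."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return q3 - q1

# Means of n=100 draws, repeated 2000 times.
cauchy_means = [statistics.fmean(cauchy() for _ in range(100)) for _ in range(2000)]
normal_means = [statistics.fmean(random.gauss(0, 1) for _ in range(100)) for _ in range(2000)]

# Normal means shrink by ~1/sqrt(100); Cauchy means keep the full Cauchy spread.
print(iqr(cauchy_means), iqr(normal_means))
```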

My argument is not that you can't come up with a distribution for every little edge case imaginable; my argument is exactly that you CAN and you SHOULD, but this process should be done automatically, because every single problem is different and we have the means to dynamically find the model that best suits each problem rather than sticking to choosing between e.g. 60 named distributions.

Even here, we can ap­ply a lin­ear­ity → nor­mal­ity ar­gu­ment as long as the er­rors are small rel­a­tive to cur­va­ture.

I fail to see your argument here; that is, I fail to see how it deals with the interconnectedness part of my argument, and I fail to see how noise being small is something that ever happens in a real system in the sense you use it here, i.e. with noise being everything that's not the inference we are looking for.

There ab­solutely is a prop­erty of math­e­mat­ics that tells us what a slightly-off right-an­gled tri­an­gle is: it’s a tri­an­gle which satis­fies Pythago­ras’ for­mula, to within some un­cer­tainty.

But, by this definition that you use here, can any arbitrary thing I want to define mathematically, even if it contains within it some amount of hand-waviness or uncertainty, be a property of mathematics?

I fully sup­port quot­ing Wikipe­dia, and it is in­her­ently bad to use com­plex mod­els in­stead of sim­ple ones when avoid­able. The rele­vant ideas are in chap­ter 20 of Jaynes’ Prob­a­bil­ity The­ory: The Logic of Science, or you can read about Bayesian model com­par­i­son.

Your article seems to rest on the assumption that increased complexity == proneness to overfitting.

Which in itself is true if you aren't validating the model; but if you aren't validating the model, it seems to me that you're not even playing the correct game.

If you are validating the model, I don't see how the argument holds (I will look into the book tomorrow if I have time).

In­tu­itively, it’s the same idea as con­ser­va­tion of ex­pected ev­i­dence: if one model pre­dicts “it will definitely be sunny to­mor­row” and an­other model pre­dicts “it might be sunny or it might rain”, and it turns out to be sunny, then we must up­date in fa­vor of the first model. In gen­eral, when a com­plex model is con­sis­tent with more pos­si­ble datasets than a sim­ple model, if we see a dataset which is con­sis­tent with the sim­ple model, then we must up­date in fa­vor of the sim­ple model. It’s that sim­ple. Bayesian model com­par­i­son quan­tifies that idea, and gives a more pre­cise trade­off be­tween qual­ity-of-fit and model com­plex­ity.

I fail to understand this argument, and I did previously read the article mentioned here; maybe it's just a function of it being 1 AM here. I will try again tomorrow.

• Let’s start with the application of the central limit theorem to champagne drinkers. First, there’s the distinction between “liver weights are normally distributed” and “mean of a sample of liver weights is normally distributed”. The latter is much better-justified, since we compute the mean by adding a bunch of (presumably independent) random variables together. And the latter is usually what we actually use in basic analysis of experimental data—e.g. to decide whether there’s a significant difference between the champagne-drinking group and the non-champagne-drinking group. That does not require that liver weights themselves be normally distributed.

I think your statement in bold font is wrong. I think in cases such as champagne drinkers vs. non-champagne-drinkers, people are likely to use Student’s two-sample t-test or Welch’s two-sample unequal-variances t-test. These assume that in both groups each sample is distributed normally, not that the means are distributed normally.

• No, Student’s two-sample t-test does not require that individual samples are distributed normally. You certainly could derive it that way, but it’s not a necessary assumption. All it actually needs is normality of the group means via the CLT—see e.g. here.
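A quick simulation of that distinction (hypothetical data; the exponential distribution is just a stand-in for some skewed individual-level quantity like liver weight): individual values can be far from normal while group means are pulled toward normality by the CLT.

```python
import random
import statistics

random.seed(0)

def skewness(xs):
    """Sample skewness: 0 for symmetric data, ~2 for an exponential."""
    m = statistics.fmean(xs)
    s = statistics.stdev(xs)
    return statistics.fmean(((x - m) / s) ** 3 for x in xs)

# Individual values: heavily skewed, clearly not normal.
raw = [random.expovariate(1.0) for _ in range(5000)]

# Means of groups of 50: the CLT pulls these toward a normal (symmetric) shape.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(50)) for _ in range(5000)]

print(skewness(raw), skewness(means))  # large skew vs near-zero skew
```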

• When we teach people the formula we should draw attention to the difference: “The left-hand side is Pythagoras' formula, the right-hand side is this artifact which is kind of useful, but [1] there's no property of our mathematics that exactly defines what a slightly-off right-angled triangle is or [2] tells us it should fit this rule”.

[1] There probably is. (Though the idea that real triangles are exactly like mathematical triangles, and that this can be proved via logic, might be wrong.)

[2] And it tells you exactly how wrong the rule is based on what the triangle is actually like.

Well, that assumes the phenomenon fits a 2,000,000-parameter equation using some combination of the finite set of operations provided by the network (e.g. +, -, >, ==, max, min, avg).

Or that a 2,000,000 pa­ram­e­ter equa­tion will make a good ap­prox­i­ma­tion. (I’m not sure if that’s what you meant by “fit”.) If you have some as­sump­tions, and use math cor­rectly to find that the height of some­thing is 4 ft, but it’s ac­tu­ally 5 ft, then the as­sump­tions aren’t a perfect fit.

So I’m going to disregard it for now; if anyone knows of a good article/paper/book arguing for why simple models are inherently better, please tell me and I’ll link to it here.

Suppose I have 100 datapoints and I come up with a polynomial that fits all of them, with degree 99. How close do you think that polynomial is to the real function? Even if the datapoints are all 100% accurate, and the real function is a polynomial, there is no redundancy at all. Whereas if the polynomial were of degree 3, then 4 points are enough to come up with the rule, and the other 96 points just verify it (within its paradigm). When there's no redundancy, the “everything is a polynomial” paradigm doesn't seem justified. When 96 out of 100 points are redundant, it seems like polynomials are a really good fit.

(In other words, it’s not clear how a com­pli­cated model com­presses rather than obfus­cates the data—though what is “com­pli­cated” is a func­tion of the data available.)
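The redundancy point can be sketched numerically (the cubic and the sample points below are made up): four points pin down a degree-3 polynomial, and the other 96 points then act as independent checks of the paradigm.

```python
import numpy as np

# "Real function": a cubic, sampled exactly at 100 points (no noise, for clarity).
true_coeffs = [1.0, -2.0, 0.5, 3.0]          # x^3 - 2x^2 + 0.5x + 3
x = np.linspace(-1, 1, 100)
y = np.polyval(true_coeffs, x)

# Fit a cubic using only 4 of the 100 points -- exactly enough to pin it down...
idx = [0, 33, 66, 99]
fit = np.polyfit(x[idx], y[idx], 3)

# ...and the remaining 96 points all verify the rule, i.e. they are redundant.
residuals = np.abs(np.polyval(fit, x) - y)
print(residuals.max())  # ~machine precision

# A degree-99 interpolant through all 100 points would leave zero such checks.
```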

We have mod­els that can ac­count for more com­plex­ity, with­out any clear dis­ad­van­tages, so it’s un­clear to me why we wouldn’t use those.

This ar­ti­cle fo­cused heav­ily on Named Distri­bu­tions, and not a lot on these al­ter­na­tives. (NNs were men­tioned in pass­ing.)

You see, the in­ven­tion of the CLT, whilst bring­ing with it some evil, is still the best pos­si­ble turn of events one could hope for, as we are liv­ing in the best of all pos­si­ble wor­lds.

That sounds like an in­ter­est­ing bit of his­tory.

• Here is why you use sim­ple mod­els.

The blue crosses are the data. The red line is the line of best fit. The black line is a polynomial of degree 50 of best fit. High-dimensional models have a tendency to fit the data by wiggling wildly.

• That prob­lem would be han­dled by cross-val­i­da­tion; the OP is say­ing that a sim­ple model doesn’t have an ob­vi­ous ad­van­tage as­sum­ing that both val­i­date.

Given that both mod­els val­i­date, the main rea­son to pre­fer a sim­pler model is the sort of thing in Gears vs Be­hav­ior: the sim­pler model is more likely to con­tain phys­i­cally-re­al­is­tic in­ter­nal struc­ture, to gen­er­al­ize be­yond the test­ing/​train­ing sets, to han­dle dis­tri­bu­tion shifts, etc.
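For concreteness, “both models validate” might be checked with something like this k-fold sketch (toy data and polynomial degrees, chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a noisy line.
x = rng.uniform(-1, 1, 60)
y = 2 * x + rng.normal(0, 0.3, size=60)

def cv_error(degree, k=5):
    """Mean held-out squared error of a degree-`degree` polynomial over k folds."""
    idx = np.arange(len(x))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coeffs, x[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

# Compare a simple and a complex model on held-out folds; a wildly wiggling
# high-degree fit will usually show up here as a worse validation score.
print(cv_error(1), cv_error(10))
```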

• It de­pends on what cross val­i­da­tion you are us­ing. I would ex­pect com­plex mod­els to rarely cross val­i­date.