Named Distributions as Artifacts


On con­fus­ing pri­ors with models

Being Abraham de Moivre, born in the 17th century, must have been a really sad state of affairs. For one, you have the whole bubonic plague thing going on, but even worse for de Moivre, you don’t have computers and sensors for automated data collection.

As someone interested in complex real-world processes in the 17th century, you must collect all of your observations manually, or waste your tiny fortune paying unreliable workers to collect them for you. This is made even more complicated by the whole “dangerous to go outside because of the bubonic plague” thing.

As someone interested in chance and randomness in the 17th century, you’ve got very few models of “random” to go by. Sampling a uniform random variable is done by literally rolling dice or tossing a coin. Dice which are imperfectly crafted, so as not to be quite uniform. A die (or coin) which limits the possibilities of your random variable to between 2 and 20 (most likely 12, but I’m not discounting the existence of medieval d20s). You can order custom dice, and you can roll multiple dice and combine the results into a number (e.g. 4d10 are enough to sample uniformly from between 1 and 10,000), but it becomes increasingly tedious to generate larger numbers.

Even more horrible, once you tediously gather all this data, your analysis of it is limited to pen and paper: you have 2 plotting dimensions, maybe 3 if you are a particularly good artist. Every time you want to visualize 100 samples a different way, you have to carefully draw a coordinate system and tediously plot 100 points. Every time you want to check whether an equation fits your 100 samples, you must calculate the equation 100 times… and then you must compute the difference between the equation’s results and your samples, and then manually use those to compute your error function.

I invite you to pick two random dice and throw them until you have a vague experimental proof that the CLT holds, and then think about the fact that this must have been de Moivre’s day-to-day life studying proto-probability.
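If you’d rather not spend an afternoon throwing dice, a few lines of Python (a modern luxury de Moivre lacked) reproduce the experiment; the choice of two d6s and 100,000 rolls is arbitrary:

```python
import random
from collections import Counter

random.seed(42)

# Roll two six-sided dice many times and tally the sums.
counts = Counter(
    random.randint(1, 6) + random.randint(1, 6)
    for _ in range(100_000)
)

# Crude text histogram: one '#' per ~500 rolls.
for total in range(2, 13):
    print(f"{total:2d} {'#' * (counts[total] // 500)}")
```

The counts pile up in the middle and fall off towards 2 and 12; add more dice and the shape creeps ever closer to a bell curve.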

So it’s by no ac­ci­dent, mis­for­tune, or ill-think­ing that de Moivre came up with the nor­mal dis­tri­bu­tion [trade­marked by Gauss] and the cen­tral limit the­o­rem (CLT).

A primitive precursor of the CLT is the idea that when several independent random variables are uniform on the same interval, their sum will be approximately described by a normal distribution (aka Gaussian distribution, bell curve).

This is really nice: because a normal distribution is a relatively simple function of two parameters (mean and variance), it gives de Moivre a very malleable mathematical abstraction.

However, de Moivre’s version of the CLT was that when independent random variables (which needn’t be uniform) are normalized to the same interval and then summed, the sum will tend towards a normal distribution. That is to say, some normal distribution will be able to fit their sum fairly well.

The obvious issues in this definition are the “tend towards”, the “random”, and the implicit “how many”. There’s some relation between these: the fewer ways there are for something to be “random”, and the more of these variables one has, the “closer” the sum will be to a normal distribution.

This can be for­mal­ized to fur­ther nar­row down the types of ran­dom vari­ables from which one can sam­ple in or­der for the CLT to hold. Once the na­ture of the ran­dom gen­er­a­tors is known, the num­ber of ran­dom vari­ables and sam­ples needed to get within a cer­tain er­ror range of the clos­est pos­si­ble nor­mal dis­tri­bu­tion can also be ap­prox­i­mated fairly well.

But that is beside the point: the purpose of the CLT is not to wax lyrical about idealized random processes, it’s to let us study the real world.

De Moivre’s father, a proud patriot surgeon, wants to prove that the sparkling wines of Champagne have a fortifying effect upon the liver. However, doing autopsies is hard, because people don’t like having their family’s corpses “desecrated” and because there’s a plague. Still, the good surgeon gets a sampling of livers from normal Frenchmen and from Champagne-drinking Frenchmen and measures their weights.

Now, normally, he could average these weights and compare the difference, or look at the smallest and largest weight from each set to figure out the difference between the extremes.

But the samples are very small, and might not be fully representative. Maybe, by chance, the heaviest Champagne-drinker liver he saw is 2.5kg, but the heaviest possible Champagne-drinker liver is 4kg. Maybe the Champagne drinkers’ livers are heavier overall, but there’s one extremely heavy liver in the control group.

Well, armed with the Nor­mal Distri­bu­tion and the CLT we say:

Liver weight is prob­a­bly a func­tion of var­i­ous un­der­ly­ing pro­cesses and we can as­sume they are fairly ran­dom. Since we can think of them as sum­ming up to yield the liver’s weight, we can as­sume these weights will fall onto a nor­mal dis­tri­bu­tion.

Thus, we can take a few ob­ser­va­tions about the weights:

Nor­mal French­men: 1.5, 1.7, 2.8, 1.6, 1.8, 1.2, 1.1, 1, 2, 1.3, 1.8, 1.5, 1.3, 0.9, 1, 1.1, 0.9, 1.2

Cham­pagne drink­ing French­men: 1.7, 2.5, 2.5, 1.2, 2.4, 1.9, 2.2, 1.7

And we can com­pute the mean and the stan­dard de­vi­a­tion and then we can plot the nor­mal dis­tri­bu­tion of liver weight (cham­pagne drinkers in or­ange, nor­mal in blue):
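That computation is a few lines of Python for the modern reader. A sketch using the sample weights above; I’m assuming the population standard deviation (ddof=0) here, since that is what reproduces the figures quoted next:

```python
import statistics

normal = [1.5, 1.7, 2.8, 1.6, 1.8, 1.2, 1.1, 1, 2, 1.3,
          1.8, 1.5, 1.3, 0.9, 1, 1.1, 0.9, 1.2]
champagne = [1.7, 2.5, 2.5, 1.2, 2.4, 1.9, 2.2, 1.7]

for label, weights in [("normal", normal), ("champagne", champagne)]:
    mu = statistics.mean(weights)
    sigma = statistics.pstdev(weights)  # population standard deviation
    # "Practically all" of a normal distribution lies within mu +/- 3 sigma.
    print(f"{label}: mean={mu:.2f} sigma={sigma:.2f} "
          f"range=[{mu - 3 * sigma:.2f}, {mu + 3 * sigma:.2f}]")
```

The μ ± 3σ ranges come out to roughly 0.0–2.8kg for the normal group and 0.7–3.3kg for the Champagne drinkers.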

So, what’s the max­i­mum weight of a Cham­pagne drinker’s liver? 3.3kg

Nor­mal liver? 2.8

Min­i­mum weight? 0.7kg cham­pagne, 0.05kg nor­mal (maybe con­founded by ba­bies that are un­able to drink Cham­pagne, in­ves­ti­gate fur­ther)

But we can also answer questions like:

For normal people, say 9/10 of people, the ones that aren’t at the extremes, what’s the range of champagne-drinking vs normal liver weights?

Or ques­tions like:

Given liver weight x, how cer­tain are we this per­son drank Cham­pagne?

Maybe ques­tions like:

Given that I suspect my kid has a tiny liver, say 0.7kg, will drinking Champagne give him a 4/5 chance of fortifying it to 1.1kg?
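The middle question, at least, has a mechanical answer once we accept the two fitted normals. A minimal sketch via Bayes’ rule; the 50/50 prior is an assumption for illustration, and the parameters are the rounded fits from the samples above:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def p_champagne(weight, prior=0.5):
    """P(drank Champagne | liver weight), via Bayes' rule on the two
    fitted normals (parameters rounded from the samples above)."""
    like_champagne = normal_pdf(weight, mu=2.01, sigma=0.44)
    like_normal = normal_pdf(weight, mu=1.43, sigma=0.46)
    num = like_champagne * prior
    return num / (num + like_normal * (1 - prior))

print(p_champagne(2.5))  # heavy liver: probably a Champagne drinker
print(p_champagne(1.0))  # light liver: probably not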

So the CLT seems amaz­ing, at least it seems amaz­ing un­til you take a look at the real world, where:

a) We can’t de­ter­mine the in­ter­val for which most pro­cesses will yield val­ues. You can study the stock mar­ket from 2010 to 2019, see that the DOW daily change is be­tween −3% and +4% and fairly ran­dom, as­sume the DOW price change is gen­er­ated by a ran­dom func­tion with val­ues be­tween −3% and 4%… and then March 2020 rolls around and sud­denly that func­tion yields −22% and you’re ****.

b) Most processes don’t “accumulate” but rather contribute in weird ways to yield their result. The growth rate of strawberries is f(lumens, water), but if you assume you can approximate f as lumens*a + water*b, you’ll get some really weird situations where your strawberries die in a very damp cellar or wither away in a desert.

c) Most processes in the real world, especially processes that contribute to the same outcome, are interconnected. Ever wonder why nutritionists are seldom able to make any consistent or impactful claims about an ideal diet? Well, maybe it’ll make more sense if you look at a diagram of metabolic pathways. In the real world, most processes that lead to a shared outcome are so interconnected that it’s a herculean task to even rescue the possibility of thinking causally.
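Point b) can be made concrete with a toy sketch: fit the best additive model to an interacting process and check it at an extreme. The “true” growth function here, a limiting-factor rule, is entirely made up for illustration:

```python
# "True" strawberry growth: limited by whichever input is scarcer.
def growth(lumens, water):
    return min(lumens, water)

data = [(l, w, growth(l, w)) for l in range(5) for w in range(5)]

# Least-squares fit of the additive model a*lumens + b*water,
# via the normal equations for this 2x2 system.
sll = sum(l * l for l, w, y in data)
sww = sum(w * w for l, w, y in data)
slw = sum(l * w for l, w, y in data)
sly = sum(l * y for l, w, y in data)
swy = sum(w * y for l, w, y in data)
det = sll * sww - slw * slw
a = (sly * sww - swy * slw) / det
b = (swy * sll - sly * slw) / det

print(f"fitted: growth ~ {a:.2f}*lumens + {b:.2f}*water")
# In a desert (plenty of light, no water) true growth is zero,
# but the additive model still predicts healthy strawberries.
print("desert (lumens=4, water=0): true", growth(4, 0),
      "model", round(a * 4 + b * 0, 2))
```

The additive fit is decent on average and absurd exactly at the corners, which is where decisions tend to get made.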

On a surface level, my intuitions tell me it’s wrong to make such assumptions; the epistemically correct way to think about making inferences from data ought to go along the lines of:

  • You can use the data at your dis­posal, no more

  • If you want more data, you can either collect more, or find a causal mechanism, use it to extrapolate, prove that it holds even in edge cases by making truthful predictions, and use that


  • If you want more data, you can ex­trap­o­late given that you know of a very similar dataset to back up your ex­trap­o­la­tion. E.g. if we know the dis­tri­bu­tion of English­men liver sizes, we can as­sume the dis­tri­bu­tion of French­men liver sizes might fit that shape. But this ex­trap­o­la­tion is just a bet­ter-than-ran­dom hy­poth­e­sis, to be scru­ti­nized fur­ther based on the pre­dic­tions stem­ming from it.

Granted, your mileage may vary de­pend­ing on your area of study. But no mat­ter how laxly you in­ter­pret the above, there’s a huge leap to be made from those rules to as­sum­ing the CLT holds on real-world pro­cesses.

But, again, this is the 17th century: there are no computers, no fancy data collection apparatus, there’s a plague going on, and, to be fair, loads of things, from the height of people to the grape yield of a vine, seem to fit the normal distribution quite well.

De Moivre didn’t have even a minute fraction of the options we have now; all things considered, he was rather ingenious.

But lest we for­get, the nor­mal dis­tri­bu­tion came about as a tool to:

  • Re­duce com­pu­ta­tion time, in a world where peo­ple could only use pen and pa­per and their own mind.

  • Make up for lack of data, in a world where “data” meant what one could see around him plus a few dozen books with some tables attached to them.

  • Allow techniques that operate on functions to be applied to samples (by abstracting them as a normal distribution). In other words, allow continuous math to be applied to a discrete world.

Ar­ti­facts that be­came idols

So, the CLT yields a type of dis­tri­bu­tion that we in­tuit is good for mod­el­ing a cer­tain type of pro­cess and has a bunch of use­ful math­e­mat­i­cal prop­er­ties (in a pre-com­puter world).

We can take this idea, either by tweaking the CLT or by throwing it away entirely and starting from scratch, to create a bunch of similarly useful distributions… a really, really large bunch.

For the pur­pose of this ar­ti­cle I’ll re­fer to these as “Named Distri­bu­tions”. They are dis­tri­bu­tions in­vented by var­i­ous stu­dents of math­e­mat­ics and sci­ence through­out his­tory, to tackle some prob­lem they were con­fronting, which proved to be use­ful enough that they got adopted into the reper­toire of com­monly used tools.

All of these dis­tri­bu­tions are se­lected based on two crite­ria:

a) Their mathematical properties: things that allow statisticians to more easily manipulate them, both in terms of the complexity-of-thinking required to formulate the manipulations and in terms of computation time. Again, remember, these distributions came about in the age of pen and paper (and maybe some analog computers, which were very expensive and slow to use).

b) Their abil­ity to model pro­cesses or char­ac­ter­is­tics of pro­cesses in the real world.

Based on these two criteria we get distributions like Student’s t, the chi-squared distribution, the F distribution, the Bernoulli distribution, and so on.

However, when you split the criteria like so, it’s fairly easy to see the inherent problem of using Named Distributions: they are a confounded product of:

a) Mathematical models selected for their ease of use

b) Priors about the underlying reality those mathematics are applied to

So, take something like Pythagoras’ theorem about a right triangle: h^2 = l1^2 + l2^2.

But assume we are working with imperfect real-world objects; our “triangles” are never ideal. A statistician might look at 100 triangular shapes with a roughly right angle and say “huh, they seem to vary by at most this much from an ideal 90-degree angle”, and based on that, he could come up with a formula like:

h^2 = l1^2 + l2^2 +/- 0.05*max(l1^2, l2^2)

And this might indeed be a useful formula, at least if the observations of the statistician generalize to all right angles that appear in the real world. But a THEOREM or a RULE it is not, not under any mathematical system from Euclid to Russell.

If the hypotenuse of a real-world right triangle usually proves to be within the bound predicted by this formula, and the formula gives us a margin of error that doesn’t inconvenience further calculations we might plug it into, then it may well be useful.
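One can check what such a tolerance buys with a quick sketch; the ±1 degree wobble and the side lengths here are my assumptions, not the statistician’s:

```python
import math
import random

random.seed(0)

def within_bound(l1, l2, angle_deg):
    """True if the actual hypotenuse-squared (law of cosines) falls
    within Pythagoras +/- 0.05*max(l1^2, l2^2)."""
    h2 = l1**2 + l2**2 - 2 * l1 * l2 * math.cos(math.radians(angle_deg))
    slack = 0.05 * max(l1**2, l2**2)
    return abs(h2 - (l1**2 + l2**2)) <= slack

# 1000 imperfect "right" triangles whose angle is off by at most 1 degree.
trials = [
    within_bound(random.uniform(1, 10), random.uniform(1, 10),
                 90 + random.uniform(-1, 1))
    for _ in range(1000)
]
print(sum(trials), "of", len(trials), "within the bound")
```

With a ±1° wobble the error term 2·l1·l2·|cos θ| is at most about 0.035·max(l1², l2²), so the bound always holds; widen the wobble to ±2° and it starts failing for near-isosceles triangles, which is exactly the kind of thing one should go out and re-measure.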

But it doesn’t change the fact that to make a statement about a “kind of right” triangle, we need to measure 3 of its parameters correctly. The formula is not a substitute for reality, it’s just a simplification of reality for when we have imperfect data.

Nor should we lose the origi­nal the­o­rem in fa­vor of this one, be­cause the origi­nal is ac­tu­ally mak­ing a true state­ment about a shared math­e­mat­i­cal world, even if it’s more limited in its ap­pli­ca­bil­ity.

We should remember that 0.05*max(l1^2, l2^2) is an artifact stemming from the statistician’s original measurements; it shouldn’t become a holy number, never to change for any reason.

If we agree this applied version of the Pythagorean theorem is indeed useful, then we should frequently measure right angles in the real world, and figure out whether the “magic error” has to be changed to something else.

When we teach people the formula, we should draw attention to the difference: “The left-hand side is Pythagoras’ formula; the right-hand side is this artifact which is kind of useful, but there’s no property of our mathematics that exactly defines what a slightly-off right-angled triangle is, or tells us it should fit this rule”.

The problem with Named Distributions is that one can’t really point out the difference between the mathematical truths and the priors within them. You can take the F distribution and say “This exists because it’s easy to plug into a bunch of calculations AND because it provides some useful priors for how real-world processes usually behave”.

You can’t really ask the question “How would the equation defining the F distribution change, provided that those priors about the real world changed?”. To be more precise, we can ask the question, but the answer would be fairly complicated, and the resulting F’-distribution might hold none of the useful mathematical properties of the F distribution.

Are Named Distri­bu­tions harm­ful?

Hopefully I’ve managed to get across the intuition for why I don’t really like Named Distributions: they are very bad at separating model from prior.

But… on the other hand, ANY math­e­mat­i­cal model we use to ab­stract the world will have a built-in prior.

Want to use a linear regression? Well, that assumes the phenomenon fits a straight line.

Want to use a 2,000,000 parameter neural network? Well, that assumes the phenomenon fits a 2,000,000 parameter equation using some combination of the finite set of operations provided by the network (e.g. +, -, >, ==, max, min, avg).

From that per­spec­tive, the pri­ors of any given Named Distri­bu­tions are no differ­ent than the pri­ors of any par­tic­u­lar neu­ral net­work or de­ci­sion tree or SVM.

However, they come to be harmful because people assume their priors are somewhat correct, whilst assuming the priors of other statistical models are somehow inferior.

Say, for example, I randomly define a distribution G, and this distribution happens to be amazing at modeling people’s IQs. I take 10 homogeneous groups of people and give them IQ tests; I then fit G on 50% of the results for each group and see if this correctly models the other 50%… it works.

I go further: I validate G using k-fold cross-validation, and it gets tremendous results no matter what folds I end up testing it on. Point is, I run a series of tests that, to the best of our current understanding of event modeling, show that G is better at modeling IQ than the normal distribution.

Well, one would assume I could then go ahead and inform the neurology/psychology/economics/sociology community:

Hey guys, good news: you know how you’ve been using the normal distribution for models of IQ all this time… well, I’ve got this new distribution G that works much better for this specific data type. Here is the data I validated it on; we should run some more tests and, if they turn up positive, start using it instead of the normal distribution.

In semi-idealized reality, this might well result in a coordination problem. That is to say, even assuming I was talking with mathematically apt people who understood the normal distribution not to be magic, they might say:

Look, your G is indeed x% better than the normal distribution at modeling IQ, but I don’t really think x% is enough to actually affect any calculations, and it would mean scrapping or re-writing a lot of old papers, wasting our colleagues’ time to spread this information, trying to explain this thing to journalists who at least kind of understand what a bell curve is… etc.

In ac­tual re­al­ity, I’m afraid the an­swer I’d get is some­thing like:

What? This can’t be true, you are lying and/or misunderstanding statistics. All my textbooks say that the standard distribution models IQ and is what should be used for it; there’s even a thing called the Central Limit Theorem proving it. It’s right there in the name, “T-H-E-O-R-E-M”, that means it’s true.

Maybe the actual reaction lies somewhere between my two versions of the real world; it’s hard to tell. If I were a sociologist, I’d suspect there’d be some standard set of reactions μ, varying by some standard reaction variation coefficient σ: 68.2% of reactions are within the μ ± σ range, 95.4% are within the μ ± 2σ range, 99.7% are within the μ ± 3σ range, and the rest are not relevant for publishing in journals which would benefit my tenure application, so they don’t exist. Then I would go ahead and find some way to define μ and σ such that this would hold true.

Alas, I’m not a so­ciol­o­gist, so I have only empty em­piri­cal spec­u­la­tion about this hy­po­thet­i­cal.

Other than a pos­si­bly mis­guided as­sump­tion about the cor­rect­ness of their built-in pri­ors, is there any­thing else that might be harm­ful when us­ing Named Distri­bu­tions?

I think they can also be dan­ger­ous be­cause of their sim­plic­ity, that is, the lack of pa­ram­e­ters that can be tuned when fit­ting them.

Some people seem to think it’s inherently bad to use complex models instead of simple ones when avoidable; I can’t help but think that the people saying this are the same as those who say you shouldn’t quote Wikipedia. I haven’t seen any real arguments for this point that don’t straw-man complex models or the people using them. So I’m going to disregard it for now; if anyone knows of a good article/paper/book arguing for why simple models are inherently better, please tell me and I’ll link to it here.

Edit: After skimming it, it seems like this + this provides a potentially good (from my perspective) defense of the idea that model complexity is inherently bad, even assuming you fit properly (i.e. you don’t fit directly onto the whole dataset). Leaving this here for now; I will give more detail when (if) I get the time to look at it more.

On the other hand, simple models can often lead to a poor fit for the data; ultimately this will lead to shabby evaluations of the underlying process (e.g. if you are evaluating a resulting, poorly fit PDF, rather than the data itself). Even worse, it’ll lead to making predictions using the poorly fit model and missing out on free information that your data has but your model is ignoring.

Even worse, simple models might cause the discarding of valid results. All our significance tests are based on very simplistic PDFs, simply because their integrals had to be solved and computed at each point by hand; thus complex models were impossible to use. Does this lead to discarding data that doesn’t fit a simplistic PDF, since the standard significance tests might give weird results, due to no fault of the underlying process, but rather because the generated data is unusual? To be honest, I don’t know. I want to say that the answer might be “yes”, or at least “no, but that’s simply because one can massage the phrasing of the problem in such a way as to stop this from happening”.

I’ve pre­vi­ously dis­cussed the topic of treat­ing neu­ral net­works as a math­e­mat­i­cal ab­strac­tion for ap­prox­i­mat­ing any func­tion. So they, or some other uni­ver­sal func­tion es­ti­ma­tors, could be used to gen­er­ate dis­tri­bu­tions that perfectly fit the data they are sup­posed to model.

If the data is best mod­eled by a nor­mal dis­tri­bu­tion, the func­tion es­ti­ma­tor will gen­er­ate some­thing that be­haves like a nor­mal dis­tri­bu­tion. If we lack data but have some good pri­ors, we can build those pri­ors into the struc­ture and the loss of our func­tion es­ti­ma­tor.

Thus, rely­ing on these kinds of com­plex mod­els seems like a net pos­i­tive.

That’s not to say you should start with a 1,000M parameter network and go from there; rather, you should start iteratively from the simplest possible model and keep adding parameters until you get a satisfactory fit.

But that’s mainly for ease of training, easy replicability, and efficiency. I don’t think there’s any particularly strong argument that a 1,000M parameter model, if trained properly, will be in any way worse than a 2 parameter model, assuming similar performance on the training & testing data.

I’ve written a bit more about this in If Van der Waals was a neural network, so I won’t repeat that whole viewpoint here.

In conclusion

I have a speculative impression that statistical thinking is hampered by worshiping various named distributions and building mathematical edifices around them.

I’m at least half-sure that some of those ed­ifices ac­tu­ally help in situ­a­tions where we lack the com­put­ing power to solve the prob­lems they model.

I’m also fairly con­vinced that, if you just scrapped the idea of us­ing Named Distri­bu­tions as a ba­sis for mod­el­ing the world, you’d just get rid of most statis­ti­cal the­ory al­to­gether.

I’m fairly con­vinced that dis­tri­bu­tion-fo­cused think­ing pi­geon­holes our mod­els of the world and stops us from look­ing at in­ter­est­ing data and ask­ing rele­vant ques­tions.

But I’m un­sure those data and ques­tions would be any more en­light­en­ing than what­ever we are do­ing now.

I think my com­plete po­si­tion here is as fol­lows:

  1. Named Distributions can be dangerous via the side effect of people not understanding the strong epistemic priors embedded in them.

  2. It doesn’t seem like us­ing Named Distri­bu­tions to fit a cer­tain pro­cess is worse com­pared to us­ing any other ar­bi­trary model with equal ac­cu­racy on the val­i­da­tion data.

  3. However, most Named Distributions are very simplistic models and probably can’t fit most real-world data (or derivative information from that data) very accurately. We have models that can account for more complexity, without any clear disadvantages, so it’s unclear to me why we wouldn’t use those.

  4. Named Distri­bu­tions are use­ful be­cause they in­clude cer­tain use­ful pri­ors about the un­der­ly­ing re­al­ity, how­ever, they lack a way to ad­just those pri­ors, which seems like a bad enough thing to over­ride this pos­i­tive.

  5. Named Distri­bu­tions might pi­geon­hole us into cer­tain pat­terns of think­ing and bias us to­wards giv­ing more im­por­tance to the data they can fit. It’s un­clear to me how this af­fects us, my de­fault as­sump­tion would be that it’s nega­tive, as with most other ar­bi­trary con­straints that you can’t es­cape.

  6. Named Distributions allow for the usage of certain mathematical abstractions with ease, but it’s unclear to me whether this simplification of computations is needed under any circumstance. For example, it takes a few milliseconds to compute the p-value at any point of any arbitrary PDF using Python, so we don’t really need p-value tables for specific significance tests and their associated null hypothesis distributions.
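As a sanity check of that last point, here is a sketch (stdlib only) that computes a one-sided p-value by numerically integrating a PDF’s tail; the two-component mixture is an arbitrary stand-in for a “non-named” null distribution:

```python
import math

def pdf(x):
    """An arbitrary null distribution: a mixture of two normals."""
    def normal(x, mu, sigma):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    return 0.7 * normal(x, 0.0, 1.0) + 0.3 * normal(x, 3.0, 0.5)

def p_value(observed, upper=20.0, steps=200_000):
    """One-sided p-value P(X >= observed), by trapezoidal integration."""
    h = (upper - observed) / steps
    total = 0.5 * (pdf(observed) + pdf(upper))
    for i in range(1, steps):
        total += pdf(observed + i * h)
    return total * h

print(round(p_value(4.0), 4))
```

No table of critical values required, and the same two functions work unchanged for any density one can write down.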


The clock tower strikes 12, moon­light bursts into a room, find­ing Abra­ham de Moivre sit­ting at his desk. He’s com­put­ing the mean and stan­dard de­vi­a­tion for the heights of the 7 mem­bers of his house­hold, won­der­ing if this can be used as the ba­sis for the nor­mal dis­tri­bu­tion of all hu­man height.

The on­go­ing thun­der­storm con­sti­tutes an om­i­nous enough sce­nario, that, when light­ning strikes, a man in an alu­minum retro suit from a 70s space-pop mu­sic video ap­pears.

“Stop whatever you are doing!” says the time traveler. …

  • Look, this Gaus­sian dis­tri­bu­tion stuff you’re do­ing, it seems great, but it’s go­ing to lead many sci­ences astray in the fu­ture.

  • Gaus­sian?

  • Look, it doesn’t matter. The thing is that a lot of areas of modern science have put a few distributions, including your normal distribution, on a bit of a pedestal. Researchers in medicine, psychology, sociology, and even some areas of chemistry and physics must assume that one out of a small set of distributions can model the PDF of anything, or their papers will never get published.

… Not sure I’m fol­low­ing.

  • I know you have good rea­son to use this dis­tri­bu­tion now, but in the fu­ture, we’ll have com­put­ers that can do calcu­la­tions and gather data billions of times faster than a hu­man, so all of this ab­strac­tion, by then built into the very fiber of math­e­mat­ics, will just be con­found­ing to us.

  • So then, why not stop us­ing it?

  • Look, be­hav­ioral eco­nomics hasn’t been in­vented yet, so I don’t have time to ex­plain. It has to do with the fact that once a stan­dard is en­trenched enough, even if it’s bad, ev­ery­one must con­form to it be­cause defec­tors will be seen as hav­ing some­thing to hide.

  • Ok, but, what ex...

The door swings open and Leibniz pops his head in.

  • You men­tioned in this fu­ture there are ma­chines that can do a near-in­finite num­ber of com­pu­ta­tions.

  • Yeah, you could say that.

  • Was our mathematics not involved in creating those machines? Say, the CLT de Moivre is working on now?

  • I guess so; the advent of molecular physics would not have been possible without the normal distribution to model the behavior of various particles.

  • So then, if the CLT had not existed, the knowledge of matter needed to construct your computation machines wouldn’t have existed in the first place, and presumably you wouldn’t be here warning us about the CLT at all.

  • Hmph.

  • You see, the in­ven­tion of the CLT, whilst bring­ing with it some evil, is still the best pos­si­ble turn of events one could hope for, as we are liv­ing in the best of all pos­si­ble wor­lds.