When we teach people the formula we should draw attention to the difference: “The left-hand side is Pythagoras’s formula; the right-hand side is this artifact which is kind of useful, but [1] there’s no property of our mathematics that exactly defines what a slightly-off right-angled triangle is, or [2] tells us it should fit this rule”.
[1] There probably is. (Though the idea that real triangles are exactly like mathematical triangles, and that this can be proved via logic, might be wrong.)
[2] And it tells you exactly how wrong the rule is, based on what the triangle is actually like.
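One way to make footnote [2] concrete: the law of cosines generalizes the Pythagorean rule, and its extra term measures exactly how wrong a² + b² = c² is for a triangle whose angle is slightly off 90°. A minimal sketch (the function name and side lengths are my own illustration, not from the discussion):

```python
import math

def pythagorean_error(a, b, gamma_degrees):
    """Return c^2 - (a^2 + b^2) for sides a, b enclosing angle gamma.

    By the law of cosines, c^2 = a^2 + b^2 - 2*a*b*cos(gamma), so the
    returned error is -2*a*b*cos(gamma): zero at exactly 90 degrees,
    and growing smoothly as the angle drifts away from a right angle.
    """
    gamma = math.radians(gamma_degrees)
    c_squared = a**2 + b**2 - 2 * a * b * math.cos(gamma)
    return c_squared - (a**2 + b**2)

print(pythagorean_error(3, 4, 90))  # exactly right-angled: error ~0
print(pythagorean_error(3, 4, 91))  # slightly-off triangle: small positive error
```

So the generalized rule doesn’t just say the Pythagorean formula is approximate; it quantifies the deviation as a function of what the triangle is actually like.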
Well, that assumes the phenomenon fits a 2,000,000-parameter equation using some combination of the finite set of operations provided by the network (e.g. +, -, >, ==, max, min, avg).
Or that a 2,000,000-parameter equation will make a good approximation. (I’m not sure if that’s what you meant by “fit”.) If you have some assumptions, and use math correctly to find that the height of something is 4 ft, but it’s actually 5 ft, then the assumptions aren’t a perfect fit.
So I’m going to disregard it for now; if anyone knows of a good article/paper/book arguing for why simple models are inherently better, please tell me and I’ll link to it here.
Here’s an argument made in response to your article:
Suppose I have 100 datapoints and I come up with a polynomial of degree 99 that fits all of them. How close do you think the polynomial is to the real function? Even if the datapoints are all 100% accurate, and the real function is a polynomial, there is no redundancy at all. Whereas if the polynomial were of degree 3, then 4 points are enough to come up with the rule, and the other 96 points just verify it (within its paradigm). When there’s no redundancy, the “everything is a polynomial” paradigm doesn’t seem justified. When 96 out of 100 points are redundant, it seems like polynomials are a really good fit.
(In other words, it’s not clear how a complicated model compresses rather than obfuscates the data—though what is “complicated” is a function of the data available.)
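The redundancy argument can be sketched numerically. Assuming the “real function” is a cubic observed with a little noise (the specific coefficients and noise level here are my own illustration), a degree-3 fit leaves 96 points as verification, while a degree-99 fit spends every point on parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
true_coeffs = [2.0, -1.0, 0.5, 3.0]  # assumed "real" cubic, for illustration
y = np.polyval(true_coeffs, x) + rng.normal(0, 0.01, size=x.shape)

fit3 = np.polyfit(x, y, 3)    # 4 parameters, 96 redundant points
fit99 = np.polyfit(x, y, 99)  # 100 parameters, zero redundancy

# Evaluate between the datapoints, where redundancy is what protects you.
x_test = np.linspace(-0.99, 0.99, 1000)
y_true = np.polyval(true_coeffs, x_test)
err3 = np.max(np.abs(np.polyval(fit3, x_test) - y_true))
err99 = np.max(np.abs(np.polyval(fit99, x_test) - y_true))
print(err3, err99)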
We have models that can account for more complexity, without any clear disadvantages, so it’s unclear to me why we wouldn’t use those.
This article focused heavily on Named Distributions, and not a lot on these alternatives. (NNs were mentioned in passing.)
You see, the invention of the CLT, whilst bringing with it some evil, is still the best possible turn of events one could hope for, as we are living in the best of all possible worlds.
[1] There probably is. (Though the idea that real triangles are exactly like mathematical triangles and that can be proved via logic might be wrong.)
[2] And it says tells you exactly how wrong the rule is based on what the triangle is actually like.
Or that a 2,000,000 parameter equation will make a good approximation. (I’m not sure if that’s what you meant by “fit”.) If you have some assumptions, and use math correctly to find that the height of something is 4 ft, but it’s actually 5 ft, then the assumptions aren’t a perfect fit.
Here’s an argument made in response to your article:
Suppose I have 100 datapoints and I come with a polynomial that fits all of them, with “degree 99”. How close do you think that the polynomial is to the real function? Even if I the datapoints are all 100% accurate, and the real function is a polynomial, there is no redundancy at all. Whereas if the polynomial was of degree 3, then 4 points is enough to come up with the rule, and the other 96 points just verify it (within it’s paradigm). When there’s no redundancy the “everything is a polynomial” paradigm doesn’t seem justified. When there’s 96 redundant points, out of 100 points, it seems like polynomials are a really good fit.
(In other words, it’s not clear how a complicated model compresses rather than obfuscates the data—though what is “complicated” is a function of the data available.)
This article focused heavily on Named Distributions, and not a lot on these alternatives. (NNs were mentioned in passing.)
That sounds like an interesting bit of history.