Is “Regularity” another Phlogiston?


People have a tendency to say meaningless things, often without realizing it. An extreme example is to attribute a poorly understood phenomenon like lightning to “magic” or “God,” which is really just a fancy way of saying “I don’t know why this thing happened.” Unfortunately, it’s possible to use non-explanations by accident, while thoroughly convinced that you are offering a real explanation. I fell prey to this for a long time, attempting to explain intelligence/consciousness in terms of “emergence,” until Eliezer Yudkowsky changed my mind here. Who knows how much time I might have wasted studying cellular automata or some such interesting but irrelevant (to A.I.) topic if I hadn’t read that when I did. In the hopes of not repeating my mistake, I try to stay on the lookout for meaningless explanations.

I believe I have found one: “Regularity.” This occurs in a few forms. I will describe them in order of increasing uselessness. Beware that I am using this post to jot down some rough thoughts, so I have neither attempted to make the post understandable to a wide “non-technical” audience, nor worked out the technical details fully for every point.

The first is regularization. Machine learning engineers (and sometimes statisticians/data scientists) discuss regularizing their models to avoid overfitting. The intuition is that a very squiggly line can pass exactly through any number of points, whereas a relatively smooth line cannot. He who can explain everything can explain nothing (this is essentially a no free lunch theorem from statistical learning theory), so one should choose the simplest hypothesis class containing the true “data generating process” (or a reasonable approximation to it). That is one formalization of the idea. However, regularization is often invoked to justify techniques like weight decay, or in its simplest form ridge regression/the lasso. Applied to ordinary least squares (essentially the linear regression most of us learned in high school), ridge regression penalizes large weights to prevent the resulting hyperplane from having a very large slope in any direction. Since features can be nonlinear combinations of data entries, this does correspond intuitively to making the line through the points “smoother.” But ridge regression actually has a rigorous justification without any hand-waving about regularization: it is the MAP estimate when the data are generated by a linear function plus Gaussian noise and we place a Gaussian prior on the weights. In fact, the idea of regularization is, in my opinion, simply a rough approximation to choosing the explanation with the lowest Kolmogorov complexity, which is a formalization of Ockham’s razor. This is made rigorous by Solomonoff induction/MDL, of which practically no one who discusses regularization seems to be aware.
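To make that correspondence concrete, here is a minimal NumPy sketch of my own (not from this post’s sources); the noise scale `sigma`, prior scale `tau`, and the synthetic data are all made-up values. The point is only that the ridge solution with penalty λ = σ²/τ² and the posterior mode under the Gaussian model solve the same linear system.

```python
# Minimal sketch: ridge regression coincides with the MAP estimate under
# Gaussian noise and a Gaussian prior on the weights (hypothetical values).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + Gaussian noise
n, d = 50, 5
sigma, tau = 0.5, 2.0          # noise std and prior std (assumed)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge regression: argmin_w ||y - X w||^2 + lam * ||w||^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP estimate: maximize log N(y | X w, sigma^2 I) + log N(w | 0, tau^2 I).
# Setting the gradient to zero yields the same linear system as ridge.
w_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(d) / tau**2,
                        X.T @ y / sigma**2)

assert np.allclose(w_ridge, w_map)  # identical up to floating point error
```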

The second is the common adage that a learning process “exploits the regularities in the environment” to somehow learn more effectively. The problem with this idea is that, even if it is accurate, it doesn’t seem to explain anything. It is probably true that humans learn more efficiently by exploiting regularities in our environment, but then one ought to ask how evolution learned about those regularities in the first place. One obvious example is the metaphor between CNNs and the visual system. Apparently we can learn to recognize visual objects more quickly if we build in an understanding of the spatial symmetries of images. However, I don’t think that an A.G.I. actually needs a built-in CNN. I think an A.G.I. should be able to recognize the spatial symmetry of images on its own, design a CNN-style module for learning to recognize objects, and delegate that task to it. Perhaps I’m biased here by hoping for an elegant algorithm for intelligence, but I don’t think I’m wrong; after all, evolution designed the human visual system, which itself learns to recognize objects!

I believe that “regularities in the environment” are often invoked to get around no free lunch theorems. Intuitively, if the world were just white noise, one approach would be as good as any other: useless. So ML engineers expect that learning is in general “impossible” and therefore must be possible in our universe only because of its “regularities.” This idea is a little depressing to me, since it seems to suggest there is no simple general algorithm for intelligence, but there are two problems with it. First, there’s no reason we should expect white noise as the default, and if the universe is computable, things look a lot better for prospective learners. Second, even if learning models are exploiting some regularities in the environment, I want to know what those regularities actually are! I think most ML engineers have their curiosity extinguished before asking that question; they aren’t physicists, after all, and any fact about the world is ultimately a question of physics. But if one wishes to appeal to facts about the world to justify learning algorithms, one is obligated to study the world as well as the learning algorithms. Indeed, though there seem to be a lot of ML engineers these days, my anecdotal experience is that very few of them know physics, which is probably a bad sign for the field (to be clear, I’m sure there are plenty of perfectly good ML engineers who don’t know physics; I’m only claiming that people who would make good ML engineers probably have a propensity to study physics more often than the general software-engineering population does).
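As one concrete instance of what such a “regularity” is in the CNN case, here is a small self-contained sketch (mine, not the author’s) of the translation symmetry a convolutional layer builds in: convolving a shifted image gives, away from the wrapped border, the shifted convolution of the original, so a feature learned at one image location transfers to every other location for free.

```python
# Sketch: convolution commutes with translation (translation equivariance),
# which is the spatial "regularity" a CNN hard-codes about images.
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = rng.normal(size=(16, 16))   # stand-in for an image
kernel = rng.normal(size=(3, 3))    # stand-in for a learned filter

shift = 2  # translate the image down by 2 pixels (with wraparound)
shifted_image = np.roll(image, shift, axis=0)

# Convolving the shifted image equals shifting the convolved image,
# once the rows affected by the wraparound are discarded.
a = conv2d_valid(shifted_image, kernel)[shift:]
b = np.roll(conv2d_valid(image, kernel), shift, axis=0)[shift:]
assert np.allclose(a, b)
```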

So, what exactly is a regularity, and which ones exist in our world?