Is “Regularity” another Phlogiston?


People have a tendency to say meaningless things, often without realizing it. An extreme example is to attribute a poorly understood phenomenon like lightning to “magic” or “God,” which is really just a fancy way of saying “I don’t know why this thing happened.” Unfortunately, it’s possible to use non-explanations by accident, while thoroughly convinced that you are offering a real explanation. I fell prey to this for a long time, attempting to explain intelligence/consciousness in terms of “emergence,” until Eliezer Yudkowsky changed my mind here. Who knows how much time I might have wasted studying cellular automata or some such interesting but irrelevant (to A.I.) topic if I hadn’t read that when I did. In the hopes of not repeating my mistake, I try to stay on the lookout for meaningless explanations.

I believe I have found one: “Regularity.” This occurs in a few forms. I will describe them in order of increasing uselessness. Beware that I am using this post to jot down some rough thoughts, so I have neither attempted to make the post understandable to a wide “non-technical” audience, nor worked out the technical details fully for every point.

The first is regularization. Machine learning engineers (and sometimes statisticians/data scientists) discuss regularizing their models to avoid overfitting. The intuition is that a very squiggly line can pass exactly through any number of points, whereas a relatively smooth line cannot. He who can explain everything can explain nothing (this is essentially a no free lunch theorem from statistical learning theory), so one should choose the simplest hypothesis class containing the true “data generating process” (or a reasonable approximation to it). That is one formalization of the idea. However, regularization is often invoked to justify techniques like weight decay, or in its simplest form ridge regression/the lasso. Applied to ordinary least squares (essentially the linear regression most of us learned in high school), ridge regression penalizes large weights to prevent the resulting hyperplane from having a very large slope in any direction. Since features can be nonlinear combinations of data entries, this does correspond intuitively to making the line through the points “smoother.” But ridge regression actually has a rigorous justification without any hand-waving about regularization: it is the MAP estimate when the data are generated by a linear function plus Gaussian noise and we place a Gaussian prior on the weights. In fact, the idea of regularization is, in my opinion, simply a rough approximation to choosing the explanation with the lowest Kolmogorov complexity, which is a formalization of Ockham’s razor. This is made rigorous by Solomonoff induction/MDL, of which practically no one who discusses regularization seems to be aware.
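To make that correspondence concrete, here is a minimal NumPy sketch of my own (not from this post’s sources); the noise scale `sigma`, prior scale `tau`, and the synthetic data are all made-up values. The point is only that the ridge solution with penalty λ = σ²/τ² and the posterior mode under the Gaussian model solve the same linear system.

```python
# Minimal sketch: ridge regression coincides with the MAP estimate under
# Gaussian noise and a Gaussian prior on the weights (hypothetical values).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = X @ w_true + Gaussian noise
n, d = 50, 5
sigma, tau = 0.5, 2.0          # noise std and prior std (assumed)
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma * rng.normal(size=n)

# Ridge regression: argmin_w ||y - X w||^2 + lam * ||w||^2
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# MAP estimate: maximize log N(y | X w, sigma^2 I) + log N(w | 0, tau^2 I).
# Setting the gradient to zero yields the same linear system as ridge.
w_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(d) / tau**2,
                        X.T @ y / sigma**2)

assert np.allclose(w_ridge, w_map)  # identical up to floating point error
```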

The second is the common adage that a learning process “exploits the regularities in the environment” to somehow learn more effectively. The problem with this idea is that, even if it is accurate, it doesn’t seem to explain anything. It is probably true that humans learn more efficiently by exploiting regularities in our environment, but then one ought to ask how evolution learned about those regularities in the first place. One obvious example is the metaphor between CNNs and the visual system. Apparently we can learn to recognize visual objects more quickly if we build in an understanding of the spatial symmetries of images. However, I don’t think that an A.G.I. actually needs a built-in CNN. I think an A.G.I. should be able to recognize the spatial symmetry of images on its own, design a CNN-style module for learning to recognize objects, and delegate that task to it. Perhaps I’m biased here by hoping for an elegant algorithm for intelligence, but I don’t think I’m wrong; after all, evolution designed the human visual system, which itself learns to recognize objects!

I believe that “regularities in the environment” are often invoked to get around no free lunch theorems. Intuitively, if the world were just white noise, one approach would be as good as any other: useless. So ML engineers expect that learning is in general “impossible” and therefore must be possible in our universe only because of its “regularities.” This idea is a little depressing to me, since it seems to suggest there is no simple general algorithm for intelligence, but there are two problems with it. First, there’s no reason we should expect white noise as the default, and if the universe is computable, things look a lot better for prospective learners. Second, even if learning models are exploiting some regularities in the environment, I want to know what those regularities actually are! I think most ML engineers have their curiosity extinguished before asking that question; they aren’t physicists, after all, and any fact about the world is ultimately a question of physics. But if one wishes to appeal to facts about the world to justify learning algorithms, one is obligated to study the world as well as the learning algorithms. Indeed, though there seem to be a lot of ML engineers these days, my anecdotal experience is that very few of them know physics, which is probably a bad sign for the field (to be clear, I’m sure there are plenty of perfectly good ML engineers who don’t know physics; I’m only claiming that people who would make good ML engineers probably have a propensity to study physics more often than the general software-engineering population does).
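As one concrete instance of what such a “regularity” is in the CNN case, here is a small self-contained sketch (mine, not the author’s) of the translation symmetry a convolutional layer builds in: convolving a shifted image gives, away from the wrapped border, the shifted convolution of the original, so a feature learned at one image location transfers to every other location for free.

```python
# Sketch: convolution commutes with translation (translation equivariance),
# which is the spatial "regularity" a CNN hard-codes about images.
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation (no padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = rng.normal(size=(16, 16))   # stand-in for an image
kernel = rng.normal(size=(3, 3))    # stand-in for a learned filter

shift = 2  # translate the image down by 2 pixels (with wraparound)
shifted_image = np.roll(image, shift, axis=0)

# Convolving the shifted image equals shifting the convolved image,
# once the rows affected by the wraparound are discarded.
a = conv2d_valid(shifted_image, kernel)[shift:]
b = np.roll(conv2d_valid(image, kernel), shift, axis=0)[shift:]
assert np.allclose(a, b)
```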

So, what exactly is a regularity, and which ones exist in our world?