Could you elaborate on why you think regularized ML is an approximation to Solomonoff induction? My intuition is the opposite: these seem like very different processes. E.g., ML has strong inductive biases away from learning higher-frequency functions[1], whereas high frequency doesn't imply high K-complexity.
I’d also note that regularities mainly come from the data, and not the architecture. E.g., vision transformers are competitive with CNNs, despite lacking the image invariances that supposedly help CNNs learn, and they learn human-like shape biases as they scale.
[1] Imagine two classes separated by a sawtooth function of frequency f. Higher f would imply a more jagged decision boundary, but K-complexity increases at something like BusyBeaver⁻¹(f), which is very slow.
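A rough Python sketch of the same point (illustrative only; the particular sawtooth and constants are made up for the example):

```python
import numpy as np

def sawtooth_label(x, y, f):
    """Classify (x, y) points by which side of a sawtooth of frequency f they fall on."""
    boundary = f * x - np.floor(f * x)   # sawtooth: fractional part of f*x, period 1/f
    return (y > boundary).astype(int)

# The decision boundary gets arbitrarily jagged as f grows, but the program above
# barely changes: its description length grows only with the digits of f (and for
# specially chosen huge f, hardly at all), not with how jagged the boundary looks.
pts = np.random.rand(1000, 2)
labels_low  = sawtooth_label(pts[:, 0], pts[:, 1], f=3)
labels_high = sawtooth_label(pts[:, 0], pts[:, 1], f=10**9)
```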
When I say that regularization is an approximation to Solomonoff induction, I don’t mean it’s a perfect or uniform approximation. The usual justification for regularization is to decrease overfitting, that is, to improve generalization performance, by choosing the simplest model that fits the data. Intuitively, I believe Kolmogorov complexity is the right formalization for simplicity, and in that case regularization should incentivize models with lower Kolmogorov complexity.
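To spell out the analogy I have in mind (a loose, heuristic correspondence, not a theorem): regularized empirical risk minimization

$$\hat{w} = \arg\min_w \; L(D \mid w) + \lambda\, \Omega(w)$$

is meant to mirror the two-part / MDL reading of Solomonoff induction, which prefers the program $p$ minimizing $\ell(p) + \ell(D \mid p)$, with the penalty $\Omega(w)$ playing the role of the model’s description length $\ell(p)$. The approximation is only as good as $\Omega$ is at tracking Kolmogorov complexity, which is exactly what is at issue below.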
As you pointed out, there may be many cases in which a function with relatively low Kolmogorov complexity will still be ruled out by standard ML regularization techniques.
In the case of linear models, L1 regularization probably does tend to decrease Kolmogorov complexity by forcing some weights to 0: specifying fewer weights should take a shorter description. The picture may be more complicated for L2 regularization, though it still shrinks the weights, and since the liminf of the Kolmogorov complexity of integers does increase to infinity (very, very slowly), shrinking the weights should asymptotically decrease their Kolmogorov complexity (if not the model’s). However, I think this heuristic argument is very weak, and the reason L2 regularization works in linear models is that a Gaussian prior on weights is reasonable—an explanation in terms of Kolmogorov complexity is in that sense unnecessary.
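A small sketch of that contrast on synthetic data (using scikit-learn’s Lasso and Ridge; the data and penalty strengths are arbitrary, chosen only to make the effect visible):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
true_w = np.zeros(50)
true_w[:5] = [3.0, -2.0, 1.5, 4.0, -1.0]          # only 5 features actually matter
y = X @ true_w + 0.1 * rng.normal(size=200)

l1 = Lasso(alpha=0.1).fit(X, y)                    # L1 penalty
l2 = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty

# L1 typically leaves only a handful of nonzero weights (a shorter description);
# L2 typically keeps all 50 nonzero, just shrunk toward 0.
print("nonzero weights, L1:", np.count_nonzero(l1.coef_))
print("nonzero weights, L2:", np.count_nonzero(l2.coef_))
```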
(The example in your footnote may be slightly misleading in the linear case, because no choice of features lets any individual weight control the frequency of a sawtooth function; but it does demonstrate that some highly jagged functions have low Kolmogorov complexity, as you said.)
You seem to be focusing on deep learning when you say ML, in which case L2 regularization is usually referred to as weight decay (or Tikhonov regularization). Here I have read the argument (I don’t remember where) that regularized weights tend to put the sigmoids in their linear regime, in which case the network approximately collapses to a linear transformation, which I would expect to have low Kolmogorov complexity (see the rough check sketched below). However, you may be interested in this paper https://arxiv.org/abs/1805.08522, which claims that DL actually generalizes because the parameter-function map is biased toward functions with low Kolmogorov complexity, and NOT because of weight decay or other regularization techniques. (It’s not clear to me whether the authors argue that weight decay doesn’t lead to a simplicity bias, in which case I would be at least partially wrong, or only that weight decay is not necessary for a simplicity bias.) Still, this paper does suggest that the view I expressed is the conventional one in the ML community (so it is hopefully at least reasonable).
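For what it’s worth, a rough numerical illustration of that collapse (my own sketch, with tanh standing in for the sigmoid and an arbitrary small weight scale standing in for strong decay):

```python
import numpy as np

rng = np.random.default_rng(0)
scale = 0.01                               # stand-in for heavily decayed weights
W1 = scale * rng.normal(size=(64, 32))
W2 = scale * rng.normal(size=(10, 64))

x = rng.normal(size=(32,))
nonlinear = W2 @ np.tanh(W1 @ x)           # the actual two-layer tanh network
linear    = (W2 @ W1) @ x                  # its linearization around 0

# Tiny difference, since tanh(z) ≈ z when the pre-activations z are small;
# the network is then well described by a single linear map W2 @ W1.
print(np.max(np.abs(nonlinear - linear)))
```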
As for your second point, I should note that I discuss two uses of the word regularity—one refers to explicitly regularizing a learned model, the other to regularities in the world (or the data). Ideally our models will pick up on regularities in the data! But I am pointing out that “regularities in the data” should not be used to “explain away” unexpected generalization performance, and should instead make us curious about which precise regularities are being exploited, and where they come from.