nostalgebraist comments on Why Neural Networks Generalise, and Why They Are (Kind of) Bayesian

nostalgebraist 4 Mar 2021 6:27 UTC
LW: 9 AF: 4
0
AF
Most complexity measures give roughly similar values for the (relative) complexity of most objects
I’ll write mostly about this statement, as I think it’s the crux of our disagreement.
The statement may be true as long as we hold the meaning of “objects” constant as we vary the complexity measure.
However, if we translate objects from one mathematical space to another (say by discretizing, or adding/removing a metric structure), we can’t simply say the complexity measures for space A on the original A-objects inevitably agree with those space B on the translated B-objects. Whether this is true depends on our choice of translation.
(This is clear in the trivial cases of bad translation where we, say, map every A-object onto the same B-object. Now, obviously, no one would consider this a correct or adequate way to associate A-objects with B-objects. But the example shows that the claim about complexity measures will only hold if our translation is “good enough” in some sense. If we don’t have any idea what “good enough” means, something is missing from the story.)
In the problem at hand, the worrying part of the translation from real to boolean inputs is the loss of metric structure. (More precisely, the hand-waviness about what metric structure survives the translation, if any.) If there’s no metric, this destroys the information needed by complexity measures that care about how easy it is to reconstruct an object “close to” the specified one.
Basic information theory doesn’t require a metric, only a measure. There’s no sense of “getting an output approximately right,” only of “getting the exactly right output with high probability.” If you care about being approximately right according to some metric, this leads you to rate-distortion theory.
Both of these domains—information theory without a metric, and with one—define notions of incompressibility/complexity, but they’re different. Consider two distributions on R:
1. The standard normal,
2. The standard normal, but you chop it into a trillion pieces on the x axis, and translate the pieces to arbitrary locations in R
According to basic information theory, these are equally simple/compressible. (They have the same differential entropy, or the same K-L divergence from a uniform distribution if you want to be pedantic.)
But in rate-distortion theory, (1) is way more simple/compressible than (2). If you’re coding (2) over a noisy channel, you have to distinguish really hard between (say) a piece that stayed in place at [0, 0.1] and another piece that got translated to [1e8, 1e8 + 0.1]. Whereas if you’re coding a standard normal, with its light tails, a 1e8-magnitude mistake is effectively impossible.
If you do all your analysis in the metric-less space, hoping it will cleanly pass over to the metric space at the end, you have no way of distinguishing these two possibilities. When you remove the metric, they’re identical. So you have limited power to predict what the rate-distortion theory notion of complexity is going to say, once you put the metric back in.