I’m afraid it was my downvote. One example of using compression is Hinton’s deep autoencoder networks, which work (although he doesn’t put it this way, the math does) by fine-tuning each layer so as to minimize the entropy of the node activations when presented with the items to be learned. In other words: instead of trying to figure out what features to detect, develop features that compress the original information well. Magically, these features turn out to be very good for categorization.
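As a rough illustration of that recipe (a minimal sketch, not Hinton’s actual layer-wise pretraining; the layer sizes, optimizer, and loss choice are my own assumptions), here is a single-layer autoencoder whose bottleneck is forced to carry a compressed code of the input, and whose hidden activations could then be reused as features for a classifier:

```python
# Illustrative sketch only: a one-layer autoencoder whose bottleneck must
# carry a compressed code of the input. Not Hinton's exact layer-wise
# procedure; sizes and hyperparameters are made up for the example.
import torch
import torch.nn as nn

n_visible, n_hidden = 784, 64          # e.g. 28x28 images squeezed into 64 units

encoder = nn.Sequential(nn.Linear(n_visible, n_hidden), nn.Sigmoid())
decoder = nn.Sequential(nn.Linear(n_hidden, n_visible), nn.Sigmoid())

params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.BCELoss()                 # cross-entropy between input and reconstruction

def train_step(x):                     # x: batch of inputs scaled to [0, 1]
    code = encoder(x)                  # the "features": a compressed representation
    recon = decoder(code)
    loss = loss_fn(recon, x)           # good reconstruction from few units = good compression
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```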
AI was seldom thought of as compression until about 1986. Also, AI wasn’t very good until 1986. Pre-1986, learning was ignored, and logic was king. All the pure-logic approaches suffer from combinatorial explosion, because they don’t use entropy to enumerate possibilities in order of usefulness. The hard problems of compression were hidden by supplying AI programs with knowledge already compressed into symbols in the appropriate way; but they still didn’t work, unless the number of possible actions/inferences was also restricted artificially.
There are people, like Rodney Brooks, who say logic isn’t necessary at all. I wouldn’t go that far. So, I overstated: There is work to be done in AI that isn’t about compression, except in a very abstract way. Lots of work has been done without thinking of it as being compression. But I would say that the hard stuff that gives us problems (categorization, similarity; recognizing, recalling, and managing state-space trajectories) is closely tied to questions of compression.
I have a minor disagreement, which I think supports your general point. There is definitely a type of compression going on in the algorithm; it’s just that the key insight is not to blindly “minimize entropy” but rather to make the outputs of the encoder behave like the observed data. Indeed, one of the major insights of information theory is that one wants the encoding scheme to capture the properties of the distribution over the messages (and hence over the alphabets).
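For concreteness (this is standard source-coding material, not something taken from Hinton’s papers): if the messages really come from a distribution p but the code is built for an assumed distribution q, the expected code length is

```latex
\mathbb{E}_{x \sim p}\!\left[-\log_2 q(x)\right] \;=\; H(p) \;+\; D_{\mathrm{KL}}(p \,\|\, q),
```

so the only way to avoid paying the extra D_KL(p||q) bits per message is for the code’s implied distribution q to match the true message distribution p. That is the sense in which a good encoder has to “behave like” the data.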
Namely, in Hinton’s algorithm the outputs of the encoder are fed through a logistic function, and then the cross-entropy is minimized (essentially the KL divergence). It seems he is really providing something like a reparameterization of a probability mass function for pixel intensities, one that is a logistic distribution when conditioned on the “deeper” nodes. Minimizing that KL divergence makes the model distribution statistically indistinguishable from the distribution over the data intensities, since the KL divergence is the expected log likelihood ratio, and driving it toward zero minimizes the power of the uniformly most powerful test for telling the two distributions apart.
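Spelled out for a single pixel (my notation, not Hinton’s: x is the observed intensity treated as a Bernoulli parameter, and x̂ is the logistic output of the decoder), the cross-entropy being minimized splits into a data term and a KL term:

```latex
-\,x \log \hat{x} \;-\; (1 - x)\log(1 - \hat{x})
  \;=\; H(x) \;+\; D_{\mathrm{KL}}\!\bigl(\mathrm{Bern}(x) \,\|\, \mathrm{Bern}(\hat{x})\bigr).
```

The entropy term H(x) is fixed by the data, so all the training can do is drive the KL term toward zero, i.e. make the reconstruction distribution match the data distribution, rather than rewarding constant, low-entropy outputs.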
Minimizing entropy blindly would mean the neural network nodes would give constant output, which is very compressive but utterly useless.