Any model does compression, period. What is the particular relevance to AI? And if that was your downvote on my other comment, how does thinking of AI in terms of compression help to develop AI?
I’m afraid it was my downvote. One example of using compression is Hinton’s deep autoencoder networks, which work (although he doesn’t say this, the math does) by fine-tuning each layer so as to minimize the entropy of the node activations when presented with the items to be learned. In other words: instead of trying to figure out what features to detect, develop features that compress the original information well. Magically, these features turn out to be very good for categorization.
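For the curious, here is a minimal sketch of the general idea (my own toy illustration, not Hinton’s actual training procedure): a linear autoencoder that squeezes 8-dimensional data through a 2-unit bottleneck by minimizing reconstruction error, so the hidden layer is forced to discover a compressed representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points that secretly live on a 2-D subspace of R^8,
# so an 8 -> 2 -> 8 autoencoder can compress them almost losslessly.
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 8))

# Encoder W1 and decoder W2 (linear, to keep the sketch short).
W1 = rng.normal(scale=0.1, size=(8, 2))
W2 = rng.normal(scale=0.1, size=(2, 8))

def mse():
    return np.mean((X @ W1 @ W2 - X) ** 2)

mse_before = mse()
lr = 0.05
for _ in range(2000):
    H = X @ W1                           # encode: 2-D compressed code
    err = H @ W2 - X                     # reconstruction error
    grad_W2 = H.T @ err / len(X)         # gradient descent on mean
    grad_W1 = X.T @ (err @ W2.T) / len(X)  # squared reconstruction error
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

print(mse_before, mse())  # reconstruction error drops sharply
```

The point is that nobody tells the network which features matter: whatever survives the bottleneck while keeping reconstruction error low is, by construction, a compressed description of the data.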
AI was seldom thought of as compression until about 1986. Also, AI wasn’t very good until 1986. Pre-1986, learning was ignored, and logic was king. All the pure-logic approaches suffer from combinatorial explosion, because they don’t use entropy to enumerate possibilities in order of usefulness. The hard problems of compression were hidden by supplying AI programs with knowledge already compressed into symbols in the appropriate way; but they still didn’t work, unless the number of possible actions/inferences was also restricted artificially.
There are people, like Rodney Brooks, who say logic isn’t necessary at all. I wouldn’t go that far. So, I overstated: There is work to be done in AI that isn’t about compression, except in a very abstract way. Lots of work has been done without thinking of it as being compression. But I would say that the hard stuff that gives us problems (categorization, similarity; recognizing, recalling, and managing state-space trajectories) is closely tied to questions of compression.
I have a minor disagreement, which I think supports your general point. There is definitely a type of compression going on in the algorithm; it’s just that the key insight is not to blindly “minimize entropy” but rather to make the outputs of the encoder behave similarly to the observed data. Indeed, one of the major insights of information theory is that one wants the encoding scheme to capture the properties of the distribution over the messages (and hence over alphabets).
Namely, in Hinton’s algorithm the outputs of the encoder are fed through a logistic function and then the cross-entropy is minimized (essentially the KL divergence). It seems he is really providing something like a reparameterization of a probability mass function for pixel intensities, one which is a logistic distribution when conditioned on the “deeper” nodes. Minimizing that KL divergence makes the distribution statistically indistinguishable from the distribution over the data intensities (since minimizing the KL divergence minimizes the expected log-likelihood ratio, which means minimizing the power of the uniformly most powerful test).
Minimizing entropy blindly would mean the neural network nodes would give constant output, which is very compressive but utterly useless.
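To make that distinction concrete, here is a small numerical check (my own illustration, nothing from Hinton’s code): for discrete distributions p (data) and q (model output), cross-entropy decomposes as H(p, q) = H(p) + KL(p‖q). Since H(p) is fixed by the data, minimizing cross-entropy means matching q to p; minimizing the output’s own entropy, by contrast, is solved by a near-constant distribution.

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])   # "data" distribution
q = np.array([0.5, 0.3, 0.2])   # model's output distribution

# Identity: H(p, q) = H(p) + KL(p || q).  H(p) is fixed by the data,
# so minimizing cross-entropy drives q toward p.
assert np.isclose(cross_entropy(p, q), entropy(p) + kl(p, q))

# Minimizing the *output entropy* alone is solved by a near-constant
# distribution: highly compressive, but a far worse model of p.
q_const = np.array([0.998, 0.001, 0.001])
print(entropy(q_const) < entropy(q))  # True: lower entropy...
print(kl(p, q_const) > kl(p, q))      # True: ...but a much worse fit
```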
[A text with some decent discussion on the topic](http://www.inference.phy.cam.ac.uk/mackay/itila/book.html). At least one group that has a shot at winning a major speech recognition benchmark competition uses information-theoretic ideas in the development of its speech recognizer. Another development has been the use of error-correcting codes to assist in multi-class classification problems ([google “error correcting codes machine learning”](http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=error+correcting+codes+machine+learning)); arguably this has been the clearest example of a paradigm shift, born of thinking about compression, that had a big impact in machine learning. I don’t know how many people think about these problems in terms of information theory (since I don’t have much access to their thoughts), but I do know at least two very competent researchers who, although they never bring it up outright in their papers, have an information-theory- and compression-oriented way of posing and thinking about problems.
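The error-correcting-codes idea can be sketched in a few lines (a toy illustration with a hypothetical code matrix, not any particular published scheme): each class is assigned a binary codeword, one binary classifier is trained per bit, and prediction decodes the classifiers’ outputs to the nearest codeword, so a minority of wrong bit-classifiers gets corrected.

```python
import numpy as np

# Hypothetical 4-class code matrix: each row is a class's 7-bit codeword.
# Rows are at Hamming distance >= 4 from each other, so any single
# erroneous bit-classifier can be corrected at decoding time.
codebook = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 1, 0, 1],
    [1, 1, 0, 1, 1, 1, 0],
])

def decode(bits):
    """Map a predicted bit vector to the class with the nearest codeword."""
    dists = np.sum(codebook != bits, axis=1)   # Hamming distances
    return int(np.argmin(dists))

# Suppose the 7 binary classifiers predict class 2's codeword,
# but one of them (bit 3) is wrong:
noisy = np.array([1, 0, 1, 0, 1, 0, 1])
print(decode(noisy))  # -> 2: the single bit error is corrected
```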
I often try to think of how humans process speech in terms of information theory (an approach inspired by a couple of great thinkers in the area), and so I think it is useful for understanding and probing questions of sensory perception.
There’s also a whole literature on “sparse coding” (another compression-oriented idea, originally developed by biologists but since ported over by computer vision and a few speech researchers) whose promise in machine learning may not have been realized yet, but I have seen at least a couple of somewhat impressive applications of related techniques.
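For a flavor of what sparse coding does (my own toy sketch using the standard ISTA iteration, not any particular paper’s algorithm): represent a signal as a combination of as few dictionary atoms as possible, by minimizing squared reconstruction error plus an L1 penalty.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dictionary: 20 unit-norm atoms in R^10 (overcomplete).
D = rng.normal(size=(10, 20))
D /= np.linalg.norm(D, axis=0)

# A signal built from only 2 of the 20 atoms.
a_true = np.zeros(20)
a_true[[3, 11]] = [1.5, -2.0]
x = D @ a_true

def ista(x, D, lam=0.1, steps=500):
    """Minimize 0.5 * ||x - D a||^2 + lam * ||a||_1 by
    iterative shrinkage-thresholding (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2       # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(steps):
        z = a - D.T @ (D @ a - x) / L   # gradient step on the quadratic term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a

a = ista(x, D)
print(np.count_nonzero(np.abs(a) > 1e-3))  # only a few atoms used...
print(np.linalg.norm(D @ a - x))           # ...yet reconstruction is close
```

The L1 penalty is what makes the code sparse: most coefficients are driven exactly to zero, which is the compression, while the few surviving atoms carry the signal’s structure.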
Thanks to you and PhilGoetz for those references. I have updated my estimate of the subject.