Of course, most of us would be very skeptical. Not just because insights of that magnitude are rarely discovered by a single person or small team, but also because it’s hard to see how there could be a simple core to image classification. The reason you can recognize a cat is not because cats are simple things in thingspace and are therefore easily identifiable; it’s because there are a bunch of things that make cats cat-like, and you understand a lot about the world. Current image classifiers recognize cats because they have learned a bunch of features: whiskers, ears, legs, fur, eyes, tails, etc., and they leverage this learned knowledge to identify cats. Humans recognize cats because they have learned a bunch of information about animals, bodies, moving objects, and some domain-specific information about cats, and they leverage this learned knowledge to identify cats. Either way, there’s no way around the fact that you need to know a lot in order to understand what is and isn’t a cat. Image classification just isn’t the type of thing that should be easily compressible, because by compressing it, you lose important learned information that can be used to identify features of the world. In fact, I think we can say the same about many areas of intelligence.
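As a rough illustration of what “learned features” means here, a minimal sketch, assuming torchvision is available, that inspects a pretrained ResNet-18: early layers hold edge- and texture-like features, late layers hold part-like features, and the classifier head leverages all of them. The layer choice and the random input are illustrative stand-ins, not anything claimed in this thread.

```python
import torch
from torchvision import models

# Sketch: a pretrained classifier's "knowledge" of cats is spread
# across learned intermediate features, not stored as one compact rule.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

feats = {}
def save(name):
    def hook(module, inputs, output):
        feats[name] = output.detach()
    return hook

model.layer1.register_forward_hook(save("early"))  # edge/texture-like features
model.layer4.register_forward_hook(save("late"))   # part-like features (ears, fur, ...)

x = torch.randn(1, 3, 224, 224)  # stand-in for a cat photo
with torch.no_grad():
    logits = model(x)

print({name: tuple(t.shape) for name, t in feats.items()})
# e.g. {'early': (1, 64, 56, 56), 'late': (1, 512, 7, 7)}
```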
According to you, the entire field of model distillation & compression, whose paradigmatic use-case is compressing image-classification CNNs down to 10% or 1% (or less) of their original size and running them on your smartphone, which is not even that hard in practice, is impossible and cannot exist. That seems a little puzzling.
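For concreteness, this is roughly what the distillation objective looks like: a minimal PyTorch sketch of the Hinton-style setup, where a small student is trained to match a large teacher’s temperature-softened outputs. The temperature and mixing weight below are illustrative defaults, not anything from this thread.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T=4.0, alpha=0.9):
    """Hinton-style knowledge distillation: the student matches the
    teacher's temperature-softened class probabilities, mixed with a
    standard cross-entropy term on the true labels."""
    soft_teacher = F.log_softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    # T^2 rescaling keeps the soft-target gradients comparable in
    # magnitude as the temperature changes.
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean",
                  log_target=True) * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Usage sketch: the student can be a much smaller architecture than
# the teacher; nothing requires it to keep the teacher's weights.
# loss = distillation_loss(student(x), teacher(x).detach(), y)
```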
My understanding was that distilling CNNs worked more or less by removing redundant weights, rather than by discovering a more efficient form of representing the data. Distilled CNNs are still CNNs, so the argument still applies.
My point was that you couldn’t do better than just memorizing the features that make up a cat. I should clarify that I do think that deep neural networks often have a lot of wasted information (though I believe removing some of it incurs a cost in robustness). The question is whether future insights will allow us to do much better than what we currently do.
My understanding was that distilling CNNs worked more or less by removing redundant weights, rather than by discovering a more efficient form of representing the data.
No. That might describe sparsification, but it doesn’t describe distillation, and in either case, it’s shameless goalpost moving—by handwaving away all the counterexamples, you’re simply no-true-Scotsmanning progress. ‘Oh, Transformers? They aren’t real performance improvements because they just learn “good representations of the data”. Oh, model sparsification and compression and distillation? They aren’t real compression because they’re just getting rid of “wasted information”.’
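To make the distinction concrete: “removing redundant weights” describes something like the following global magnitude-pruning sketch (the pruning fraction is an illustrative choice). Pruning zeroes out weights of the existing network in place; distillation, as sketched above, trains a new and typically much smaller model against the teacher’s outputs, so it is not limited to the teacher’s architecture at all.

```python
import torch

def magnitude_prune(model, fraction=0.9):
    """Sparsification sketch: zero out the smallest `fraction` of
    weights by absolute value across all weight tensors. The
    architecture is unchanged; only redundant weights are removed."""
    all_weights = torch.cat([p.abs().flatten()
                             for name, p in model.named_parameters()
                             if name.endswith("weight")])
    threshold = torch.quantile(all_weights.float(), fraction)
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name.endswith("weight"):
                p.mul_((p.abs() > threshold).to(p.dtype))
```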
I removed this post because you convinced me it was sufficiently ill-composed. I still strongly disagree, because I don’t really see how you could agree with the person in the analogy. And again, CNNs still seem pretty good at representing data to me, and it’s still unclear why model distillation disproves this.