My understanding was that distilling CNNs worked more-or-less by removing redundant weights, rather than by discovering a more efficient form of representing the data. Distilled CNNs are still CNNs and thus the argument follows.
My point was that you couldn’t do better than just memorizing the features that make up a cat. I should clarify that I do think that deep neural networks often have a lot of wasted information (though I believe removing some of it incurs a cost in robustness). The question is whether future insights will allow us to do much better than what we currently do.
My understanding was that distilling CNNs worked more-or-less by removing redundant weights, rather than by discovering a more efficient form of representing the data.
No. That might describe sparsification, but it doesn’t describe distillation, and in either case, it’s shameless goalpost moving—by handwaving away all the counterexamples, you’re simply no-true-Scotsmanning progress. ‘Oh, Transformers? They aren’t real performance improvements because they just learn “good representations of the data”. Oh, model sparsification and compression and distillation? They aren’t real compression because they’re just getting rid of “wasted information”.’
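To make the distinction concrete, here is a toy NumPy sketch of the two techniques being contrasted. This is a deliberate simplification, not how either is done in practice: the "teacher" is a single linear layer rather than a CNN, the "student" is a rank-2 factorization, and plain gradient descent stands in for a real training loop. The point is only the structural difference: pruning zeroes weights inside the same architecture, while distillation trains a different, smaller model to imitate the teacher's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Teacher": a single dense layer standing in for a trained network.
teacher_W = rng.normal(size=(8, 4))
X = rng.normal(size=(100, 8))            # some input data
targets = X @ teacher_W                  # teacher outputs = training signal

# --- Sparsification: zero out low-magnitude weights in place. ---
# The architecture is unchanged; "redundant" parameters are removed.
threshold = np.quantile(np.abs(teacher_W), 0.5)
sparse_W = np.where(np.abs(teacher_W) >= threshold, teacher_W, 0.0)

# --- Distillation: train a smaller student to imitate the teacher. ---
# The student here is a rank-2 factorization A @ B: a different,
# smaller parameterization, not a pruned copy of the teacher's weights.
A = rng.normal(size=(8, 2)) * 0.1
B = rng.normal(size=(2, 4)) * 0.1
lr = 0.01
init_loss = np.mean((X @ A @ B - targets) ** 2)
for _ in range(2000):
    err = X @ A @ B - targets            # imitation error vs. teacher
    gA = X.T @ err @ B.T / len(X)        # gradient of the imitation loss
    gB = A.T @ X.T @ err / len(X)
    A -= lr * gA
    B -= lr * gB
final_loss = np.mean((X @ A @ B - targets) ** 2)
```

After training, `sparse_W` has fewer nonzero weights but the same shape as the teacher, while the student has learned a compressed representation of the teacher's input–output map in an entirely different parameterization.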
I removed this post because you convinced me it was sufficiently ill-composed. I still disagree strongly, because I don’t really understand how you could agree with the person in the analogy. And again, CNNs still seem pretty good at representing data to me, and it’s still unclear why model distillation disproves this.