My understanding was that distilling CNNs works more or less by removing redundant weights, rather than by discovering a more efficient way of representing the data.
No. That might describe sparsification, but it doesn’t describe distillation, and in either case, it’s shameless goalpost moving—by handwaving away all the counterexamples, you’re simply no-true-Scotsmanning progress. ‘Oh, Transformers? They aren’t real performance improvements because they just learn “good representations of the data”. Oh, model sparsification and compression and distillation? They aren’t real compression because they’re just getting rid of “wasted information”.’
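For reference, the distinction being drawn here can be sketched concretely. The following is an illustrative toy example (the numbers and array sizes are invented, not taken from either comment): sparsification operates on the weights of the *same* network, while distillation trains a *different*, smaller network to match the teacher's softened output distribution.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Sparsification: zero out small-magnitude weights of the SAME network.
weights = np.array([0.9, -0.01, 0.4, 0.002, -0.7])
pruned = np.where(np.abs(weights) > 0.05, weights, 0.0)

# Distillation: a smaller student is trained so its output distribution
# matches the teacher's temperature-softened outputs (Hinton-style),
# typically by minimizing the KL divergence between them.
teacher_logits = np.array([4.0, 1.0, 0.2])   # hypothetical teacher outputs
student_logits = np.array([3.0, 1.5, 0.5])   # hypothetical student outputs
T = 2.0
p = softmax(teacher_logits, T)
q = softmax(student_logits, T)
kl = float(np.sum(p * np.log(p / q)))  # the distillation loss term
```

Under this framing, pruning never changes the architecture, whereas the distilled student can have an entirely different (and smaller) one, which is why the two are not the same operation.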
I removed this post because you convinced me it was poorly composed. I still disagree strongly, because I don’t see how you could agree with the person in the analogy. And again, CNNs still seem pretty good at representing data to me, and it’s still unclear why model distillation disproves that.