I wasn’t doing learning theory in 2016, but the canonical textbook, Shalev-Shwartz & Ben-David (2014), covers both nonuniform learning and PAC-Bayes, so I’m a bit confused: both of those approaches were known at the time and sidestep the killer results from Zhang et al. (2016).
In nonuniform learning, you split up your hypothesis class into a union of countably many smaller classes $\mathcal{H} = \bigcup_{n=1}^{\infty} \mathcal{H}_n$, use VC dimension or Rademacher complexity to get generalization bounds for each, weight them some way like $w_n = 2^{-n}$ (so $\sum_n w_n \le 1$), and then with probability $1-\delta$ receive a bound that looks like
$$L_{\mathcal{D}}(h) \;\le\; L_S(h) + \min_{n : h \in \mathcal{H}_n} \epsilon_n(m, w_n \delta) \quad \text{for all } h \in \mathcal{H},$$
where the generalization bound $\epsilon_n(m, \delta')$ for $\mathcal{H}_n$ is the one that holds with probability $1-\delta'$. Based on this bound, you’d want to optimize a linear combination of the training loss and the model’s complexity (which controls the generalization bound), and you can still end up with models that provably generalize even though your overall hypothesis class contains extremely complicated models that don’t generalize.
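To make that recipe concrete, here is a minimal Python sketch of the structural-risk-minimization objective it implies. The numbers, the `srm_objective` name, and the exact VC-style penalty are my own illustrative assumptions, not the textbook's precise bound; the point is just that each class pays a complexity penalty scaled by its weight $w_n$, and you pick the class minimizing training loss plus penalty.

```python
import numpy as np

def srm_objective(train_loss, n, m, delta, vc_dim):
    """Training loss plus a VC-style complexity penalty for class H_n.

    Uses the weighting w_n = 2**-n, so the bound for H_n must hold with
    probability 1 - w_n * delta; a generic VC-type bound then gives a
    penalty on the order of sqrt((vc_dim + log(1 / (w_n * delta))) / m).
    (Illustrative form only, constants omitted.)
    """
    w_n = 2.0 ** (-n)
    penalty = np.sqrt((vc_dim + np.log(1.0 / (w_n * delta))) / m)
    return train_loss + penalty

# Toy usage: richer classes fit the training set better but pay a larger penalty.
m, delta = 10_000, 0.05
candidates = [  # (class index n, training loss of best h in H_n, VC dim of H_n)
    (1, 0.20, 10),
    (2, 0.08, 100),
    (3, 0.01, 5_000),
]
best = min(candidates, key=lambda c: srm_objective(c[1], c[0], m, delta, c[2]))
print("selected class:", best[0])
```

In this toy setup the middle class wins: the very rich class drives training loss to near zero but its penalty swamps the gain, which is exactly the tradeoff the nonuniform bound is supposed to arbitrate.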
Wouldn’t a deep learning theory researcher in 2016 say that Zhang et al. (2016) simply proves that the success of neural networks involves some kind of nonuniform learning using regularization implicit in SGD, so we just need to find the right measure of complexity (and evidently, weight norm was not the right one) to get good generalization bounds?