Thanks, this clarifies many things! Thanks also for linking to your very comprehensive post on generalization.
To be clear, I didn’t mean to claim that VC theory explains NN generalization. It is indeed famously bad at explaining modern ML. But “models have singularities and thus number of parameters is not a good complexity measure” is not a valid criticism of VC theory: VC bounds are stated in terms of the VC dimension of the hypothesis class, not the raw parameter count, so singularities in parameter space don’t undermine them. If SLT indeed helps figure out the mysteries from the “understanding deep learning...” paper then that will be amazing!
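For concreteness, here is one standard form of the VC bound (notation mine; the constants vary by presentation). The point is that only the VC dimension enters, never the raw parameter count:

```latex
% One standard form of the VC generalization bound (constants differ
% across textbooks). The complexity measure is the VC dimension d_VC of
% the hypothesis class H, not the number of parameters of the model.
\[
  \text{w.p. } \ge 1-\delta:\quad
  \forall h \in \mathcal{H},\;
  R(h) \;\le\; \widehat{R}_n(h)
  + \sqrt{\frac{8}{n}\left( d_{\mathrm{VC}} \ln\frac{2en}{d_{\mathrm{VC}}}
    + \ln\frac{4}{\delta} \right)} .
\]
```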
> But what we’d really like to get at is an understanding of how perturbations to the true distribution lead to changes in model behavior.
Ah, I didn’t realize earlier that this was the goal. Are there any theorems that use SLT to quantify out-of-distribution generalization? The SLT papers I have read so far seem to still be talking about in-distribution generalization, with the added comment that Bayesian learning/SGD is more likely to give us “simpler” models, and simpler models generalize better.
> Ah, I didn’t realize earlier that this was the goal. Are there any theorems that use SLT to quantify out-of-distribution generalization? The SLT papers I have read so far seem to still be talking about in-distribution generalization, with the added comment that Bayesian learning/SGD is more likely to give us “simpler” models, and simpler models generalize better.
Sumio Watanabe has two papers on out-of-distribution generalization:

Asymptotic Bayesian generalization error when training and test distributions are different

> In supervised learning, we commonly assume that training and test data are sampled from the same distribution. However, this assumption can be violated in practice, and then standard machine learning techniques perform poorly. This paper focuses on revealing and improving the performance of Bayesian estimation when the training and test distributions are different. We formally analyze the asymptotic Bayesian generalization error and establish its upper bound under a very general setting. Our important finding is that lower-order terms (which can be ignored in the absence of the distribution change) play an important role under the distribution change. We also propose a novel variant of stochastic complexity which can be used for choosing an appropriate model and hyper-parameters under a particular distribution change.

Experimental Bayesian Generalization Error of Non-regular Models under Covariate Shift

> In the standard setting of statistical learning theory, we assume that the training and test data are generated from the same distribution. However, this assumption does not hold in many practical cases, e.g., brain-computer interfacing and bioinformatics. In particular, a change of the input distribution in the regression problem often occurs and is known as covariate shift. There are many studies on adapting to this change, since ordinary machine learning methods do not work properly under the shift. Asymptotic theory has also been developed for Bayesian inference. Although many effective results have been reported for regular statistical models, non-regular models have not been well studied. This paper focuses on the behavior of non-regular models under covariate shift. In the former study [1], we formally revealed the factors changing the generalization error and established its upper bound. Here we report that experimental results support the theoretical findings. Moreover, it is observed that the basis function in the model plays an important role in some cases.
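To make the covariate-shift failure mode concrete, here is a minimal sketch of my own (not Watanabe’s experiment): a misspecified linear model fit on one input distribution and evaluated on a shifted one. All names and numbers are illustrative.

```python
# Minimal covariate-shift sketch (my own illustration, not Watanabe's
# experiment). Training inputs are drawn from q(x) = N(0,1), test inputs
# from q'(x) = N(3,1); the target is nonlinear but the model is linear,
# so the fit learned on the training region extrapolates badly.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(x)  # true regression function

# Training data: x ~ N(0, 1), small observation noise
x_tr = rng.normal(0.0, 1.0, 500)
y_tr = target(x_tr) + rng.normal(0.0, 0.1, 500)

# Fit a linear model y = a*x + b by least squares
A = np.stack([x_tr, np.ones_like(x_tr)], axis=1)
coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)

def predict(x):
    return coef[0] * x + coef[1]

# In-distribution test: x ~ N(0,1); shifted test: x ~ N(3,1)
x_in = rng.normal(0.0, 1.0, 10_000)
x_out = rng.normal(3.0, 1.0, 10_000)
mse_in = np.mean((predict(x_in) - target(x_in)) ** 2)
mse_out = np.mean((predict(x_out) - target(x_out)) ** 2)
print(f"in-distribution MSE:  {mse_in:.3f}")
print(f"covariate-shift MSE:  {mse_out:.3f}")  # much larger
```

The in-distribution error stays near the noise floor while the shifted error blows up, since the linear fit is only a good local approximation near the training inputs; this is one sense in which “the basis function in the model plays an important role.”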
> But “models have singularities and thus number of parameters is not a good complexity measure” is not a valid criticism of VC theory.
Right, this quote is really a criticism of the classical Bayesian Information Criterion (BIC); the relevant SLT generalization is Watanabe’s widely applicable Bayesian information criterion (WBIC).
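For readers who haven’t seen the contrast, a rough statement in my own notation (see Watanabe’s WBIC paper for the precise conditions):

```latex
% BIC penalizes with (d/2) log n, where d is the parameter count. For
% singular models the correct coefficient is the real log canonical
% threshold (RLCT) lambda <= d/2; WBIC recovers it without computing
% lambda explicitly, via the tempered posterior at beta = 1/log n.
\[
  \mathrm{BIC} \;=\; -\log p(X^n \mid \hat{w}) + \frac{d}{2}\log n,
  \qquad
  \mathrm{WBIC} \;=\; \mathbb{E}^{\beta}_{w}\!\left[-\log p(X^n \mid w)\right],
  \quad \beta = \frac{1}{\log n},
\]
% where the expectation is over the tempered posterior
% p_beta(w | X^n) proportional to p(X^n | w)^beta p(w). Watanabe shows
% WBIC matches the Bayes free energy to leading order: both behave as
% -log p(X^n | w_0) + lambda log n, up to lower-order terms.
```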
> Ah, I didn’t realize earlier that this was the goal. Are there any theorems that use SLT to quantify out-of-distribution generalization? The SLT papers I have read so far seem to still be talking about in-distribution generalization, with the added comment that Bayesian learning/SGD is more likely to give us “simpler” models, and simpler models generalize better.
That’s right: existing work is about in-distribution generalization. Within the Bayesian setting, SLT provides an essentially complete account of it. As you’ve pointed out, there remain differences between Bayes and SGD. We’re working on applications to out-of-distribution generalization but have not put anything out publicly about this yet.
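For concreteness, the headline in-distribution result as I understand it, stated in my own notation for the realizable case:

```latex
% Watanabe's asymptotics for the Bayes generalization error in the
% realizable case: the expected KL divergence from the true distribution
% q to the Bayes predictive distribution decays at rate lambda/n, where
% lambda is the RLCT. For regular models lambda = d/2 (classical rate).
\[
  \mathbb{E}\!\left[G_n\right] \;=\; \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
  \qquad
  G_n \;=\; \mathrm{KL}\!\left(q(x)\,\middle\|\,p(x \mid X^n)\right).
\]
```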