> The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. θ is next set to the value θ_0 for which R_CV(θ) is smallest and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m.
So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.
Besides, even cross-validation for model selection is suspicious. Shouldn’t I, ideally, train all models with all the data and form a posterior over the most probable values?
> So, I use cross-validation to choose a model. Then I use the same data to train the model. Insanity ensues.
Why? A model has two components: the hyperparameters and the parameters. The hyperparameters are inputs to the model, and the parameters are calculated from the hyperparameters and the training data. (This is a very similar approach to what are called ‘hierarchical Bayesian models.’)
Instead of pulling a prior out of thin air for the hyperparameters, this asks the question “which hyperparameters are best for generalizing models to test sets outside the training set?”, which is a different question from “which parameters maximize the likelihood of this data?”
(I should add that some people call it ‘cross-tuning’ when you report a model whose hyperparameters were selected by this sort of process without a third, held-out dataset that played no part in the tuning. Standard practice in ML is to still call it ‘cross-validation.’)
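To make the select-then-refit recipe concrete, here is a minimal sketch assuming ridge regression, with the regularization strength playing the role of the hyperparameter θ; the toy data, grid of candidate values, and all function names are illustrative, not anyone's canonical implementation:

```python
import numpy as np

def fit_ridge(X, y, theta):
    """Closed-form ridge fit: the parameters are computed from the
    training data *and* the hyperparameter theta."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + theta * np.eye(d), X.T @ y)

def cv_error(X, y, theta, n_folds=5):
    """n-fold cross-validation MSE for a single hyperparameter value."""
    folds = np.array_split(np.arange(len(y)), n_folds)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        w = fit_ridge(X[train_idx], y[train_idx], theta)
        errs.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return np.mean(errs)

# Toy training sample of size m = 100.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

thetas = [0.01, 0.1, 1.0, 10.0]                         # small grid of candidate values
theta_0 = min(thetas, key=lambda t: cv_error(X, y, t))  # theta with smallest R_CV
w_final = fit_ridge(X, y, theta_0)                      # refit on the FULL training sample
```

The key point is in the last line: once θ_0 is chosen, the cross-validation folds are discarded and the parameters are re-estimated from all m examples.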
> Besides, even cross-validation for model selection is suspicious. Shouldn’t I, ideally, train all models with all the data and form a posterior over the most probable values?
If you do this, how will you get an estimate of how well your model is able to predict outside of the training set?
But once they do have the hyperparameters in place, this is exactly what they do: they fit the model on the full training data, so that they can make the most use of everything.
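Put together, the protocol being described could be sketched as follows, with ordinary least squares standing in for whatever model the tuning procedure selected; the split sizes and toy data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=120)

# 1) Split once, up front. Tuning (cross-validation, grid search, ...) may
#    only ever touch the training portion; the test set is set aside.
X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# 2) ... choose hyperparameters by cross-validation on (X_train, y_train),
#    then refit the chosen model on ALL of the training data. Here plain
#    least squares stands in for that selected model.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# 3) One final evaluation, on data the entire selection procedure never saw.
#    This number is the estimate of out-of-sample performance.
test_mse = np.mean((X_test @ w - y_test) ** 2)
```

Without step 1, nothing in the pipeline yields an honest estimate of generalization: every example would have influenced either the parameters or the hyperparameters.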
Allow me to quote directly from the book:
> The training sample of size m is then used to compute the n-fold cross-validation error R_CV(θ) for a small number of possible values of θ. θ is next set to the value θ_0 for which R_CV(θ) is smallest and the algorithm is trained with the parameter setting θ_0 over the full training sample of size m.