My first-level intuition says that if you had some sort of knob you could turn to adjust the amount of “fitting” while holding everything else constant, then “overfitting” would be when turning the knob higher makes the out-of-sample loss go up.
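To make that knob a bit more concrete, here is a minimal sketch (my own toy setup, nothing from the post) where polynomial degree plays the role of the knob: training loss keeps dropping as the degree goes up, and the point where the held-out loss starts rising again is what I'd call overfitting.

```python
# Toy sketch of the "knob": polynomial degree as the amount of fitting.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)  # made-up "true" curve plus noise
    return x, y

x_train, y_train = make_data(30)
x_test, y_test = make_data(2000)

for degree in (1, 3, 5, 9, 12):
    coeffs = np.polyfit(x_train, y_train, degree)   # best fit at this knob setting
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```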
My more-detailed model—which I haven’t thought super long about—looks like this:
In an idealized example where you had “perfect” training data that correctly labeled every example your classifier might encounter, you would want the classifier to learn a rule that puts literally 100% of the positive examples on one side of the boundary and literally 100% of the negative examples on the other side, because anything else would be inaccurate. You’d want the classifier to get as complex as necessary to achieve that.
Some reasons you might want the classifier to stop before that extreme include:
You may have errors in your training data. You want the classifier to learn the “natural” boundary instead of finding a way to reproduce the errors. (There is a toy sketch of this right after these reasons.)
Your data set may be missing dimensions. Maybe the “true” rule for classifying bleggs and rubes involves their color, but you can only view them through a black-and-white camera. Even if none of your training points are mislabeled, the best rule you could learn might misclassify some of them because you don’t have access to all the data that you’d need for a correct classification.
Rather than all possible data points, you may have a subset that is less-than-perfectly-representative. Drawing a line half-way between the known positive examples and known negative examples would misclassify some of the points in between that aren’t in your training set, because by bad luck there was some region of thingspace where your training set included positive examples that were very close to the true boundary but no equally-close negative examples (or vice versa).
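The first of these (label errors) is the easiest one to put in code. Here is a toy sketch, with made-up data and scikit-learn's k-nearest-neighbors standing in for "how hard the classifier fits": the 1-NN version memorizes the flipped labels and typically pays for it out of sample, while the smoother 15-NN version mostly recovers the natural boundary.

```python
# Sketch of the label-error point: flip some training labels, then compare a
# classifier flexible enough to memorize them (1-NN) with a smoother one (15-NN).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def make_data(n, flip_fraction=0.0):
    X = rng.uniform(-1, 1, (n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)           # the "natural" boundary
    flips = rng.random(n) < flip_fraction              # mislabel some points
    y[flips] = 1 - y[flips]
    return X, y

X_train, y_train = make_data(200, flip_fraction=0.1)   # noisy labels
X_test, y_test = make_data(2000)                        # clean labels

for k in (1, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:2d}: train acc {clf.score(X_train, y_train):.2f}, "
          f"test acc {clf.score(X_test, y_test):.2f}")
```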
The reason that less-than-maximum fitting might help with these is that we have an Occamian prior saying that the “true” (or best) classifying rule ought to be simple, and so instead of simply taking the best possible fit of the training data, we want to skew our result towards our priors.
Through this lens, “overfitting” could be described as giving too much weight to your training data relative to your priors.
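One concrete version of "skewing toward your priors" is regularization: an L2 penalty on the coefficients acts like a prior that the rule should be simple, and the penalty weight is roughly the data-versus-prior knob. A rough sketch under my own toy assumptions (the lambda values are illustrative, not tuned):

```python
# "Weight on priors" as regularization: the same degree-9 polynomial fit,
# with and without an L2 penalty standing in for an Occamian prior.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = make_data(15)
x_test, y_test = make_data(2000)

def design(x, degree=9):
    return np.vander(x, degree + 1)            # polynomial feature matrix

def fit(X, y, lam):
    # ridge solution: (X'X + lam*I)^(-1) X'y; lam = 0 is the pure best fit
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in (0.0, 1e-3, 1e-1):
    w = fit(design(x_train), y_train, lam)
    test_mse = np.mean((design(x_test) @ w - y_test) ** 2)
    print(f"lambda = {lam:g}: test MSE {test_mse:.3f}")
```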
I wonder if your more-detailed model could be included in a derivation like the one in the post above. The post assumes that every observation the model has (the previous y values) is correct. Your idea of mislabellings or imperfect observations might be includable as a rule saying that each y has some X% chance of just being wrong.
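For what it's worth, here is one way that "X% chance of just being wrong" could enter the likelihood; epsilon and p_clean are my own placeholder names, not anything from the post:

```python
# Log-likelihood of observed binary labels when each true label is flipped
# with probability epsilon before we get to see it.
import numpy as np

def noisy_label_log_likelihood(p_clean, y_observed, epsilon):
    # p_clean    : model's probability that the *true* label is 1 (array)
    # y_observed : the possibly-corrupted labels we actually have (0/1 array)
    # epsilon    : chance that any given label was flipped
    # P(observe 1) = P(true 1 and not flipped) + P(true 0 and flipped)
    p_obs = (1 - epsilon) * p_clean + epsilon * (1 - p_clean)
    return np.sum(y_observed * np.log(p_obs) + (1 - y_observed) * np.log(1 - p_obs))

# A confident prediction that disagrees with one observed label is penalized
# much less once epsilon > 0:
print(noisy_label_log_likelihood(np.array([0.99]), np.array([0]), 0.0))    # about -4.6
print(noisy_label_log_likelihood(np.array([0.99]), np.array([0]), 0.05))   # about -2.8
```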
We can imagine two similar models. [1] A “zoomed in” model that consists of two parts: a model of the real world, and a model of the observation errors. [2] A “zoomed out” model that lumps the real world and the observation errors together and tries to fit the combined data. If [2] sees errors, the whole model gets tweaked to predict those errors. Equivalent in the maths, but importantly different in spirit, is model [1]: when it encounters an obvious error it does not update the world model, but it might update the model of the observation errors.
My feeling is that some of this “overfitting” discussion might be fed by people intuitively wanting the model to do [1] while actually building or studying the much simpler [2]. When [2] folds observation errors into the same map it uses to describe the world, we cry “overfitting”.
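Here is a toy contrast between the two, under my own assumptions (a linear "real world" plus one grossly bad reading), with [2] as a flexible curve fit to the raw data and [1] as a simple world model plus a crude error model that absorbs the bad point:

```python
# [2] "zoomed out": one flexible map fit to the raw data, errors and all.
# [1] "zoomed in": a simple world model plus an explicit observation-error model.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2 * x + rng.normal(0, 0.05, 20)     # the "real world": a line with small noise
y[10] += 3.0                             # one obviously wrong observation

x_grid = np.linspace(0, 1, 200)
y_true = 2 * x_grid

# [2]: a flexible fit tweaks the "world" to predict the bad point
coeffs = np.polyfit(x, y, 9)
err_zoomed_out = np.max(np.abs(np.polyval(coeffs, x_grid) - y_true))

# [1]: linear world model; big residuals get blamed on the error model
# via a crude iteratively-reweighted least squares
weights = np.ones_like(x)
for _ in range(20):
    A = np.vstack([x, np.ones_like(x)]).T * weights[:, None]
    slope, intercept = np.linalg.lstsq(A, y * weights, rcond=None)[0]
    residuals = y - (slope * x + intercept)
    weights = 1.0 / (1.0 + (residuals / 0.1) ** 2)   # suspect points get tiny weight
err_zoomed_in = np.max(np.abs((slope * x_grid + intercept) - y_true))

print(f"[2] flexible fit,        max error vs true line: {err_zoomed_out:.2f}")
print(f"[1] world + error model, max error vs true line: {err_zoomed_in:.2f}")
```

Both are fit to the same data; the difference is only where the surprise gets stored, in the world model or in the error model, which I think is exactly the "spirit" distinction above.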