I’m confused about this.
Say our points x(n) are the times of day measured by a clock, and y(n) are the temperatures measured by a thermometer at those times. We're feeding in times x(n) in the early morning, where I decree that temperature increases roughly linearly as the sun rises.
You write the overparametrized regression model as $y(n) = c\,x(n) + \xi(n)$. Since our model doesn't get to see the index n, only the value of x(n) itself, the learned function has to implicitly be something like $h(x) = c\,x + \sum_n \delta(x - x(n))\,\xi(n)$,
where h is the regression or NN output. So our model has learned the slope plus a lookup table of the thermometer's noise values at the training times. That means that if the training set included the time x(8) = 8:00am, and the model later encounters 8:00am again outside training, say x(N+1) = 8:00am on a different day, it will output $h(x(N+1)) = c\,x(N+1) + \sum_n \delta(x(N+1) - x(n))\,\xi(n) = c\,x(N+1) + \xi(8)$.
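To make the lookup-table picture concrete, here's a minimal numerical sketch (the slope, times, and noise scale are made-up values, and the function name `h` just mirrors the notation above; none of it comes from the original setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup (all values made up for illustration): early-morning clock
# times in hours, temperature rising roughly linearly, plus thermometer noise.
c_true = 1.5                                        # degrees per hour
x_train = np.arange(5.0, 9.5, 0.5)                  # 5:00am, 5:30am, ..., 9:00am
xi = rng.normal(scale=0.4, size=x_train.shape)      # day one's thermometer noise
y_train = c_true * x_train + xi

# "General model" part: fit the slope by least squares (no intercept, as in y = c x).
c_hat = np.sum(x_train * y_train) / np.sum(x_train ** 2)

# Interpolating model h(x) = c_hat * x + sum_n delta(x - x(n)) * residual(n):
# the slope plus a lookup table of the leftover noise at each training time.
residuals = y_train - c_hat * x_train

def h(x, tol=1e-9):
    correction = 0.0
    for xn, rn in zip(x_train, residuals):
        if abs(x - xn) < tol:   # the "delta function": fires only on an exact training time
            correction = rn
    return c_hat * x + correction

# Perfect fit on the training day...
assert np.allclose([h(x) for x in x_train], y_train)

# ...but at 8:00am on a *new* day, the model still adds day one's memorised noise.
print("slope + lookup table at 8:00am:", h(8.0))
print("slope only at 8:00am          :", c_hat * 8.0)
```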
This is predictably wrong, and you can do better by not having that memorised ξ(8) noise term.
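To quantify "predictably wrong": as an added back-of-the-envelope assumption, suppose the slope is learned essentially exactly and the thermometer noise is independent across days with mean zero and variance σ². Writing ξ′(8) for the new day's noise at 8:00am, the memorised model's expected squared error at that repeated time is

$$\mathbb{E}\big[(h(x(N+1)) - y(N+1))^2\big] = \mathbb{E}\big[(\xi(8) - \xi'(8))^2\big] = 2\sigma^2,$$

while the plain linear predictor $c\,x(N+1)$ incurs only $\mathbb{E}[\xi'(8)^2] = \sigma^2$: keeping the memorised noise roughly doubles the error whenever a training time recurs.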
The model doesn't get to build a general model plus a lookup table of noise terms during training to get perfect loss, and then use only the general model outside of training. It can't switch the lookup table off.
Put differently, if there are patterns in the data that the model cannot possibly capture with a decent simple generative mechanism, fitting those patterns to get a better loss doesn't seem like the right thing to do.
Put yet another way, if you're forced to pick one single hypothesis to make predictions with, the best one doesn't necessarily come from the set of hypotheses h satisfying y(n) = h(x(n)) for all past data points, i.e. those that fit the training data perfectly.