Can we switch to the interpolation regime early if, before reaching the peak, we tell SGD to keep the loss constant? That is, suppose we are at loss l* and replace the loss function l(theta) with |l(theta)-l*| or (l(theta)-l*)^2.
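A minimal sketch of what this might look like (full-batch gradient descent on a toy least-squares problem; the target l* and all hyperparameters here are arbitrary choices for illustration, not from the original question):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: l(theta) = mean((X @ theta - y)**2).
X = rng.normal(size=(50, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.5 * rng.normal(size=50)

def loss(theta):
    return np.mean((X @ theta - y) ** 2)

def grad_loss(theta):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

l_star = 0.5  # target loss to hold constant (chosen arbitrarily)

def grad_anchored(theta):
    # Chain rule for (l(theta) - l*)^2: gradient is 2*(l - l*) * grad l.
    # It points downhill while l > l*, and *uphill* if l drops below l*.
    return 2.0 * (loss(theta) - l_star) * grad_loss(theta)

theta = np.zeros(5)
for _ in range(2000):
    theta -= 0.01 * grad_anchored(theta)

print(abs(loss(theta) - l_star))  # small: training has stalled at l*
```

Note that descent on this surrogate converges to the level set l(theta) = l* rather than to a minimizer of l, which is exactly the "keep the loss constant" behavior being asked about.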
Interesting! Given that stochastic gradient descent (SGD) does provide an inductive bias towards models that generalize better, it does seem like changing the loss function in this way could enhance generalization performance. Broadly speaking, SGD’s bias only provides a benefit when it is searching over many possible models: it performs badly at the interpolation threshold because the relatively low complexity limits convergence to a small number of overfitted models. Creating a loss function that allows SGD freer rein over the model it selects could therefore improve generalization.
If
#1 SGD is inductively biased towards more generalizable models in general
#2 an (l(θ)−l∗)² loss function gives all models with loss near l∗ a wider local minimum
#3 there are many different models where l(θ)≈l∗ at a given level of complexity, as long as l∗>0
then it’s plausible that changing the loss function in this way will amplify SGD’s bias towards models that generalize better. Point #1 is an explanation for double descent. Point #2 seems intuitive to me (it makes the loss function flatter and more convex where models perform well), and Point #3 does too: there are many different sets of predictions that will all partially fit the training dataset and yield the same loss value l∗, which implies that there are also many different predictive models that achieve that loss.
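Point #2 can be made concrete: differentiating (l−l∗)² shows the gradient magnitude shrinks linearly to zero as l approaches l∗, so the whole l∗ level set sits in a flat basin (the fixed gradient magnitude below is an arbitrary stand-in for |∇l|):

```python
# d/dl of (l - l*)^2 is 2*(l - l*), so the surrogate's gradient is
# 2*(l - l*) * grad l: it vanishes as l -> l*, regardless of grad l.
l_star = 0.5
grad_l = 1.0  # assumed magnitude of the underlying loss gradient

for l in [2.0, 1.0, 0.6, 0.51, 0.501]:
    print(l, abs(2 * (l - l_star) * grad_l))
```

Each halving of the gap l−l∗ halves the effective gradient, which is the "wider local minimum" in point #2.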
To illustrate point #3 above, imagine we’re trying to fit the set of training observations {x₁, x₂, x₃, …, xᵢ, …, xₙ}. Fully overfitting this set (getting l(θ)≈0) requires us to get all xᵢ from 1 to n correct. However, we can partially overfit this set (getting l(θ)=l∗) in a variety of different ways. For instance, if we get all xᵢ correct except for one point xⱼ, there are roughly n different choices of xⱼ that yield the same l(θ).[1] Consequently, our stochastic gradient descent process is free to apply its inductive bias to a broad set of models that have similar performance but make different predictions.
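The counting in this paragraph can be sketched numerically if we assume a 0/1 loss for concreteness (the text’s argument is for a generic loss): a model with loss l∗ = k/n can err on any k-subset of the n points, so the number of distinct prediction patterns at that loss level grows combinatorially, versus exactly one pattern at l∗ = 0.

```python
from math import comb

# Number of distinct prediction patterns achieving 0/1 loss k/n
# on n training points: C(n, k) ways to choose which points to miss.
n = 20
for k in range(4):
    print(k / n, comb(n, k))
```

With k = 1 this recovers the "roughly n different ways" in the paragraph above; already at k = 2 the count is much larger than n.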
[1] This isn’t exactly true: getting only one xⱼ wrong without changing the predictions for the other xᵢ might only be achievable by increasing complexity, since some predictions may be correlated with each other, but it demonstrates the basic idea.