If model predictions are consistently too low/high:
If using gradient descent, might be a bad starting point
To diagnose: Try running it with the number of training rounds and/or the learning rate set at or near zero, and see whether it predicts an unsuitable value for everything.
To fix: Set the starting point to the average outcome in the training set.
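A minimal sketch of that check and fix, assuming a one-feature linear model trained by plain gradient descent on squared error. The function name `fit_linear`, the `start_intercept` parameter and the toy data are all made up for illustration; your own training loop will look different.

```python
from statistics import mean

def fit_linear(xs, ys, rounds, learning_rate, start_intercept=0.0):
    """Plain gradient descent on mean squared error for y = intercept + slope * x."""
    intercept, slope = start_intercept, 0.0
    n = len(xs)
    for _ in range(rounds):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        intercept -= learning_rate * 2 * sum(errors) / n
        slope -= learning_rate * 2 * sum(e * x for e, x in zip(errors, xs)) / n
    return intercept, slope

# Hypothetical toy data: outcomes sit around 103, nowhere near zero.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [100.0, 102.0, 104.0, 106.0]

# Diagnosis: with zero rounds, the model predicts its starting point for everything.
cold_intercept, cold_slope = fit_linear(xs, ys, rounds=0, learning_rate=0.01)

# Fix: start from the average outcome in the training set.
warm_intercept, warm_slope = fit_linear(
    xs, ys, rounds=0, learning_rate=0.01, start_intercept=mean(ys)
)

print("cold start predicts", cold_intercept, "for every row")
print("warm start predicts", warm_intercept, "for every row")
```

If the cold-start value is far from any plausible outcome and your number of rounds is small, the model may simply never get the chance to climb out of it.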
If using gradient descent, might be numerical instability
To diagnose: Watch how individual and aggregate predictions change from round to round. If they flicker back and forth (with unornamented gradient descent) or swing back and forth like a pendulum (with momentum), it’s instability.
To fix: Lower learning rate; possibly increase number of rounds to compensate.
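One way to watch for the flicker, sketched under heavy simplification: the "model" here is a single constant fitted by plain gradient descent on squared error, and `fit_constant` and `count_direction_flips` are hypothetical helper names. The idea is to record the aggregate prediction after every round and count how often its direction of travel reverses.

```python
def fit_constant(ys, rounds, learning_rate, start=0.0):
    """Fit a single constant by gradient descent; return the estimate after each round."""
    target = sum(ys) / len(ys)
    estimate = start
    history = [estimate]
    for _ in range(rounds):
        gradient = 2 * (estimate - target)  # derivative of the mean squared error
        estimate -= learning_rate * gradient
        history.append(estimate)
    return history

def count_direction_flips(history):
    """Count how often consecutive round-to-round changes have opposite signs."""
    steps = [later - earlier for earlier, later in zip(history, history[1:])]
    return sum(1 for s, t in zip(steps, steps[1:]) if s * t < 0)

ys = [8.0, 10.0, 12.0]  # hypothetical outcomes

unstable_flips = count_direction_flips(fit_constant(ys, rounds=10, learning_rate=0.9))
stable_flips = count_direction_flips(fit_constant(ys, rounds=10, learning_rate=0.1))

print("flips at learning rate 0.9:", unstable_flips)
print("flips at learning rate 0.1:", stable_flips)
```

A prediction that reverses direction on nearly every round is overshooting; one that moves steadily in one direction is not, even if it has not converged yet.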
Might be distribution shift
To diagnose: See if problem is present in training set too, or just test set; if the latter, you’re looking at distribution shift.
To fix: Extend trends in existing data outwards to the future; add correction factors, or otherwise adjust by them; alternatively, give up and get more applicable data.
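A sketch of the train-versus-test comparison plus the crudest possible correction factor. It assumes you have some recent labelled data to measure the shift on; `mean_bias` and all the numbers are hypothetical, and an additive correction is only one of several adjustments you might choose.

```python
def mean_bias(predictions, actuals):
    """Average of (prediction - actual); negative means predictions run low."""
    return sum(p - a for p, a in zip(predictions, actuals)) / len(actuals)

# Hypothetical numbers: the model is unbiased on the data it was trained on...
train_bias = mean_bias([10, 12, 14], [10, 12, 14])

# ...but runs consistently low on more recent data.
recent_predictions = [10, 12, 14]
recent_actuals = [12, 14, 16]
recent_bias = mean_bias(recent_predictions, recent_actuals)

# Crude additive correction: shift every prediction by the measured bias.
corrected = [p - recent_bias for p in recent_predictions]

print("bias on training set:", train_bias)
print("bias on recent data:", recent_bias)
print("corrected predictions:", corrected)
```

A bias that shows up only in the later data points to shift rather than to a training problem. The correction is only as good as the assumption that the shift stays put.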
If highs are too high and lows are too low:
This is the default
To diagnose: Confirm whether you are using an MLE-based modelling algorithm (you are), are using a finite amount of data (you are), are modelling a process which isn’t perfectly predictable from its explanatory variables (you are), and exist in reality (you do). If these things are true, then yes, this will happen, and the only question is how hard it’s happening and whether/to what extent you want to correct for it.
To fix: Apply a penalty term.
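To make "penalty term" concrete, here is a sketch assuming an L2 (ridge) penalty on the slope of a one-feature linear model, with the intercept left unpenalised. The source does not say which penalty to use, so treat that choice, the name `ridge_slope` and the toy data as assumptions.

```python
from statistics import mean

def ridge_slope(xs, ys, penalty):
    """Closed-form slope for one-feature least squares with an L2 penalty on the slope."""
    x_mean, y_mean = mean(xs), mean(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / (sxx + penalty)

# Hypothetical noisy data, already centred on zero.
xs = [-2, -1, 0, 1, 2]
ys = [-5, -1, 0, 2, 4]

plain_slope = ridge_slope(xs, ys, penalty=0)
shrunk_slope = ridge_slope(xs, ys, penalty=4)

# A smaller slope pulls the highest and lowest predictions back towards the mean.
print("unpenalised slope:", plain_slope, "-> prediction at x=2:", plain_slope * 2)
print("penalised slope:", shrunk_slope, "-> prediction at x=2:", shrunk_slope * 2)
```

The penalty narrows the spread of predictions, which is the direction you want when highs are too high and lows are too low. How much to apply is an empirical question for a true outsample.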
Might also be change in parameters
To diagnose: Check whether explanatory variables behave very differently in training and test sets.
To fix: Try not making mistakes, and/or fixing them when you do make them.
Is almost guaranteed to happen due to distribution shift
To diagnose: Test out of context (i.e. NOT random-split) and see how far apart the lines on your AvE graphs get.
To fix: Apply a larger penalty term, and/or post-hoc adjustments, until lines align again. Or just get more relevant data.
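For reference, a sketch of the table behind such a graph, reading "AvE" as actual-versus-expected: sort rows by prediction, cut them into equal-sized bands, and compare the average prediction with the average outcome in each band. The function name and the numbers are illustrative only.

```python
def actual_vs_expected(predictions, actuals, bands):
    """Return (average prediction, average actual) per band, lowest predictions first."""
    pairs = sorted(zip(predictions, actuals))
    n = len(pairs)
    table = []
    for b in range(bands):
        chunk = pairs[b * n // bands:(b + 1) * n // bands]
        if not chunk:
            continue  # more bands than rows
        avg_predicted = sum(p for p, _ in chunk) / len(chunk)
        avg_actual = sum(a for _, a in chunk) / len(chunk)
        table.append((avg_predicted, avg_actual))
    return table

# Hypothetical out-of-context results: lows too low, highs too high.
predictions = [1, 2, 3, 4, 5, 6]
actuals = [2, 2, 3, 4, 5, 5]

table = actual_vs_expected(predictions, actuals, bands=3)
for avg_predicted, avg_actual in table:
    print(f"predicted {avg_predicted:.1f}  actual {avg_actual:.1f}")
```

Plotting the two columns against band number gives the two lines. In this toy case they cross in the middle band and diverge at both ends.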
If highs are too low and lows are too high:
Could, conceivably, be undertraining and undercomplication
To diagnose: See if more training and model complexity improve performance on a true outsample.
To fix: Use more training and model complexity.
Is much more likely to be you evaluating on the training set, or doing something isomorphic
To diagnose: Check whether you’re evaluating on the training set, or doing something isomorphic.
To fix: Don’t evaluate on the training set, or do anything isomorphic.
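One way to keep evaluation honest, sketched on the assumption that your rows carry some sortable time field: hold out the most recent rows and never let the model see them. The `month` key, the `chronological_split` helper and the data are hypothetical.

```python
def chronological_split(rows, holdout_size, time_key="month"):
    """Hold out the latest `holdout_size` rows; train on everything earlier."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    cutoff = len(ordered) - holdout_size
    return ordered[:cutoff], ordered[cutoff:]

# Hypothetical rows, deliberately out of order.
rows = [
    {"month": "2023-03", "outcome": 7},
    {"month": "2023-01", "outcome": 5},
    {"month": "2023-05", "outcome": 9},
    {"month": "2023-02", "outcome": 6},
    {"month": "2023-04", "outcome": 8},
]

train_rows, test_rows = chronological_split(rows, holdout_size=2)

print("train on:", [row["month"] for row in train_rows])
print("evaluate on:", [row["month"] for row in test_rows])
```

Anything fitted, tuned or selected using the held-out rows counts as "something isomorphic" to evaluating on the training set.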
Is almost certainly NOT happening due to distribution shift
To diagnose: I . . . guess you’d look at the difference between training and deployment? And see whether the inevitable apparent underfit on the training set is actually greater in the test set?
To fix: If you see this happening to a meaningful extent, your modelling context is cursed. Don’t try anything clever, just get a more relevant dataset – or a less cursed project – and don’t look back.
If highs and lows are both making the same error, and middle predictions are making the opposite error:
Could be an inappropriate linkage
To diagnose: Try other linkages and see if they work better. In particular, if you’re using additive(/unity) linkage to predict a price, and see both your highs and lows are too low, try multiplicative(/log) linkage.
To fix: If other linkages work better, use them.
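A toy illustration of the price example, assuming made-up prices that double with every step of a single explanatory variable. As a shortcut, the multiplicative version fits a straight line to the log of the price and exponentiates. This is a stand-in for a proper log-link model, not the same thing.

```python
import math
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_mean, y_mean = mean(xs), mean(ys)
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope

xs = [0, 1, 2, 3, 4]
prices = [100, 200, 400, 800, 1600]  # hypothetical prices, doubling each step

# Additive (unity) linkage: fit the price directly.
a0, a1 = fit_line(xs, prices)
additive_residuals = [p - (a0 + a1 * x) for x, p in zip(xs, prices)]

# Multiplicative (log) linkage: fit log(price), then exponentiate.
m0, m1 = fit_line(xs, [math.log(p) for p in prices])
log_residuals = [p - math.exp(m0 + m1 * x) for x, p in zip(xs, prices)]

print("actual minus predicted, additive:", [round(r) for r in additive_residuals])
print("actual minus predicted, log:", [round(r, 6) for r in log_residuals])
```

With the additive fit, the lowest and highest predictions both undershoot while the middle ones overshoot. With the log fit, the pattern disappears on this toy data.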
Could be a bound, or other effect of output on output
To diagnose: If the best linkage still isn’t good enough, but the problem is still present in the training set, it’s probably this.
To fix: Just increase model complexity until the problem goes away. Model complexity is a complete and appropriate solution to this problem. You don’t have to do anything else. [Intended affect: hostage reading ransom note on camera.]
If zigzag:
Could be multiple bounds, or multiple other effects of output on output
To diagnose: If you’re sure it’s not noise, and it consistently looks like this, I have no other interpretation.
To fix: Again, just raise model complexity. [Intended affect: hostage rereading part of ransom note because kidnappers say they didn’t enunciate right the first time.]