I’m not convinced this is analogous.
Regularisation works because including simplicity means that the function fits only the most prominent patterns, which are also the ones most likely to generalise outside of the data.
In contrast, Goodhart’s Law arises from agents who are adversarial against the system, in the sense of having different goals from the system designer. Simplicity isn’t going to fix this; in fact, simplicity might even make exploiting the system easier.
Hm. If we look at things a little more abstractly, regularization helps because we have some goal, Y (generalization accuracy), and a proxy which we can actually measure, X (training-set accuracy). If we optimize over a small set of possibilities for maximum X, we expect high X to yield high Y. If we optimize over a larger set, we can get better X values, but at some point our confidence that this yields high Y values starts to drop. So, to manage this tradeoff, we add a new term Z which is our regularizer. Z has to do with the amount we have to expand our model class before we would consider the hypothesis we’re looking at (which turns out to be a good proxy for the extent to which we expect X to be a good proxy for Y for this hypothesis).
Now, optimizing a combination of X and Z, we expect to do about as well as we can.
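To make the X/Y/Z picture concrete, here is a minimal sketch, not anything from the original discussion. It assumes nested model classes given by polynomial degree, so that Z is just the smallest degree whose class contains the hypothesis. The target function, noise level, sample sizes and the penalty weight `lam` are all arbitrary illustrative choices, and `make_data`, `mse` and `select_degree` are hypothetical helper names. The sketch uses error rather than accuracy, so X and Y are minimized here instead of maximized.

```python
import numpy as np

def make_data(n, rng, noise=0.3):
    """Noisy samples of a smooth target (the target is an arbitrary assumption)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(0.0, noise, size=n)
    return x, y

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (np.polyfit coefficients) on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

def select_degree(x_train, y_train, max_degree, lam):
    """Pick the degree minimizing X + lam * Z.

    X is the training error of the best polynomial of that degree;
    Z is the degree itself, i.e. how far the model class had to be expanded.
    """
    scores = []
    for d in range(max_degree + 1):
        coeffs = np.polyfit(x_train, y_train, d)
        scores.append(mse(coeffs, x_train, y_train) + lam * d)
    return int(np.argmin(scores))

rng = np.random.default_rng(0)
x_train, y_train = make_data(12, rng)    # small sample: X is a shaky proxy
x_test, y_test = make_data(2000, rng)    # large held-out sample stands in for Y

for lam in (0.0, 0.02):
    d = select_degree(x_train, y_train, max_degree=8, lam=lam)
    coeffs = np.polyfit(x_train, y_train, d)
    print(f"lam={lam}: degree={d}, "
          f"train error={mse(coeffs, x_train, y_train):.3f}, "
          f"held-out error={mse(coeffs, x_test, y_test):.3f}")
```

With `lam=0.0` the selection optimizes X alone over the whole expanded class; with `lam > 0` it can never pick a higher degree than that. Whether the held-out error actually improves depends on the data and on `lam`, which is the tradeoff being managed.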
This description abstracted away the details of “including simplicity means that the function fits only the most prominent patterns, which are also the ones most likely to generalise outside of the data”; those details are packed into why we expect X to become a worse proxy as we expand the set of options. But regularization will work regardless of why.
Similarly, if we are considering candidates for a job, we can expect Goodhart’s Law to be worse the more widely the job has been advertised. So there is a similar “larger pools of options make the proxy worse” phenomenon. But there are also other phenomena. The Goodharting is worse if your criteria for hiring are publicly known, and better if they are private. The Goodharting is worse if your criteria are vague and you let the applicants decide how to present evidence for them (so they can cherry-pick a presentation which is favorable to them), and better if the criteria are objectively stated.
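The “larger pools of options make the proxy worse” effect can be seen in a toy simulation. This is only a sketch under strong assumptions: each candidate’s true quality is standard normal, the hiring score is quality plus independent normal noise, and nobody is behaving strategically. It therefore covers the pool-size effect only, not the effects of public or vague criteria. The function name `mean_proxy_gap` and all parameters are hypothetical.

```python
import numpy as np

def mean_proxy_gap(pool_size, trials, rng, noise_sd=1.0):
    """Average (proxy score - true quality) of the top-scoring candidate.

    Each trial draws a fresh pool, hires the candidate with the highest
    proxy score, and records how much the proxy overstated that hire.
    """
    quality = rng.normal(0.0, 1.0, size=(trials, pool_size))
    proxy = quality + rng.normal(0.0, noise_sd, size=(trials, pool_size))
    hired = np.argmax(proxy, axis=1)
    rows = np.arange(trials)
    return float(np.mean(proxy[rows, hired] - quality[rows, hired]))

rng = np.random.default_rng(0)
for pool_size in (1, 10, 100, 1000):
    gap = mean_proxy_gap(pool_size, trials=500, rng=rng)
    print(f"pool size {pool_size:>4}: mean overestimate of hire = {gap:.2f}")
```

Under these assumptions the overestimate grows with the pool, because picking the maximum proxy score increasingly selects for lucky noise as well as for quality.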
More importantly, for MIRI’s concerns: in a powerful enough machine learning system, where the “models” which you’re selecting between may themselves be agents, there’s not a really clean line between the different phenomena. You might allow a sufficiently complex model (for a notion of “complex” which is unclear, and not necessarily the opposite of “simple” the way you’re using it) and get Goodharted straightforwardly, i.e., get an agent out the other end who was able to intelligently achieve a high score on your training data but who may be less cooperative on real-life data due to being capable of noticing the context shift.