Tangentially relevant: Armstrong has a post on some general features that make it likely that Goodharting will be a problem. Conversely, he gives some features that make it less likely you’ll wind up Goodharting in your search.
So, as long as:
1. We use a Bayesian mix of reward functions rather than a maximum likelihood reward function.
2. An ideal reward function is present in the space of possible reward functions, and is not penalised in probability.
3. The different reward functions are normalised.
4. If our ideal reward functions have diminishing returns, this fact is explicitly included in the learning process.
Then we shouldn’t unduly fear Goodhart effects (of course, we still need to incorporate as much as possible about our preferences into the AI’s learning process). The second condition, making sure that the ideal reward function is not unusually penalised in probability, seems the hardest to ensure.
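To make the first and third conditions concrete, here is a minimal Python sketch contrasting committing to the single most-likely reward function with maximising the posterior-weighted mixture. The candidate rewards, posterior weights, and finite action set are made-up assumptions for illustration, not anything from Armstrong's post.

```python
# Illustrative sketch: maximum-likelihood reward vs a Bayesian mix of rewards.
# All quantities here (candidate rewards, posterior, action set) are made up.

import numpy as np

rng = np.random.default_rng(0)

n_actions = 5

# Three candidate reward functions over the action set, normalised to a
# common scale (condition 3): zero mean, unit standard deviation per candidate.
candidate_rewards = rng.normal(size=(3, n_actions))
candidate_rewards = (candidate_rewards - candidate_rewards.mean(axis=1, keepdims=True)) \
                    / candidate_rewards.std(axis=1, keepdims=True)

# Posterior probability that each candidate is the true reward; condition 2
# requires the ideal reward to be in this set and not unduly penalised here.
posterior = np.array([0.4, 0.35, 0.25])

# Maximum-likelihood approach: commit to the single most probable reward.
ml_index = np.argmax(posterior)
ml_action = np.argmax(candidate_rewards[ml_index])

# Bayesian-mix approach (condition 1): maximise expected reward under the
# full posterior, so an action that scores badly on a plausible candidate is
# penalised even if the most likely candidate rates it highly.
mixture_values = posterior @ candidate_rewards
bayes_action = np.argmax(mixture_values)

print("max-likelihood action:", ml_action)
print("Bayesian-mix action:  ", bayes_action)
```

The point of the mixture is that an action which exploits the most likely reward but does poorly on other plausible candidates gets dragged down by its expected value, which is exactly the check the maximum-likelihood choice lacks.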