In reward learning research, it’s common to represent the AI’s estimate of the true reward function as a distribution over possible reward functions, which I think is analogous to what you are describing. It’s also common to define optimal behavior, given a distribution over reward functions, as the behavior that maximizes expected reward under that distribution. By linearity of expectation, this is mathematically equivalent to optimizing a single reward function equal to the mean of the distribution. So this helps in that the AI is optimizing a reward function that is more likely to be “aligned” than one at an extreme end of the distribution. However, it doesn’t help with the problems of optimizing a single fixed reward function.
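To see the equivalence concretely, here is a minimal numerical sketch. It represents policies by their state-visitation distributions so that a policy’s value is linear in the reward function; the setup, names, and numbers are all my own illustration, not from any particular codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K candidate reward functions over S states,
# a posterior p over which candidate is correct, and a few candidate
# policies represented as state-visitation distributions (rows sum to 1).
K, S = 3, 5
reward_fns = rng.normal(size=(K, S))           # R_k(s) for each candidate
p = np.array([0.5, 0.3, 0.2])                  # posterior over reward functions
policies = rng.dirichlet(np.ones(S), size=4)   # d_i(s) for 4 candidate policies

# Value of each policy under each candidate reward:
# V[k, i] = sum_s d_i(s) * R_k(s)
V = reward_fns @ policies.T

# (1) Expected value under the posterior, per policy: E_k[ V_k(pi_i) ]
expected_value = p @ V

# (2) Value under the single "mean" reward function R_bar = E_k[ R_k ]
mean_reward = p @ reward_fns
value_under_mean = policies @ mean_reward

# By linearity of expectation these coincide, so the same policy is
# optimal either way.
assert np.allclose(expected_value, value_under_mean)
print("best policy either way:", np.argmax(expected_value))
```

The assertion is exactly the point above: averaging values over the reward distribution and evaluating against the averaged reward function pick out the same optimal policy, which is why the distributional view collapses back to optimizing one fixed reward function.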