I’m guessing such reward functions would be used to detect something like model splintering?
Deep Reinforcement Learning from Human Preferences uses an ensemble of reward models, querying the user for more feedback on the trajectory pairs where the ensemble members disagree most.
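To make the disagreement-based querying concrete, here is a minimal sketch of one way it could work: rank candidate comparisons by the variance of the ensemble's predictions and send the top few to the human. The function name `select_queries`, the array layout, and the numbers are all made up for illustration; this is not the paper's code.

```python
import numpy as np

def select_queries(ensemble_preds, n_queries):
    """Pick the comparisons the ensemble disagrees on most.

    ensemble_preds: array of shape (n_models, n_pairs). Each entry is one
    model's predicted probability that segment A is preferred to segment B.
    (Hypothetical layout, chosen for this sketch.)
    """
    # Variance across ensemble members, one value per candidate pair.
    disagreement = ensemble_preds.var(axis=0)
    # Indices of the pairs with the highest disagreement, largest first.
    return np.argsort(-disagreement)[:n_queries]

# Three hypothetical reward models scoring four candidate pairs.
preds = np.array([
    [0.9, 0.5, 0.1, 0.8],
    [0.9, 0.4, 0.9, 0.7],
    [0.8, 0.6, 0.5, 0.8],
])
queries = select_queries(preds, n_queries=2)
print(queries)  # pair 2 first (models range from 0.1 to 0.9), then pair 1
```

A fixed disagreement threshold would be a small variation on this: compare `disagreement` against a cutoff instead of taking the top `n_queries`.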
Whether this ensemble would be diverse enough to learn both “go right” and “go to coin” is unclear. Traditional “predictive” diversity metrics probably wouldn’t help (the whole problem is that the coin and the right wall reward models would predict the same reward on the training distribution), but using some measure of internal network diversity (i.e. differences in internal representations) might work.
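A toy illustration of why predictive diversity gives no signal here, assuming a 1-D corridor and two hand-written stand-ins for the learned reward models (`reward_right` and `reward_coin` are hypothetical, not anything from the paper): when the coin always sits at the right wall, the two models output identical rewards, so any disagreement measure computed on training data is exactly zero. It only becomes nonzero once the coin moves.

```python
import numpy as np

WIDTH = 10  # corridor positions 0..9; the right wall is position 9

def reward_right(agent_x, coin_x):
    # Stand-in for a model that learned "go right".
    return (agent_x == WIDTH - 1).astype(float)

def reward_coin(agent_x, coin_x):
    # Stand-in for a model that learned "go to coin".
    return (agent_x == coin_x).astype(float)

def disagreement(agent_x, coin_x):
    # Mean variance of the two models' predicted rewards over a batch of states.
    preds = np.stack([reward_right(agent_x, coin_x),
                      reward_coin(agent_x, coin_x)])
    return preds.var(axis=0).mean()

rng = np.random.default_rng(0)
agent_x = rng.integers(0, WIDTH, size=1000)

# Training distribution: the coin is always at the right wall.
train_coin = np.full(1000, WIDTH - 1)
# Shifted distribution: the coin is anywhere except the right wall.
shifted_coin = rng.integers(0, WIDTH - 1, size=1000)

train_d = disagreement(agent_x, train_coin)
shift_d = disagreement(agent_x, shifted_coin)
print(train_d, shift_d)  # train_d is exactly 0; shift_d is positive
```

Anything that only looks at outputs on the training states sees these two models as the same function, which is why a measure over internal representations is the more promising place to look.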
Hey there! Sorry for the delay. $100 for the best reference. PM for more details.