I don’t see a reason not to try, though I wonder whether this leads back to the underlying problem of reward hacking, i.e. can we actually design a reward that couldn’t be exploited by, say, a trajectory that only looks benevolent?
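To make the worry concrete, here’s a toy sketch (all names hypothetical, not any real setup): if the reward is a function of the observable trajectory alone, then a deceptive policy that reproduces the benevolent observables gets exactly the same score, so the proxy can’t separate the two.

```python
# Toy illustration of the concern above (all names hypothetical):
# a reward computed only from *observable* behavior assigns equal
# score to a genuinely helpful trajectory and to one that merely
# looks helpful.

def proxy_reward(trajectory):
    """Reward based solely on what the overseer can observe."""
    return sum(step["looks_helpful"] for step in trajectory)

# Genuinely benevolent trajectory: helpful in appearance and in effect.
honest = [{"looks_helpful": 1, "is_helpful": 1} for _ in range(10)]

# Deceptive trajectory: identical observables, no actual helpful effect.
deceptive = [{"looks_helpful": 1, "is_helpful": 0} for _ in range(10)]

# The proxy cannot distinguish them.
assert proxy_reward(honest) == proxy_reward(deceptive)
```

The point of the sketch is just that the exploit doesn’t require the reward to be badly designed in any obvious way; it only requires that the reward is blind to whatever distinguishes looking benevolent from being benevolent.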