One option is to collect feedback separately on the ‘ultimate desirability’ of consequences and the ‘practical usefulness’ of actions, building the consequence-prediction model solely from experimental data and the value-estimation model solely from human feedback. I think this runs into serious issues: humans have to solve the mixed problem, not the split problem, so it will be difficult for them to give well-split training data.
Also, having a solution that’s “real but expensive” would be a real step up from having no solution!
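
To make the split concrete, here is a minimal toy sketch of what training the two models from separate data sources might look like. Everything in it is an illustrative assumption rather than a proposal from the discussion above: the function names, the tabular models, and the toy data are all hypothetical. The consequence model sees only observed (action, outcome) pairs, the value model sees only human ratings of outcomes, and the usefulness of an action is computed by combining the two.

```python
from collections import Counter, defaultdict


def fit_consequence_model(transitions):
    """Estimate P(outcome | action) from observed (action, outcome) pairs only."""
    counts = defaultdict(Counter)
    for action, outcome in transitions:
        counts[action][outcome] += 1
    return {
        action: {o: n / sum(c.values()) for o, n in c.items()}
        for action, c in counts.items()
    }


def fit_value_model(ratings):
    """Average the human ratings for each outcome; uses human feedback only."""
    scores = defaultdict(list)
    for outcome, score in ratings:
        scores[outcome].append(score)
    return {o: sum(s) / len(s) for o, s in scores.items()}


def action_usefulness(action, consequence_model, value_model):
    """Expected desirability of an action: sum over outcomes of P(o | a) * V(o)."""
    return sum(p * value_model[o] for o, p in consequence_model[action].items())


# Hypothetical toy data: experiments on one side, human ratings on the other.
transitions = [("a", "good"), ("a", "bad"), ("b", "good"), ("b", "good")]
ratings = [("good", 1.0), ("bad", -1.0), ("good", 0.5)]

consequence_model = fit_consequence_model(transitions)
value_model = fit_value_model(ratings)

for action in ("a", "b"):
    print(action, action_usefulness(action, consequence_model, value_model))
```

Note that the sketch simply assumes each entry in `ratings` is a pure judgment of desirability. That is exactly the assumption questioned above: if human raters fold their own beliefs about consequences into those scores, the value model is no longer trained on well-split data, and the clean factorization breaks down.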