I agree you have to do something clever to make the intended policy plausibly optimal.
The first part of my proposal in section 3 here was to avoid using “imitate humans,” and to instead learn a function “Answer A is unambiguously worse than answer B.” Then we update against policies only when they give unambiguously worse answers.
(I think this still has a lot of problems; it’s not obvious to me whether the problem is soluble.)
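To make the shape of the mechanism concrete, here is a minimal sketch of the "only update on unambiguous comparisons" idea. Everything here is a hypothetical illustration: the `judge` is a toy stand-in for a learned comparison model, and the `MARGIN` threshold and function names are invented for the example, not part of the original proposal.

```python
import math
from dataclasses import dataclass
from typing import Optional

MARGIN = 0.9  # hypothetical confidence threshold for "unambiguous"


@dataclass
class Comparison:
    prob_a_worse: float  # judge's probability that answer A is worse than B


def judge(answer_a: str, answer_b: str) -> Comparison:
    """Toy stand-in for a learned comparator.

    Here it just treats shorter answers as worse, with confidence
    growing in the length gap; a real system would learn this relation."""
    gap = len(answer_b) - len(answer_a)
    prob = 1.0 / (1.0 + math.exp(-gap / 10.0))  # logistic in the gap
    return Comparison(prob_a_worse=prob)


def training_signal(answer_a: str, answer_b: str) -> Optional[int]:
    """Return -1 to update against A, +1 to update against B, or None.

    The key property: ambiguous comparisons produce no gradient at all,
    so the policy is never pushed toward imitating a human answer as such,
    only away from answers that are unambiguously worse."""
    c = judge(answer_a, answer_b)
    if c.prob_a_worse > MARGIN:
        return -1  # A is unambiguously worse: update against it
    if c.prob_a_worse < 1.0 - MARGIN:
        return +1  # B is unambiguously worse: update against it
    return None    # ambiguous: leave the policy alone


if __name__ == "__main__":
    # Unambiguous under the toy judge: large length gap -> update against A.
    print(training_signal("ok", "a much longer, more detailed answer " * 3))
    # Ambiguous: comparable answers -> no update.
    print(training_signal("comparable answer one", "comparable answer two"))
```

The point of the `None` branch is the whole proposal in miniature: the intended policy only needs to avoid unambiguously worse answers, rather than match human outputs, to receive no negative updates.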