There may be predictable errors in the training data, such that the instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to those errors).
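To make that failure mode concrete, here is a minimal sketch with made-up data: one training label is wrong, so a policy that models the labeler (errors included) beats an honest policy on the training loss. The dataset, policies, and 0/1 loss are all illustrative assumptions, not anyone's actual setup.

```python
# Hypothetical illustration: a labeler error makes the "instrumental"
# policy (predict what the labeler will say) score better than the
# "honest" policy (report the truth). All examples are made up.

dataset = [
    {"question": "2+2?", "truth": "4", "label": "4"},
    {"question": "7*8?", "truth": "56", "label": "56"},
    # Labeler error: the label disagrees with the truth.
    {"question": "capital of Australia?", "truth": "Canberra", "label": "Sydney"},
]

def loss(policy, data):
    """0/1 loss against the (possibly wrong) training labels."""
    return sum(policy(ex) != ex["label"] for ex in data)

honest = lambda ex: ex["truth"]        # reports the true answer
instrumental = lambda ex: ex["label"]  # models the labeler, errors and all

honest_loss = loss(honest, dataset)            # penalized on the mislabeled example
instrumental_loss = loss(instrumental, dataset)  # matches every label, wrong or not
```

If the errors are predictable, training pressure favors the instrumental policy: it pays no loss on the mislabeled example, while honesty does.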
If you are answering questions as text, there is a lot of freedom in wording. Many strings of text are a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think).
I agree you have to do something clever to make the intended policy plausibly optimal.
The first part of my proposal in section 3 here was to avoid using “imitate humans,” and to instead learn a function “Answer A is unambiguously worse than answer B.” Then we update against policies only when they give unambiguously worse answers.
(I think this still has a lot of problems; it’s not obvious to me whether the problem is soluble.)
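The proposal above can be sketched in code: learn (or here, stub in) a judgment "Answer A is unambiguously worse than answer B," and update against a policy's answer only when that judgment fires, so mere wording differences carry no gradient. The judge, the score table, and the toy answers are all illustrative assumptions, not the actual construction from the linked proposal.

```python
# Hypothetical sketch: penalize an answer only when it is "unambiguously
# worse" than a comparison answer; ambiguous cases and wording variants
# produce no update. The judge here is a crude stand-in.

def unambiguously_worse(answer_a, answer_b):
    """Stand-in judge: A is unambiguously worse than B only if A is wrong
    while B is right. Ties and ambiguous comparisons return False."""
    return (not answer_a["correct"]) and answer_b["correct"]

def update_policy(scores, candidate, reference):
    """Update against the candidate only when the judge fires; otherwise
    leave the policy untouched (no pressure to match exact wording)."""
    if unambiguously_worse(candidate, reference):
        scores[candidate["text"]] = scores.get(candidate["text"], 0.0) - 1.0
    return scores

scores = {}
ref = {"text": "Paris", "correct": True}
cand_wrong = {"text": "Lyon", "correct": False}
cand_alt = {"text": "Paris, France", "correct": True}  # different wording, also correct

scores = update_policy(scores, cand_wrong, ref)  # penalized: unambiguously worse
scores = update_policy(scores, cand_alt, ref)    # not penalized: just a rewording
```

The point of the asymmetry is that the intended (honest) policy never gives an unambiguously worse answer, so it is never updated against, even though it does not reproduce the human's exact wording.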