There may be predictable errors in the training data, such that the instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to those errors).
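To make that failure mode concrete, here is a minimal sketch with made-up data: one training label is wrong, so a policy that models the labeler (errors included) beats an honest policy on the training loss. The dataset, policies, and 0/1 loss are all illustrative assumptions, not anyone's actual setup.

```python
# Hypothetical illustration: a labeler error makes the "instrumental"
# policy (predict what the labeler will say) score better than the
# "honest" policy (report the truth). All examples are made up.

dataset = [
    {"question": "2+2?", "truth": "4", "label": "4"},
    {"question": "7*8?", "truth": "56", "label": "56"},
    # Labeler error: the label disagrees with the truth.
    {"question": "capital of Australia?", "truth": "Canberra", "label": "Sydney"},
]

def loss(policy, data):
    """0/1 loss against the (possibly wrong) training labels."""
    return sum(policy(ex) != ex["label"] for ex in data)

honest = lambda ex: ex["truth"]        # reports the true answer
instrumental = lambda ex: ex["label"]  # models the labeler, errors and all

honest_loss = loss(honest, dataset)            # penalized on the mislabeled example
instrumental_loss = loss(instrumental, dataset)  # matches every label, wrong or not
```

If the errors are predictable, training pressure favors the instrumental policy: it pays no loss on the mislabeled example, while honesty does.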
If you are answering questions as text, there is a lot of freedom in wording. Many strings of text are a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think).
I agree you have to do something clever to make the intended policy plausibly optimal.
The first part of my proposal in section 3 here was to avoid using “imitate humans,” and to instead learn a function “Answer A is unambiguously worse than answer B.” Then we update against policies only when they give unambiguously worse answers.
(I think this still has a lot of problems; it’s not obvious to me whether the problem is soluble.)
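The proposal above can be sketched in code: learn (or here, stub in) a judgment "Answer A is unambiguously worse than answer B," and update against a policy's answer only when that judgment fires, so mere wording differences carry no gradient. The judge, the score table, and the toy answers are all illustrative assumptions, not the actual construction from the linked proposal.

```python
# Hypothetical sketch: penalize an answer only when it is "unambiguously
# worse" than a comparison answer; ambiguous cases and wording variants
# produce no update. The judge here is a crude stand-in.

def unambiguously_worse(answer_a, answer_b):
    """Stand-in judge: A is unambiguously worse than B only if A is wrong
    while B is right. Ties and ambiguous comparisons return False."""
    return (not answer_a["correct"]) and answer_b["correct"]

def update_policy(scores, candidate, reference):
    """Update against the candidate only when the judge fires; otherwise
    leave the policy untouched (no pressure to match exact wording)."""
    if unambiguously_worse(candidate, reference):
        scores[candidate["text"]] = scores.get(candidate["text"], 0.0) - 1.0
    return scores

scores = {}
ref = {"text": "Paris", "correct": True}
cand_wrong = {"text": "Lyon", "correct": False}
cand_alt = {"text": "Paris, France", "correct": True}  # different wording, also correct

scores = update_policy(scores, cand_wrong, ref)  # penalized: unambiguously worse
scores = update_policy(scores, cand_alt, ref)    # not penalized: just a rewording
```

The point of the asymmetry is that the intended (honest) policy never gives an unambiguously worse answer, so it is never updated against, even though it does not reproduce the human's exact wording.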