I agree you have to do something clever to make the intended policy plausibly optimal.
The first part of my proposal in section 3 here was to avoid using “imitate humans,” and to instead learn a function “Answer A is unambiguously worse than answer B.” Then we update against policies only when they give unambiguously worse answers.
(I think this still has a lot of problems; it’s not obvious to me whether the problem is soluble.)
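To make the shape of the mechanism concrete, here is a minimal sketch of the "only update on unambiguous comparisons" idea. Everything here is a hypothetical illustration: the `judge` is a toy stand-in for a learned comparison model, and the `MARGIN` threshold and function names are invented for the example, not part of the original proposal.

```python
import math
from dataclasses import dataclass
from typing import Optional

MARGIN = 0.9  # hypothetical confidence threshold for "unambiguous"


@dataclass
class Comparison:
    prob_a_worse: float  # judge's probability that answer A is worse than B


def judge(answer_a: str, answer_b: str) -> Comparison:
    """Toy stand-in for a learned comparator.

    Here it just treats shorter answers as worse, with confidence
    growing in the length gap; a real system would learn this relation."""
    gap = len(answer_b) - len(answer_a)
    prob = 1.0 / (1.0 + math.exp(-gap / 10.0))  # logistic in the gap
    return Comparison(prob_a_worse=prob)


def training_signal(answer_a: str, answer_b: str) -> Optional[int]:
    """Return -1 to update against A, +1 to update against B, or None.

    The key property: ambiguous comparisons produce no gradient at all,
    so the policy is never pushed toward imitating a human answer as such,
    only away from answers that are unambiguously worse."""
    c = judge(answer_a, answer_b)
    if c.prob_a_worse > MARGIN:
        return -1  # A is unambiguously worse: update against it
    if c.prob_a_worse < 1.0 - MARGIN:
        return +1  # B is unambiguously worse: update against it
    return None    # ambiguous: leave the policy alone


if __name__ == "__main__":
    # Unambiguous under the toy judge: large length gap -> update against A.
    print(training_signal("ok", "a much longer, more detailed answer " * 3))
    # Ambiguous: comparable answers -> no update.
    print(training_signal("comparable answer one", "comparable answer two"))
```

The point of the `None` branch is the whole proposal in miniature: the intended policy only needs to avoid unambiguously worse answers, rather than match human outputs, to receive no negative updates.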