johnswentworth comments on Externalized reasoning oversight: a research direction for language model alignment

johnswentworth 5 Aug 2022 1:58 UTC
Now, the usual way to get around that problem is to have the system mimic humans. Then, the argument goes, the system’s answers will be about as good as what the humans would have figured out anyway, maybe given a bit more time. (Not a lot more time, because extrapolating too far into the future is itself failure-prone: we push further and further out of the training distribution as we ask for hypothetical text from further and further into the future.) This is a fine plan, but inherently limited in terms of how much it buys us. It mostly works in worlds where we were already close to solving the problem ourselves, and the further we were from solving it, the less likely it is that the human-mimicking AI will close the gap. E.g. if I already have most of the pieces, then GPT-6 is much more likely to generate a workable solution when prompted to give “John Wentworth’s Solution To The Alignment Problem” from the year 2035, whereas if I don’t already have most of the pieces then that is much less likely to work. And that’s true even if I will eventually figure out the pieces; the problem is that the pieces I haven’t figured out yet aren’t in the training distribution. So mostly, the way to increase the chances of that strategy working is to get closer to solving the problem ourselves.