I also don’t know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the “RLHF --> non-myopia --> treacherous turn” argument so that it can be discussed more critically.
I’m not sure where the disagreement comes from. I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move without noticing it, you would expect the model, once it becomes smart enough to understand the game, to start picking a different move that leads to better results, since its actions are selected for the results they ultimately produce.
Yes, of course such a model will make superhuman moves (as will GPT if prompted on “Player 1 won the game by making move X”), while a model trained to imitate human moves will continue to play at or below human level (as will GPT given appropriate prompts).
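To make the contrast concrete, here is a minimal sketch of the scenario being described, using a hypothetical one-step game I made up for illustration: imitation learning copies the human's reliably suboptimal move, while selecting moves by the reward they produce finds the better move the humans never play.

```python
# Toy one-step "game" (hypothetical illustration): move 2 is objectively
# best, but the human demonstrator always plays move 1.
REWARDS = {0: 0.1, 1: 0.5, 2: 1.0}

def human_move():
    return 1  # humans reliably make the same suboptimal move

# Imitation learning: fit the empirical distribution of human moves,
# then play the most common one -- this caps the policy at human level.
demos = [human_move() for _ in range(1000)]
imitation_policy = max(set(demos), key=demos.count)  # move 1

# RL-style selection: estimate each move's value from the reward it
# actually produces, then play the argmax -- this can exceed human level.
q = {m: 0.0 for m in REWARDS}
for _ in range(300):
    for m in REWARDS:
        q[m] += 0.1 * (REWARDS[m] - q[m])  # update toward observed reward
rl_policy = max(q, key=q.get)  # move 2

print(imitation_policy, rl_policy)
```

The point of the sketch is only that the two training signals select for different things: matching the demonstrator versus producing good outcomes.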
But I think the thing I’m objecting to is a more fundamental incoherence or equivocation in how these concepts are being used and how they are being connected to risk.
I broadly agree that RLHF models introduce a new failure mode of producing outputs that e.g. drive human evaluators insane (or have transformative effects on the world in the course of their human evaluation). To the extent that’s all you are saying, we are in agreement, and my claim is just that it doesn’t really challenge Peter’s summary (or represent a particularly important problem for RLHF).