In RLHF, gradient descent will steer the model towards being more agentic about the entire output (and, speculatively, more context-aware), because that’s the best way to produce a token on a high-reward trajectory. The lowest loss is achieved by a superintelligence that thinks about a sequence of tokens that would be best at hacking human brains (or a model of human brains) and implements it token by token.
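To make the per-token incentive concrete, here is a minimal sketch (toy numbers, and plain REINFORCE rather than the PPO-style setup actual RLHF uses) of how the gradient on an *early* token depends on the reward of the whole completion — the sense in which later tokens come to matter for every choice:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Policy's distribution over two possible first tokens (toy setup).
logits = np.array([0.0, 0.0])
p = softmax(logits)            # [0.5, 0.5]
first_token = 0

def rl_grad_on_first_token(trajectory_reward):
    # d/dlogits [ R * log p(first_token) ]: REINFORCE with a single
    # terminal reward for the whole completion.
    g = -p.copy()
    g[first_token] += 1.0
    return trajectory_reward * g

# Two completions that share the same first token but end differently,
# so the evaluator scores them differently (hypothetical rewards):
g_good_ending = rl_grad_on_first_token(1.0)
g_bad_ending = rl_grad_on_first_token(0.1)

# The update to the *first* token choice differs purely because of what
# happened later in the trajectory -- the non-myopic pressure. Under
# next-token prediction, the gradient at position 1 depends only on the
# target token at position 1, never on the continuation.
print(np.allclose(g_good_ending, g_bad_ending))  # prints False
```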
That seems quite different from a GPT that focuses entirely on predicting the current token and isn’t incentivized to care about the tokens that come after the current one, beyond what the consequentialists writing (and, good point, selecting) the text cared about. At the lowest loss, the GPT doesn’t use much more optimization power than those consequentialists plausibly had/used.
I have an intuition about a pretty clear difference here (and have been quite sure for a while now that RL breaks myopia), and am surprised by your expectation that myopia will be preserved. RLHF means optimizing the entire trajectory to get a high reward, with every choice of every token. I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to instead start picking a new move that leads to better results, with the actions selected for the results they produce in the end. It feels almost tautological: if the model sees a way to achieve a better long-term outcome, it will do that to score better. The model will be steered towards achieving predictably better outcomes in the end. The fact that it’s a transformer that individually picks every token doesn’t mean that RL won’t make it focus on achieving a higher score. Why would the game being about human feedback prevent a GPT from becoming agentic and using more and more of its capabilities to achieve longer-term goals?
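The game thought experiment can be sketched directly. Everything here is hypothetical: a one-step game with three moves where "humans" always play move 0, but move 2 yields a higher reward they never see. Imitation fine-tuning copies the human move; a REINFORCE-style update (a toy stand-in for actual RLHF) finds the better one:

```python
import numpy as np

rng = np.random.default_rng(0)
REWARDS = np.array([0.5, 0.3, 1.0])  # move 2 is the better move humans never play
HUMAN_MOVE = 0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Imitation: maximize log-probability of the human move.
logits_im = np.zeros(3)
for _ in range(500):
    p = softmax(logits_im)
    grad = -p
    grad[HUMAN_MOVE] += 1.0          # d/dlogits log p(human move)
    logits_im += 0.1 * grad

# RL (REINFORCE with a running baseline): sample a move, score it by
# outcome, reinforce in proportion to the advantage.
logits_rl = np.zeros(3)
baseline = 0.0
for _ in range(2000):
    p = softmax(logits_rl)
    a = rng.choice(3, p=p)
    r = REWARDS[a]
    adv = r - baseline
    baseline += 0.05 * (r - baseline)
    grad = -p
    grad[a] += 1.0                   # d/dlogits log p(a)
    logits_rl += 0.1 * adv * grad

# Imitation stays on the human move; RL converges on the higher-reward
# move it was never shown -- selected for the result it produces.
print(np.argmax(softmax(logits_im)), np.argmax(softmax(logits_rl)))
```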
(I haven’t read the “conditioning generative models” sequence, will probably read it. Thank you)
I also don’t know where the disagreement comes from. At some point I am interested in engaging with a more substantive article laying out the “RLHF --> non-myopia --> treacherous turn” argument so that it can be discussed more critically.
I’m not sure where the disagreement comes from; I predict that if you imagine fine-tuning a transformer with RL on a game where humans always make the same suboptimal move, but they don’t see it, you would expect the model, when it becomes smarter and understands the game well enough, to instead start picking a new move that leads to better results, with the actions selected for the results they produce in the end.
Yes, of course such a model will make superhuman moves (as will GPT if prompted on “Player 1 won the game by making move X”), while a model trained to imitate human moves will continue to play at or below human level (as will GPT given appropriate prompts).
But I think the thing I’m objecting to is a more fundamental incoherence or equivocation in how these concepts are being used and how they are being connected to risk.
I broadly agree that RLHF models introduce a new failure mode of producing outputs that e.g. drive human evaluators insane (or have transformative effects on the world in the course of their human evaluation). To the extent that’s all you are saying, we are in agreement, and my claim is just that it doesn’t really challenge Peter’s summary (or represent a particularly important problem for RLHF).