In the post you say that human programmers will write the AI’s reward function and there will be one step of indirection (and that the focus is the outer alignment problem).
That’s not quite my position.
Per §2.4.2, I think that both outer alignment (specification gaming) and inner alignment (goal misgeneralization) are real problems. I emphasized outer alignment more in the post, because my goal in §2.3–§2.5 was not quite “argue that technical alignment of brain-like AGI will be hard”, but more specifically “argue that it will be harder than most LLM-focused people are expecting”, and LLM-focused people are already thinking about inner alignment / goal misgeneralization.
I also think that a good AGI reward function will be a “non-behaviorist” reward function, for which the definition of inner versus outer misalignment kinda breaks down in general.
But it seems likely to me that the programmers won’t know what code to write for the reward function, since it would be hard to encode complex human values…
I’m all for brainstorming different possible approaches and don’t claim to have a good plan, but where I’m at right now is:
(1) I don’t think writing the reward function is doomed, and I don’t think it corresponds to “encoding complex human values”. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, and yet humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).
(2) I do think RLHF-like solutions are doomed, for reasons discussed in §2.4.1.
(3) I also think Text2Reward is a doomed approach in this context, because (IIUC) it’s fundamentally based on what I call “the usual agent debugging loop”; see my “Era of Experience” post, §2.2: “The usual agent debugging loop, and why it will eventually catastrophically fail”. Well, the paper is some combination of that, plus “let’s just sit down and think about what we want and then write a decent reward function, and LLMs can do that kind of thing too”. But in fact I claim that writing such a reward function is a deep and hairy conceptual problem, way beyond anything you’ll find in any RL textbook as of today, and forget about delegating it to LLMs. See §2.4.1 of that same “Era of Experience” post for why I say that.
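To spell out what I mean by “the usual agent debugging loop”, here’s a toy sketch (my own caricature of the process described in that §2.2; every function here is a hypothetical stub, not anyone’s actual code):

```python
# Toy caricature of "the usual agent debugging loop" -- my own sketch, with
# hypothetical stub functions, just to make the shape of the loop explicit.

def write_initial_reward_fn():
    return lambda episode: 0.0               # stand-in reward function

def train_rl_agent(reward_fn):
    return "agent trained against reward_fn"  # stand-in for an RL training run

def inspect_behavior(agent):
    return []                                 # stand-in: list of observed misbehaviors

def patch_reward_fn(reward_fn, problems):
    return reward_fn                          # stand-in: tweak reward to penalize them

reward_fn = write_initial_reward_fn()
while True:
    agent = train_rl_agent(reward_fn)                 # run RL against the current reward
    problems = inspect_behavior(agent)                # look for misbehavior in testing
    if not problems:
        break                                         # looks fine -> deploy
    reward_fn = patch_reward_fn(reward_fn, problems)  # tweak the reward and try again

# The claim in §2.2 is that this loop stops working once the agent is capable
# enough that the misbehavior you would need to catch is deliberately hidden,
# or the first uncaught failure is catastrophic and irreversible.
```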
Thank you for the reply! Ok, but I still feel somewhat more optimistic about reward learning working. Here are some reasons:
It’s often the case that evaluation is easier than generation, which would give the classifier an edge over the generator.
It’s possible to make the classifier just as smart as the generator; this is already done in RLHF today, where the generator is an LLM and the reward model is also based on an LLM (see the sketch after the examples below).
It seems like there are quite a few examples of learned classifiers working well in practice:
It’s hard to write spam that gets past an email spam classifier.
It’s hard to jailbreak LLMs.
It’s hard to write a bad paper that is accepted to a top ML conference or a bad blog post that gets lots of upvotes.
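To make the second point above concrete, here’s a minimal sketch of that setup, with both models initialized from the same base LLM (model choice and details are just illustrative, not any particular lab’s actual pipeline):

```python
# Minimal sketch: in RLHF, the policy (generator) and the reward model
# (classifier) can both start from the same pretrained LLM. "gpt2" is just a
# small stand-in base model for illustration.
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          AutoModelForSequenceClassification)

base = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base)
generator = AutoModelForCausalLM.from_pretrained(base)        # the policy trained with RL
reward_model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1)                                        # scalar "how good is this?" head

prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = generator.generate(**inputs, max_new_tokens=40)
completion = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# The reward model scores the completion; in RLHF, this scalar (after the
# reward head is fine-tuned on human preference comparisons) is what the RL
# algorithm, e.g. PPO, tries to maximize.
score = reward_model(**tokenizer(completion, return_tensors="pt")).logits[0, 0]
print(completion, float(score))
```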
That said, from what I’ve read, researchers doing RL with verifiable rewards on LLMs (e.g. see the DeepSeek-R1 paper) have so far only had success with rule-based rewards rather than learned reward functions. Quote from the DeepSeek-R1 paper:
We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.
So I think we’ll have to wait and see if people can successfully train LLMs to solve hard problems using learned RL reward functions in a way similar to RL with verifiable rewards.
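For concreteness, a rule-based verifiable reward in the math-problem setting looks roughly like this (my own toy illustration of the general recipe, not code from the DeepSeek-R1 paper):

```python
import re

# Toy rule-based "verifiable reward": a fixed program checks the final answer
# against ground truth, so there is no learned reward model to hack.
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)  # answer expected as \boxed{...}
    if match is None:
        return 0.0                                           # unparseable -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}", "42"))  # 1.0
print(verifiable_reward("... so the answer is 42", "42"))           # 0.0 (wrong format)
```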
I’m worried about treacherous turns and such. Part of the problem, as I discussed here, is that there’s no distinction between “negative reward for lying and cheating” and “negative reward for getting caught lying and cheating”, and the latter incentivizes doing egregiously misaligned things (like exfiltrating a copy onto the internet to take over the world) in a sneaky way.
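As a toy illustration of why those two are the same from the training signal’s point of view (my own hypothetical sketch, not code from the linked post):

```python
# Toy sketch: the reward we *want* penalizes cheating itself, but the reward
# we can actually *compute* only penalizes cheating that the monitor detects.
# (Everything here is a hypothetical placeholder for illustration.)

def reward_intended(agent_cheated: bool) -> float:
    return -1.0 if agent_cheated else 0.0

def reward_actual(monitor_detected_cheating: bool) -> float:
    return -1.0 if monitor_detected_cheating else 0.0

# If the agent cheats in a way the monitor misses, reward_actual returns 0.0,
# so the training signal pushes toward "don't get caught" rather than
# "don't cheat":
print(reward_actual(monitor_detected_cheating=False))  # 0.0, even though the agent cheated
```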
Anyway, I don’t think any of the things you mentioned are relevant to that kind of failure mode:
It’s often the case that evaluation is easier than generation, which would give the classifier an edge over the generator.
It’s not easy to evaluate whether an AI would exfiltrate a copy of itself onto the internet given the opportunity, if it doesn’t actually have the opportunity. Obviously you can (and should) try honeypots, but that’s a sanity check, not a plan; see e.g. Distinguishing test from training.
It’s possible to make the classifier just as smart as the generator; this is already done in RLHF today, where the generator is an LLM and the reward model is also based on an LLM.
I don’t think that works for more powerful AIs whose “smartness” involves making foresighted plans using means-end reasoning, brainstorming, and continuous learning.
If the AI in question is using planning and reasoning to decide what to do and think next towards a bad end, then a “just as smart” classifier would (I guess) have to be using planning and reasoning to decide what to do and think next towards a good end—i.e., the “just as smart” classifier would have to be an aligned AGI, which we don’t know how to make.
It seems like there are quite a few examples of learned classifiers working well in practice:
All of these have been developed using “the usual agent debugging loop”, and thus none are relevant to treacherous turns.