I’ll try to respond properly later this week, but I like the point that embedded agency is about boundedness. Nevertheless, I think we probably disagree about how promising it is “to start with idealized rationality and try to drag it down to Earth rather than the other way around”. If the starting point is incoherent, then this approach doesn’t seem like it’ll go far—if AIXI isn’t useful to study, then probably AIXItl isn’t either (although take this particular example with a grain of salt, since I know almost nothing about AIXItl).
I appreciate that this isn’t an argument that I’ve made in a thorough or compelling way yet—I’m working on a post which does so.
Yeah, I should have been much more careful before throwing around words like “real”. See the long comment I just posted for more clarification, and in particular this paragraph:
I’m not trying to argue that concepts which we can’t formalise “aren’t real”, but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can’t formalise, and that it’s those incoherent extrapolations which “aren’t real” (I agree that this was quite unclear in the original post).
I like this review and think it was very helpful in understanding your (Abram’s) perspective, as well as highlighting some flaws in the original post, and ways that I’d been unclear in communicating my intuitions. In the rest of my comment I’ll try to write a synthesis of my intentions for the original post with your comments; I’d be interested in the extent to which you agree or disagree.
We can distinguish between two ways to understand a concept X. For lack of better terminology, I’ll call them “understanding how X functions” and “understanding the nature of X”. I conflated these in the original post in a confusing way.
For example, I’d say that studying how fitness functions would involve looking into the ways in which different components are important for the fitness of existing organisms (e.g. internal organs; circulatory systems; etc). Sometimes you can generalise that knowledge to organisms that don’t yet exist, or even prove things about those components (e.g. there’s probably useful maths connecting graph theory with optimal nerve wiring), but it’s still very grounded in concrete examples. If we thought that we should study how intelligence functions in a similar way as we study how fitness functions, that might look like a combination of cognitive science and machine learning.
By comparison, understanding the nature of X involves performing a conceptual reduction on X by coming up with a theory which is capable of describing X in a more precise or complete way. The pre-theoretic concept of fitness (if it even existed) might have been something like “the number and quality of an organism’s offspring”. Whereas the evolutionary notion of fitness is much more specific, and uses maths to link fitness with other concepts like allele frequency.
Momentum isn’t really a good example to illustrate this distinction, so perhaps we could use another concept from physics, like electricity. We can understand how electricity functions in a lawlike way by understanding the relationship between voltage, resistance and current in a circuit, and so on, even when we don’t know what electricity is. If we thought that we should study how intelligence functions in a similar way as the discoverers of electricity studied how it functions, that might involve doing theoretical RL research. But we also want to understand the nature of electricity (which turns out to be the flow of electrons). Using that knowledge, we can extend our theory of how electricity functions to cases which seem puzzling when we think in terms of voltage, current and resistance in circuits (even if we spend almost all our time still thinking in those terms in practice). This illustrates a more general point: you can understand a lot about how something functions without having a reductionist account of its nature—but not everything. And so in the long term, to understand really well how something functions, you need to understand its nature. (Perhaps understanding how CS algorithms work in practice, versus understanding the conceptual reduction of algorithms to Turing Machines, is another useful example).
I had previously thought that MIRI was trying to understand how intelligence functions. What I take from your review is that MIRI is first trying to understand the nature of intelligence. From this perspective, your earlier objection makes much more sense.
However, I still think that there are different ways you might go about understanding the nature of intelligence, and that “something kind of like rationality realism” might be a crux here (as you mention). One way that you might try to understand the nature of intelligence is by doing mathematical analysis of what happens in the limit of increasing intelligence. I interpret work on AIXI, logical inductors, and decision theory as falling into this category. This type of work feels analogous to some of Einstein’s thought experiments about the limit of increasing speed. Would it have worked for discovering evolution? That is, would starting with a pre-theoretic concept of fitness and doing mathematical analysis of its limiting cases (e.g. by thinking about organisms that lived for arbitrarily long, or had arbitrarily large numbers of children) have helped people come up with evolution? I’m not sure. There’s an argument that Malthus did something like this, by looking at long-term population dynamics. But you could also argue that the key insights leading up to the discovery of evolution were primarily inspired by specific observations about the organisms around us. And in fact, even knowing evolutionary theory, I don’t think that the extreme cases of fitness make sense. So I would say that I am not a realist about “perfect fitness”, even though the concept of fitness itself seems fine.
So an attempted rephrasing of the point I was originally trying to make, given this new terminology, is something like “if we succeed in finding a theory that tells us the nature of intelligence, it still won’t make much sense in the limit, which is the place where MIRI seems to be primarily studying it (with some exceptions, e.g. your Partial Agency sequence). Instead, the best way to get that theory is to study how intelligence functions.”
The reason I called it “rationality realism” not “intelligence realism” is that rationality has connotations of this limit or ideal existing, whereas intelligence doesn’t. You might say that X is very intelligent, and Y is more intelligent than X, without agreeing that perfect intelligence exists. Whereas when we talk about rationality, there’s usually an assumption that “perfect rationality” exists. I’m not trying to argue that concepts which we can’t formalise “aren’t real”, but rather that some concepts become incoherent when extrapolated a long way, and this tends to occur primarily for concepts which we can’t formalise, and that it’s those incoherent extrapolations like “perfect fitness” which “aren’t real” (I agree that this was quite unclear in the original post).
My proposed redefinition:
The “intelligence is intelligible” hypothesis is about how lawlike the best description of how intelligence functions will turn out to be.
The “realism about rationality” hypothesis is about how well-defined intelligence is in the limit (where I think of the limit of intelligence as “perfect rationality”, and “well-defined” with respect not to our current understanding, but rather with respect to the best understanding of the nature of intelligence we’ll ever discover).
Cool, thanks for those clarifications :) In case it didn’t come through from the previous comments, I wanted to make clear that this seems like exciting work and I’m looking forward to hearing how follow-ups go.
Yes, but the fact that the fragile worlds are much more likely to end in the future is a reason to condition your efforts on being in a robust world.
While I do buy Paul’s argument, I think it’d be very helpful if the various summaries of the interviews with him were edited to make it clear that he’s talking about value-conditioned probabilities rather than unconditional probabilities—since the claim as originally stated feels misleading. (Even if some decision theories only use the former, most people think in terms of the latter).
Some abstractions are heavily determined by the territory. The concept of trees is pretty heavily determined by the territory. Whereas the concept of betrayal is determined by the way that human minds function, which is determined by other people’s abstractions. So while it seems reasonably likely to me that an AI “naturally thinks” in terms of the same low-level abstractions as humans, it thinking in terms of human high-level abstractions seems much less likely, absent some type of safety intervention. Which is particularly important because most of the key human values are very high-level abstractions.
I have four concerns even given that you’re using a proper scoring rule, which relate to the link between that scoring rule and actually giving people money. I’m not particularly well-informed on this though, so could be totally wrong.
1. To implement some proper scoring rules, you need the ability to confiscate money from people who predict badly. Even when the score always has the same sign, like you have with log-scoring (or when you add a constant to a quadratic scoring system), if you don’t confiscate money for bad predictions, then you’re basically just giving money to people for signing up, which makes having an open platform tricky.
2. Even if you restrict signups, you get an analogous problem within a fixed population of people who’ve already signed up: the incentives will be skewed when it comes to choosing which questions to answer. In particular, if people expect to get positive amounts of money for answering randomly, they’ll do so even when they have no relevant information, adding a lot of noise.
3. If a scoring rule is “very capped”, as the log-scoring function is, then the expected reward from answering randomly may be very close to the expected reward from putting in a lot of effort, and so people would be incentivised to answer randomly and spend their time on other things.
4. Relatedly, people’s utilities aren’t linear in money, so the score function might not remain a proper one taking that into account. But I don’t think this would be a big effect on the scales this is likely to operate on.
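To make the scoring-rule issues above concrete, here’s a minimal Python sketch, using one common parameterisation of each rule (my choice for illustration, not necessarily the platform’s actual payout formula):

```python
import math

def log_score(p, outcome):
    """Logarithmic scoring rule: reward is the log of the probability
    assigned to the outcome that actually occurred. Always <= 0, so
    paying it out directly requires confiscating money from forecasters."""
    return math.log(p if outcome else 1 - p)

def quadratic_score(p, outcome):
    """Quadratic (Brier-style) scoring rule, shifted into [0, 1]:
    always non-negative, so a random guesser still earns money on average."""
    return 1 - (p - (1 if outcome else 0)) ** 2

# A random guesser (p = 0.5), regardless of the outcome:
print(log_score(0.5, True))        # negative: they lose money
print(quadratic_score(0.5, True))  # 0.75: they get paid for pure noise
```

Both rules are proper (honest reporting maximises expected score), but note the sign difference: the log score forces confiscation, while the shifted quadratic score pays out for signing up, which is exactly the tension in points 1 and 2.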
Apologies for the mischaracterisation. I’ve changed this to refer to Scott Alexander’s post which predicts this pressure.
Actually, the key difference between this and prediction markets seems to be that this has no downside risk: you can’t lose money for bad predictions. So you could exploit it by only making extreme predictions, which would make a lot of money sometimes without losing money in the other cases. Or by making fake accounts to drag the average down.
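A toy simulation of that exploit, under a hypothetical payout rule of my own construction (I don’t know the platform’s actual rule): you earn the amount by which you beat the crowd’s Brier score, but losses are floored at zero.

```python
import random
random.seed(0)

def brier(p, outcome):
    # Brier loss: squared distance from the realised outcome (0 or 1).
    return (p - (1 if outcome else 0)) ** 2

def no_downside_payout(p, crowd_p, outcome):
    # Hypothetical rule: paid your Brier improvement over the crowd,
    # but never charged when you do worse.
    return max(0.0, brier(crowd_p, outcome) - brier(p, outcome))

# Fair coin; the crowd is perfectly calibrated at 0.5.
trials = [random.random() < 0.5 for _ in range(100_000)]
extreme = sum(no_downside_payout(1.0, 0.5, o) for o in trials) / len(trials)
honest = sum(no_downside_payout(0.5, 0.5, o) for o in trials) / len(trials)
print(extreme, honest)  # ~0.125 vs 0.0
```

The honest, perfectly calibrated forecaster earns nothing, while the forecaster who always shouts “100%!” extracts money from pure noise, because their losing bets are forgiven.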
Another point: prediction markets allow you to bet more if you’re more confident the market is off. This doesn’t, except by betting that the market is further off, which is different. But I don’t know if that matters very much; you could probably recreate that dynamic by letting people weight their own predictions.
Okay, so in quite a few cases the forecasters spent more time on a question than Elizabeth did? That seems like an important point to mention.
My interpretation: there’s no such thing as negative value of information. If the mean of the crowdworkers’ estimates were reliably in the wrong direction (compared with Elizabeth’s prior) then that would allow you to update Elizabeth’s prior to make it more accurate.
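A tiny simulation of that point (my own toy construction, not Elizabeth’s actual setup): a signal that is reliably wrong is just as informative as one that is reliably right, once you know to flip it.

```python
import random
random.seed(1)

# The truth is a coin flip; the "crowd" agrees with it only 30% of the time,
# i.e. it is reliably wrong in a known direction.
truth = [random.random() < 0.5 for _ in range(100_000)]
crowd = [t if random.random() < 0.3 else not t for t in truth]

raw_accuracy = sum(c == t for c, t in zip(crowd, truth)) / len(truth)
flipped_accuracy = sum((not c) == t for c, t in zip(crowd, truth)) / len(truth)
print(raw_accuracy, flipped_accuracy)  # ~0.30 vs ~0.70
```

So a crowd that’s anti-correlated with the truth would still have positive value of information relative to the prior; the only genuinely useless signal is an uncorrelated one.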
So the thing I’m wondering here is what makes this “amplification” in more than a trivial sense. Let me think out loud for a bit. Warning: very rambly.
Let’s say you’re a competent researcher and you want to find out the answers to 100 questions, which you don’t have time to investigate yourself. The obvious strategy here is to hire 10 people, get them to investigate 10 questions each, and then pay them based on how valuable you think their research was. Or, perhaps you don’t even need to assign them questions—perhaps they can pick their own questions, and you can factor in how neglected each question was as part of the value-of-research calculation.
This is the standard, “freeform” approach; it’s “amplification” in the same sense that having employees is always amplification. What does the forecasting approach change?
It gives one specific mechanism for how you (the boss) evaluate the quality of research (by comparison with your own deep dive), and rules out all the others. This has the advantage of simplicity and transparency, but has the disadvantage that you can’t directly give rewards for other criteria like “how well is this explained”. You also can’t reward research on topics that you don’t do deep dives on.
This mainly seems valuable if you don’t trust your own ability to evaluate research in an unbiased way. But evaluating research is usually much easier than doing research! In particular, doing research involves evaluating a whole bunch of previous literature.
Further, if one of your subordinates thinks you’re systematically biased, then the forecasting approach doesn’t give them a mechanism to get rewarded for telling you that. Whereas in the freeform approach to evaluating the quality of research, you can take that into account in your value calculation.
It gives one specific mechanism for how you aggregate all the research you receive. But that doesn’t matter very much, since you’re not bound to that—you can do whatever you like with the research after you’ve received it. And in the freeform approach, you’re also able to ask people to produce probability distributions if you think that’ll be useful for you to aggregate their research.
It might save you time? But I don’t think that’s true in general. Sure, if you use the strategy of reading everyone’s research then grading it, that might take a long time. But since the forecasting approach is highly stochastic (people only get rewards for questions you randomly choose to do a deep dive on) you can be a little bit stochastic in other ways to save time. And presumably there are lots of other grading strategies you could use if you wanted.
Okay, let’s take another tack. What makes prediction markets work?
1. Anyone with relevant information can use that information to make money, if the market is wrong.
2. People can see the current market value.
3. They don’t have to reveal their information to make money.
4. They know that there’s no bias in the evaluation—if their information is good, it’s graded by reality, not by some gatekeeper.
5. They don’t actually have to get the whole question right—they can just predict a short-term market movement (“this stock is currently undervalued”) and then make money off that.
This forecasting setup also features 1 and 2. Whether or not it features 3 depends on whether you (the boss) manage to find that information by yourself in the deep dive. And 4 also depends on that. I don’t know whether 5 holds, but I also don’t know whether it’s important.
So, for the sort of questions we want to ask, is there significant private or hard-to-communicate information?
If yes, then people will worry that you won’t find it during your deep dive.
If no, then you likely don’t have any advantage over others who are betting.
If it’s in the sweet spot where it’s private but the investigator would find it during their deep dive, then people with that private information have the right incentives.
If either of the first two options holds, then the forecasting approach might still have an advantage over a freeform approach, because people can see the current best guess when they make their own predictions. Is that visibility important, for the wisdom of crowds to work—or does it work even if everyone submits their probability distributions independently? I don’t know—that seems like a crucial question.
Anyway, to summarise, I think it’s worth comparing this more explicitly to the most straightforward alternative, which is “ask people to send you information and probability distributions, then use your intuition or expertise or whatever other criteria you like to calculate how valuable their submission is, then send them a proportional amount of money.”
Perhaps I missed this, but how long were the forecasters expected to spend per claim?
I broadly agree with the sentiment of this post, that GPT-2 and BERT tell us new things about language. I don’t think this claim relies on the fact that they’re transformers though—and am skeptical when you say that “the transformer architecture was a real representational advance”, and that “You need the right architecture”. In your post on transformers, you noted that transformers are supersets of CNNs, but with fewer inductive biases. But I don’t think of removing inductive biases as representational advances—or else getting MLPs to work well would be an even bigger representational advance than transformers! Rather, what we’re doing is confessing as much ignorance about the correct inductive biases as we can get away with (without running out of compute).
Concretely, I’d predict with ~80% confidence that within 3 years, we’ll be able to achieve comparable performance to our current best language models without using transformers—say, by only using something built of CNNs and LSTMs, plus better optimisation and regularisation techniques. Would you agree or disagree with this prediction?
Note that Val’s confusion seems to have been because he misunderstood Oli’s point.
+1, I would have written my own review, but I think I basically just agree with everything in this one (and to the extent I wanted to further elaborate on the post, I’ve already done so here).
This post provides a useful conceptual handle for zooming in on what’s actually happening when I get distracted, or procrastinate. Noticing this feeling has been a helpful step in preventing it.
This post directly addresses what I think is the biggest conceptual hole in our current understanding of AGI: what type of goals will it have, and why? I think it’s been important in pushing people away from unhelpful EU-maximisation framings, and towards more nuanced and useful ways of thinking about goals.