My thoughts on OpenAI’s alignment plan

Epistemic Status: This is my first attempt at writing up my thoughts on an alignment plan. I spent about a week on it.

I’m grateful to Olivia Jimenez, Thomas Larsen, and Nicholas Dupuis for feedback.

A few months ago, OpenAI released its plan for alignment. More recently, Jan Leike (one of the authors of the original post) released a blog post about the plan, and Eliezer & Nate encouraged readers to write up their thoughts.

In this post, I cover some thoughts I have about the OpenAI plan. This is a long post, and I’ve divided it into a few sections. Each section is more specific and detailed than the last. If you only have ~5 minutes, I suggest reading section 1 and skimming section 2.

The three sections:

  1. An overview of the plan and some of my high-level takes (here)

  2. Some things I like about the plan, some concerns, and some open questions (here)

  3. Specific responses to claims in OpenAI’s post and Jan’s post (here)

Section 1: High-level takes

Summary of OpenAI’s Alignment Plan

As I understand it, OpenAI’s plan involves using reinforcement learning from human feedback and recursive reward modeling to build AI systems that are better than humans at alignment research. OpenAI’s plan is not aiming for a full solution to alignment (that scales indefinitely or that could work on a superintelligent system). Rather, the plan is intended to (a) build systems that are better at alignment research than humans (AI assistants), (b) use these AI assistants to accelerate alignment research, and (c) use these systems to build/align more powerful AI assistants.

Six things that need to happen for the plan to work

For the plan to work, I think it needs to get through the following 6 steps:

  1. OpenAI builds LLMs that can help us with alignment research

  2. OpenAI uses those models primarily to help us with alignment research, and we slow down/​stop capabilities research when necessary

  3. OpenAI has a way to evaluate the alignment strategies proposed by the LLM

  4. OpenAI has a way to determine when it’s OK to scale up to more powerful AI assistants (and OpenAI has policies in place that prevent people from scaling up before it’s OK)

  5. Once OpenAI has a highly powerful and aligned system, they do something with it that gets us out of the acute risk period

  6. Once we are out of the acute risk period, OpenAI has a plan for how to use transformative AI in ways that allow humanity to “fulfill its potential” or “achieve human flourishing” (and they have a plan to figure out what we should even be aiming for)

If any of these steps goes wrong, I expect the plan will fail to avoid an existential catastrophe. Note that steps 1-5 must be completed before another actor builds an unaligned AGI.
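To make the conjunctive structure of these six steps concrete, here is a minimal sketch of how quickly the odds shrink when every step has to succeed. It assumes the steps are independent and uses made-up per-step probabilities; neither the independence assumption nor the numbers come from OpenAI's posts or from my own estimates, and `chance_all_succeed` is just an illustrative helper.

```python
from math import prod

N_STEPS = 6  # the six steps listed above


def chance_all_succeed(step_probabilities):
    """Probability that every step succeeds, assuming the steps are independent."""
    return prod(step_probabilities)


# Hypothetical per-step probabilities; the post assigns no numbers to any step.
for p in (0.9, 0.7, 0.5):
    overall = chance_all_succeed([p] * N_STEPS)
    print(f"each step {p:.0%} likely -> whole plan {overall:.1%} likely")
```

Under these assumptions, even a 90% chance on each step leaves only about a 53% chance that the whole chain holds. If the steps are correlated, the real picture could be better or worse than this simple product suggests.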

Here’s a table that lists each step and the extent to which I think it’s covered by the two posts about the OpenAI plan:


| Step | Extent to which this step is discussed in the OpenAI plan |
| --- | --- |
| Build an AI assistant | Adequate; OpenAI acknowledges that this is their goal, mentions RLHF and RRM as techniques that could help us achieve this goal, and mentions reasonable limitations |
| Use the assistant for alignment research (and slow down capabilities) | Inadequate; OpenAI mentions that they will use the assistant for alignment research, but they don’t describe how they will slow down capabilities research if necessary or how they plan to shift the current capabilities-alignment balance |
| Evaluate the alignment strategies proposed by the assistant | Unclear; OpenAI acknowledges that they will need to evaluate strategies from the AI assistant, though the particular metrics they mention seem unlikely to detect alignment concerns that may only come up at high capability levels (e.g., deception, situational awareness) |
| Figure out when it’s OK to scale up to more powerful AI assistants | Unaddressed; the current OpenAI plan doesn’t include a way to handle this step |
| Use AI to get us out of the acute risk period | Unaddressed; the current OpenAI plan doesn’t include a way to address this step |
| Figure out what humanity wants the future to look like & use AI to achieve a positive future | Unaddressed; the current OpenAI plan doesn’t include a way to handle this step |

Section 2: Likes, concerns, and open questions

What I like about the plan

There are a few specific things that I like about this plan (and some of the underlying logic that supports it). First, I think it’s reasonable to aim for producing aligned AI assistants. This goal is more tractable than trying to come up with a solution to the full alignment problem, and OpenAI seems especially well-positioned to perform work that figures out how to use language models to advance alignment research. Second, I agree with Jan’s “conviction in language models”: it seems plausible that language models will scale to artificial general intelligence (or “narrow” systems that are still powerful enough to transform the world). On the margin, I’d be excited for more alignment researchers to think seriously about how AI research assistants could be developed and deployed in ways that maximize their impact on alignment research while minimizing their impact on accelerating AGI timelines. Third, I strongly agree with the sentiment that the burden of proof should be on “showing that a new system is sufficiently aligned, and we cannot shift the burden of proof to showing the situation has changed compared to earlier systems.”

Overview of my main concerns

I also have some concerns with the plan. First, I worry that AI assistants that are capable of meaningfully helping with alignment research will also be able to meaningfully advance AI capabilities research, and labs will have strong incentives to use these research assistants to make even more powerful AI systems.

Second, and relatedly, I am concerned about race dynamics between major AI labs. Given that there are multiple AI projects, there are a few important problems: (a) each actor will be incentivized to advance capabilities in order to stay competitive with others, (b) the situation produces a unilateralist curse, in which only one actor needs to make a mistake in order to end the world, and (c) the fear that others will make a mistake will further incentivize labs to advance capabilities (“we can’t let the other less safety-conscious labs get ahead of us.”)

Third, OpenAI’s plan relies on using clear metrics that allow us to evaluate alignment progress. However, the metrics that Jan mentions are unlikely to scale to powerful systems that develop incentives to deceive us and play the training game.

Fourth, OpenAI’s plan relies on the leading AGI lab scoring well on many dimensions of operational adequacy. The lab implementing OpenAI’s plan will have to make many challenging decisions about (a) how far to push capabilities to develop a useful AI assistant, (b) when to slow down capabilities, (c) how to use the research assistant, (d) how to evaluate ideas from the AI assistant, and (e) when to build/​deploy models that may be powerful enough to overpower humanity. To execute this plan successfully, I expect it will require labs to have leaders with an unusually-strong security mindset, unusually-good information security and operations security (to prevent important insights from leaking), and an unusually-strong safety culture.

Questions about the plan

With this in mind, there are some questions that, if addressed, would make me more optimistic about OpenAI’s plan. I would be excited to see these addressed by OpenAI, other labs, AI strategy/​governance researchers, or others who are interested in AI-assisted alignment agendas.

Balance between capabilities research and alignment research

  1. Does OpenAI intend to maintain its current balance between capabilities research and alignment research? At what point would OpenAI slow down capabilities research and shift the capabilities-alignment balance? For example, if we develop an AI assistant that is capable of 3Xing research productivity, would this tool simply perpetuate the current balance between capabilities research and alignment research?

  2. What process will be used to determine whether or not to “speed ahead” or “slow down” capabilities research? For example, if GPT-N is an AI assistant that can substantially boost research productivity, what procedure will be used to determine whether we should try to build GPT-N+1? Who will be responsible for making these decisions, and what processes will they use?

  3. How will OpenAI respond to developments from other AGI labs? What would happen if an actor that was perceived as “not safety-conscious” was revealed to be slightly behind (or ahead of) OpenAI? What would happen if there was a partially-substantiated rumor along these lines? Under what circumstances would OpenAI decide to “speed ahead”, and under what circumstances is it committed to “slowing down?”

Metrics, evaluation, and iteration

  1. What specific metrics will help us know that our alignment strategies are working (and are likely to keep working for stronger models)?

  2. If these metrics aren’t robust to discontinuities, sharp left turns, and deception, how will we either (a) develop metrics that are robust to these concerns or (b) perform experiments that rule out the possibility of these concerns?

  3. If AI assistants produce plans, how will we evaluate these plans? Who will be responsible for making these decisions, and what processes will they use? To what extent will safety-conscious people outside of OpenAI be consulted?


  1. What questions would we plan to ask our first AI assistant?

  2. How would OpenAI ensure sufficient information siloing, operations security, and other dimensions of operational adequacy?

  3. How will OpenAI balance the desire to use systems to improve the world (e.g., build and deploy a system capable of performing medical research) with the desire to not launch potentially-world-ending systems? How will OpenAI decide which systems to deploy or share?

  4. Assuming OpenAI develops an aligned and highly intelligent system, what does it plan to do with this system? How will it protect the world from the development of unaligned AIs and other existential catastrophes? Will OpenAI commit to producing a long reflection that aims to produce coherent extrapolated volition, or do they have a different plan?

For the rest of the post, I expand on some of the areas where I agree with the plan (or Jan’s post) and disagree with the plan (or have a critique/​uncertainty that reduces my optimism).

Section 3: Responses to specific claims

Things I agree with (or generally support)

Stylistic note: Statements in the numbered list are summaries of points made in the posts by Jan/​OpenAI. Occasionally, I add direct quotes.

  1. We do not currently have a solution to the alignment problem that scales indefinitely.

“There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.”

2. We do not need to come up with a full solution to alignment that scales indefinitely. Instead, we can aim to achieve the “modest” goal of aligning systems that are sufficiently better/​faster at alignment research than humans (such that they can align more capable successor systems).

3. The burden of proof should be on making sure that a new system is aligned. This is especially true when the fate of humanity is on the line.

“The burden of proof is always on showing that a new system is sufficiently aligned and we cannot shift the burden of proof to showing that the situation has changed compared to earlier systems.”

4. Empirical findings thus far should be interpreted cautiously. Many threat models of AI x-risk emphasize that the most difficult and dangerous problems (e.g., instrumentally convergent goals, deception, and power-seeking) will only emerge at high levels of capabilities.

“Let’s not get carried away by this evidence. Just because it has been favorable so far, doesn’t mean it will continue to be. AI systems aren’t yet smarter than us, so we’re not facing the real problems yet.”

5. The parts of the plan that rely heavily on empirical research and iteration will not work if there is a sharp left turn (or other major discontinuities in AI progress).

“If there are major discontinuities or paradigm shifts, then most lessons learned from aligning models like InstructGPT might not be directly useful.”

6. The least capable models that can (meaningfully) help with alignment research might already be dangerous.

“It might not be fundamentally easier to align models that can meaningfully accelerate alignment research than it is to align AGI. In other words, the least capable models that can help with alignment research might already be too dangerous if not properly aligned. If this is true, we won’t get much help from our own systems for solving alignment problems.”

7. It should be easier to align “narrower” AI systems than to align fully general AGI.

“Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems are easier to align than general-purpose systems or systems much smarter than humans.”

8. The fact that we may achieve AGI through large language models is good news (compared to the world in which we would have achieved AGI through deep RL agents that were maximizing score functions). I’m sympathetic to the arguments that RL produces agents, while self-supervised learning produces systems that look more like simulators, and agents might be more difficult to align.

9. Evaluation is generally easier than generation. We should expect this to be true for at least some kinds of alignment research.

10. Language models may be sufficient to get to AGI.

11. There might be ways to make AI assistants that are differentially useful for alignment research.

“Compared to ML research, alignment research is much more pre-paradigmatic and needs to sort out its fundamentals. The kind of tasks that help crystallize what the right paths, concepts, formalisms and cognitive tools are would be more differentially helpful to alignment.”

In addition to the reason Jan provides (above), it seems like some types of alignment research require a different “type of thinking” compared to standard ML research. Alignment research may be bottlenecked on things like “lacking theoretical insights” and “not having the right frame on the problem” as opposed to things like “not having enough iteration” or “not having enough time to run experiments.” Insofar as the type of cognition required to do alignment research is different from the type of cognition required to do capabilities research, it seems at least plausible that we could develop a system that was better at alignment research than capabilities research.

Disagreements, critiques, & uncertainties

Stylistic note: Statements in the numbered list are statements that I endorse. Quotes are statements made by Jan/​OpenAI that I disagree with or have uncertainties about.

  1. By default, powerful AI research assistants will be used primarily for capabilities research.

“Once we have an automated alignment researcher, the most important and urgent research will be to make its successor more aligned than itself.”

In an ideal world, we would get an automated alignment researcher, and then humanity would decide to prioritize alignment research. We would only build more powerful systems if we had guarantees that the successor systems would be aligned. We would be willing to pay considerable costs (e.g., slowing capabilities progress down by years) in order to ensure that we have alignment solutions that will work on stronger models.

The OpenAI plan only works if the leading labs choose to use those AI assistants differentially for alignment progress.

In practice, I do not expect this to happen. There are a few reasons why:

A. In the status quo, the balance between capabilities and alignment vastly favors capabilities: current AGI companies have a track record of spending more resources on capabilities research than on alignment research. Currently, OpenAI employs about 100 researchers who work primarily on capabilities and 30 researchers who work primarily on alignment. Estimates from 80,000 Hours suggest that there are only ~300 researchers working full-time on AI alignment in the world, and AI capabilities research receives ~1000X more funding than AI safety work. Therefore, by default, I would expect AI assistants to primarily help with capabilities research. For OpenAI’s plan to work, I expect that there would need to be major changes in the balance between capabilities research and alignment research.

B. Race dynamics may worsen this balance and increase incentives for capabilities research: I expect race dynamics to get worse as we get closer to AGI. It seems plausible that a few months after OpenAI gets powerful research assistants, another lab will catch up. This places OpenAI in a difficult position: the more cautious they are (i.e., the less they use their assistant to advance capabilities research), the more likely they are to be overtaken by another (potentially less safety-conscious) lab. Even the perception of race dynamics may pressure labs into advancing capabilities (when certain alignment solutions are still shaky-at-best).

i. I’ll note, however, that it is plausible that race dynamics become less concerning over time. For example, it is possible that some alignment difficulties become so clear that the government enacts regulations, or all of the major AGI labs agree to extremely strong plans to slow down. I’m not particularly optimistic about these scenarios because (a) I don’t expect alignment difficulties will become extremely clear until systems are highly dangerous, (b) even if some alignment difficulties become clear, I don’t think we’ll achieve a consensus around how important it is to slow down, (c) given the pace of AI development, and how complicated technical AI safety is, it seems unlikely that governments will be able to react with useful measures in time (as opposed to passing things that superficially appear to advance some broad notion of AI safety but don’t actually prevent AI takeover), and (d) even if several major AI labs agreed to strict safety regulations, it only takes one lab to defect; by ignoring some of the cautious safety provisions, this less safety-conscious lab might be able to “win the race”. Nonetheless, I would be excited (and update my view on how hard coordination will be) if we see labs seriously coordinating on things like publication policies, information-sharing policies, model evaluation policies/​committees, model-sharing policies, windfall clauses, and “merge and assist” clauses.

C. The fact that capabilities advances can be justified as helping alignment research will worsen this balance and increase incentives for capabilities research: In an ideal world, we would be able to create the most powerful AI assistant that does not kill us, then we would stop capabilities research, and then we would use this powerful system to help us solve alignment. I expect that this logic will be used to justify using AI assistants to build more powerful AI assistants. A well-intentioned researcher might think “GPT-N is helping speed up alignment research by 2X. But if we just scaled it a little bit more, or made this particular tweak, it seems like it’ll speed up alignment research by 10X. And we’re pretty confident it’s safe. I mean, after all, all of the existing models have been safe. And the current AI assistants don’t see any major problems. Let’s do it.” And then we create a system that ends the world. Note that race dynamics play a role here as well; even if one lab is satisfied with an AI assistant that is only capable of verifying proofs, another lab might decide that they need a system that is capable of generating novel alignment ideas, and it only takes one lab to build a model that destroys the world.

D. The fact that systems will be able to help humanity will worsen this balance and increase incentives for capabilities research. Imagine that we had a system that was capable of curing diseases, reducing poverty, mitigating climate change, and reducing the likelihood of war. It seems plausible that systems that are good enough to (meaningfully) help with alignment research will be powerful enough to achieve several of these goals (at minimum, I expect they should be able to accelerate research across a variety of domains). Now imagine the enormous pressure that AGI labs will face to build and use technologies for these purposes. Sure, there are people on the Alignment Forum who are worried about existential risk, but if we just [scale/implement X architectural improvement/hook Y up to the internet/give Z some resources], we’d be able to [cure cancer/save 10 million lives/build energy technology that massively reduces our carbon output].
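The status-quo argument in point A can be made concrete with a little arithmetic. The sketch below uses the researcher counts quoted in point A (about 100 on capabilities, 30 on alignment) and treats an AI assistant as a simple multiplier on each group's output. The multiplier model, the specific 3X values, and the `alignment_share` helper are my own illustrative assumptions, not anything OpenAI has described.

```python
CAPABILITIES_RESEARCHERS = 100  # figure quoted in point A
ALIGNMENT_RESEARCHERS = 30      # figure quoted in point A


def alignment_share(cap_multiplier, align_multiplier):
    """Alignment's share of total effective research effort.

    The multipliers are hypothetical productivity boosts from an AI assistant,
    applied to headcount as a crude proxy for research output.
    """
    cap = CAPABILITIES_RESEARCHERS * cap_multiplier
    align = ALIGNMENT_RESEARCHERS * align_multiplier
    return align / (cap + align)


print(f"today:                    {alignment_share(1, 1):.1%}")
print(f"3X boost to both:         {alignment_share(3, 3):.1%}")
print(f"3X boost, alignment only: {alignment_share(1, 3):.1%}")
```

In this toy model, a uniform 3X boost leaves alignment's share unchanged at roughly 23%, while the absolute gap in effective researchers grows from 70 to 210. The share only moves if the assistant is used differentially for alignment, which is the part I do not expect to happen by default.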

2. The OpenAI plan seems (much) worse if we have sharp left turns or major discontinuities.

“We expect the transition to be somewhat continuous, but if there are major discontinuities or paradigm shifts, then most lessons learned from aligning models like InstructGPT might not be directly useful.”

If capabilities advances are not continuous, a few parts of the OpenAI plan become considerably weaker. First, we’re no longer able to iterate as well. Alignment techniques that are useful on weak(er) systems become considerably less likely to generalize to stronger systems, and we might never be able to do experiments on powerful systems (in a very discontinuous world, problems like situational awareness and deception are never revealed until we build an AGI capable of overpowering humanity).

Second, we’re less likely to develop highly intelligent AI assistants (until it’s too late). If AI systems need to stumble upon core consequentialist reasoning abilities before they are able to make useful scientific contributions, we may never develop sufficiently-helpful AI assistants (until we have already developed a system capable of overpowering humanity). Furthermore, even if we do develop an intelligent AI assistant, this assistant is less likely to come up with useful ideas in a discontinuous world. A 100 IQ system might be able to develop alignment techniques that work for a 102 IQ system but fail to develop alignment techniques that work for a 200 IQ system. In worlds with discontinuous progress, GPT-N might be the “100 IQ” system and GPT-N+1 (the next system we develop) might be the “200 IQ” system.

So, will progress be continuous, or will it be discontinuous? I don’t have a strong inside view here, but I think this is one of the most important things to figure out in order to evaluate the OpenAI plan. On one hand, asking for evidence that progress will be continuous seems epistemically backward: perhaps the burden of proof should be on “sharp left turn advocates” to produce compelling evidence. On the other hand, given that the stakes are so high, I think it’s reasonable to ask that we have evidence against the sharp left turn hypothesis before we take irreversible actions that have world-changing implications (in Jan’s words, “the burden of proof is always on showing that a new system is sufficiently aligned”).

With this in mind, I’d be excited for OpenAI to say more about (a) their plan for figuring out the likelihood of discontinuous progress (e.g., specific research ideas that could help us verify or invalidate the sharp left turn hypothesis), (b) how their alignment plan would change if we get evidence that the sharp left turn seems plausible/likely, and (c) how they would proceed if we continue to be uncertain about how continuous progress will be.

3. The incentive to develop an aligned AGI is not strong enough, and it will not by default produce differential alignment progress.

“In general, everyone who is developing AGI has an incentive to make it aligned with them. This means they’d be incentivized to allocate resources towards alignment and thus the easier we can make it to do this, the more likely it is for them to follow this incentives.”

I agree that everyone developing AGI has an incentive to make it aligned with them. However, they have various other incentives, such as: (a) an incentive to make money from AGI, (b) an incentive to make sure that AGI is deployed responsibly or in accordance with their values, (c) an incentive to make sure that AGI is developed by a safety-conscious actor, (d) an incentive to keep their employees and investors satisfied, (e) social incentives to do things that their peers would consider “cool” and avoid doing things that seem “weird”, and (f) incentives to leverage AI to improve/shape the world (e.g., reduce suffering, build exciting new technologies).

My concern is not that someone will lack the incentive to build an AGI aligned with them. My concern is that Jan’s statement is overly simplistic, and it ignores the fact that there are plenty of incentives that push against a cautious security mindset, especially if doing so requires overcoming the fear of lonely dissent.

On top of this, recall that we’re not going to know whether or not a given system will be aligned. There will be a large amount of uncertainty around a given training run, with MIRI folks saying “this is dangerous—don’t do it!”, optimists saying “this seems fine and the concerns are largely speculative”, and many people falling in-between.

My concern is that in the face of this uncertainty, even people with the incentive to build AGI that is aligned with them can fail to properly invest in alignment. If in the status quo, we saw organizations that scored highly on dimensions of operational adequacy, and there was substantial peer pressure pushing people toward alignment (or at least mainstream researchers could mention alignment concerns without fear of feeling embarrassed as their colleagues say “wait, you actually think AI is going to kill us all? Like the Terminator?”), and there weren’t any race dynamics to worry about, then I’d be more confident that the incentive to build an aligned system would be sufficient.

4. We might require AI assistants that are not clearly-safe

“A hypothetical example of a pretty safe AI system that is clearly useful to alignment research is a theorem-proving engine: given a formal mathematical statement, it produces a proof or counterexample.”

I agree that a theorem-proving engine would be (a) safe and (b) helpful for alignment research.

Of course, the problem with a theorem-proving engine is that it can only prove theorems. Humans still need to identify useful alignment agendas, make enough progress on these agendas such that they can be mathematically formalized, and then supply a concrete mathematical statement.

It seems plausible that at some point we will require systems that are not guaranteed-to-be-safe. For example, we might need AI assistants to (a) identify entirely new problems, (b) propose new research agendas, (c) build or modify systems in ways that we don’t understand, or (d) come up with entirely new research methods or paradigms.

Would such systems definitely be dangerous? No one knows for sure how intelligent a system will need to be to be capable of overpowering humanity. But as an upper bound, once we get systems that are able to perform research better than humans, I think we should be concerned, and we can at least conclude that such a system is not-guaranteed-to-be-safe. A straightforward “AI takeover scenario” would involve the AI using its research abilities to develop dangerous new technologies (though a system with strong scientific reasoning and world-modeling might be able to overpower humanity through less direct and more creative means as well).

Would we need to build systems that are capable of overpowering humanity? Maybe. It would be terrific if we could find a scalable solution for alignment with systems that are not capable of destroying the world. However, it’s possible that we will need systems that would be able to unlock parts of the alignment tech tree that humans alone would not be able to discover (or would only have been able to discover over hundreds or thousands of years).

If alignment requires such systems (or if some people think alignment requires such systems), we will not be able to rely on clearly-safe AI assistants.

5. Evaluation may be easier than generation, but it’s still very hard

“The alignment research community disagrees whether these metrics really point in the right direction, but they can verify that we’re making progress on our shorter-term goals.”

If we had a set of metrics or a set of short-term goals that clearly indicated progress on alignment, I would be more optimistic about Jan’s emphasis on iteration.

However, in practice, I don’t think we have a set of “short-term goals” that tracks progress on things like “how good are our alignment strategies at aligning powerful systems with the potential to overpower humanity?”

I’m not quite sure which shorter-term goals Jan considers relevant. I agree that the alignment community will be able to verify things like “to what extent do InstructGPT and ChatGPT adhere to human preferences?” or “to what extent do weak models help humans with evaluation?”, but these don’t seem like metrics that will track core parts of the alignment problem that are expected to arise with more powerful systems. In other words, my model of “we might at some point get a system that is situationally aware, power-seeking, and deceptive” is relatively unaffected by data along the lines of “RLHF improved the extent to which InstructGPT was rated useful by humans.”

If there are metrics that can track our progress toward detecting deception, inner misalignment, situational awareness, or power-seeking, these would be extremely valuable. The closest thing I saw was the discriminator-critique gap, but I don’t expect this to scale to powerful systems that have incentives to deceive us or play the training game.

With this in mind, perhaps the current metrics might be enough to get us to a system that can meaningfully help us with alignment research, and then this system can help us come up with better metrics. (We don’t need to find a set of metrics that scales indefinitely; we just need to find a set of metrics that helps us build a sufficiently-powerful AI assistant).

I think this is a reasonable goal. But I’d be excited for OpenAI (or others) to say more about (a) which specific metrics they are most excited about, (b) the extent to which they expect these metrics to work at higher levels of capabilities (e.g., work on models that develop situational awareness), and (c) if we have an aligned AI assistant that is tasked with generating new metrics, how we would evaluate the extent to which the metrics would work on more powerful systems.

6. It may be more difficult to automate alignment than to automate capabilities research.

“Compared to ML research, alignment research is much more pre-paradigmatic and needs to sort out its fundamentals. The kind of tasks that help crystallize what the right paths, concepts, formalisms and cognitive tools are would be more differentially helpful to alignment… ML progress is largely driven by compute, not research.”

There are some reasons to believe that automating alignment research may be easier than automating capabilities research. For example, (a) much more effort has gone into capabilities research thus far (so alignment is more neglected and there may be some low-hanging fruit for AI assistants to find), (b) ML progress has been driven by compute, not research, and (c) AI assistants might be uniquely good at discovering new paradigms (as opposed to doing empirical work).

However, there are also strong reasons to believe that automating alignment research will be harder than automating capabilities research. For example, (a) capabilities research has much clearer benchmarks and metrics than alignment research, (b) capable systems will be able to produce a lot of economic value for AI labs, giving them much more access to compute, (c) AI assistants might be uniquely good at running many experiments extremely quickly (as opposed to doing conceptual work), and (d) progress on capabilities research might be easier to verify than progress on alignment research.

It’s tricky to weigh these arguments against each other, but I would currently bet on the side of “automating alignment is harder than automating capabilities.” In particular, I think that systems will be especially good at helping us in areas where (a) iteration/​speed are valuable, (b) problems are concrete and progress is easy to verify, and (c) there are clear metrics to pursue.

7. The shift from RL agents to LLMs is good, but I expect we’ll have strong incentives to turn LLMs into agents (and we’re already fine-tuning LLMs with RL).

“Large language models (LLMs) make this a lot easier: they come preloaded with a lot of humanity’s knowledge, including detailed knowledge about human preferences and values. Out of the box they aren’t agents who are trying to pursue their own goals in the world.”

I agree with Jan’s assessment that LLMs seem easier to align than RL agents, as they’re not pursuing their own goals (at least “out of the box”). However, I expect we will have strong incentives to turn LLMs into agents. I buy several of Gwern’s arguments: agents are faster and more useful for a variety of economic tasks, agents are more intelligent and more powerful than non-agents, agents can generate higher-quality data to train on, and agents will improve more quickly than non-agents.

Therefore, even though I consider us fortunate to be in a world where we might build AGI primarily through non-RL methods, I think the pressure to use RL to turn these systems into agents is still very high. I’d be more optimistic if any of the following were true: (a) there were one AGI project with extremely high operational adequacy, (b) the government mandated a set of reasonable safety standards, likely involving monitoring/auditing of training procedures, (c) we could avoid race dynamics and unilateralist curses, or (d) there were broader consensus about the legitimacy and importance of AI x-risk (e.g., among mainstream academics and public intellectuals).

8. Even if we get automated alignment assistants, we will still need to ask people to stop capabilities work.

“Right now alignment research is mostly talent-constrained. Once we reach a significant degree of automation, we can much more easily reallocate GPUs between alignment and capability research. In particular, whenever our alignment techniques are inadequate, we can spend more compute on improving them. Additional resources are much easier to request than requesting that other people stop doing something they are excited about but that our alignment techniques are inadequate for.”

I agree that it is easier to ask for compute than to ask people to slow down. If we can get to the point where alignment research is primarily bottlenecked on compute, we will be in a better position.

However, it doesn’t seem clear to me that we’ll get to a point where alignment is mostly compute-constrained. If we get AI assistants, I imagine the biggest constraint will not be “do we have enough compute?” but rather “are we able to prevent further capabilities advances that could cause an x-risk (while we wait for our AI assistants to come up with alignment solutions)?”

In other words, I envision having systems that could help with alignment or capabilities, and then the main question will not be “how much compute do we have for alignment?” but rather “how much do we use these systems for capabilities progress vs. alignment progress?” And unfortunately, in this world, we do still need to ask people to stop doing something they are excited about (i.e., stop doing AI-assisted capabilities work).


I first want to reiterate my appreciation to Jan and the OpenAI team for posting their thoughts about alignment publicly. In addition to the fact that this allows the alignment community to engage with OpenAI’s plan, it also sets a norm that I hope other labs will adopt. I’m excited to see if DeepMind, Anthropic, and other AGI labs follow this path.

I think the universe is asking us to have an unusually high level of caution toward problems that are unusually speculative and that rely on claims that are often unusually difficult to evaluate using common scientific techniques. I was impressed that Jan and his colleagues recognized many of these difficulties in Jan’s post and described several possible limitations of the current plan. I hope Jan shares more of his thinking (and OpenAI’s thinking) as the plan evolves.

Crossposted to EA Forum