Ten Levels of AI Alignment Difficulty
Chris Olah recently released a tweet thread describing how the Anthropic team thinks about AI alignment difficulty. On this view, there is a spectrum of possible scenarios ranging from ‘alignment is very easy’ to ‘alignment is impossible’, and we can frame AI alignment research as a process of increasing the probability of beneficial outcomes by progressively addressing these scenarios. I think this framing is really useful, and here I have expanded on it by providing a more detailed scale of AI alignment difficulty and explaining some considerations that arise from it.
The discourse around AI safety is dominated by detailed conceptions of potential AI systems and their failure modes, along with ways to ensure their safety. This article by the DeepMind safety team provides an overview of some of these failure modes. I believe we can understand these various threat models through the lens of 'alignment difficulty': sources of AI misalignment can be sorted from easy to address to very hard to address, and technical AI safety interventions can be matched up with the specific failure-mode scenarios they are meant to mitigate. Making our uncertainty over alignment difficulty explicit also makes some debates between alignment researchers easier to understand.
An easier scenario could involve AI models generalising and learning goals in ways that fit with common sense. For example, it could be the case that LLMs of any level of complexity are best understood as generative frameworks over potential writers, with Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI (CAI) selecting only among potential writers. This is sometimes called ‘alignment by default’.
A hard scenario could look like that outlined in ‘Deep Deceptiveness’, where systems rapidly and unpredictably generalise in ways that quickly obsolete previous alignment techniques, and they also learn deceptive reward-hacking strategies that look superficially identical to good behaviour according to external evaluations, red-teaming, adversarial testing or interpretability examinations.
When addressing the spectrum of alignment difficulty, we should examine each segment separately.
If we assume that transformative AI will be produced, then the misuse risk associated with aligned transformative AI does not depend on how difficult alignment is. Therefore, misuse risk is a relatively bigger problem the easier AI alignment is.
In easy scenarios, more resources should therefore be allocated to issues like structural risk, economic implications, misuse, and geopolitical problems. On the 'harder' end of easy, where RLHF-trained systems typically end up honestly and accurately pursuing oversimplified proxies for what we want, like 'improve reported life satisfaction' or 'raise the stock price of company X', we also have to worry about scenarios like Production Web or What Failure Looks Like part 1, which require a mix of technical and governance interventions to address.
Intermediate scenarios are cases where behavioural safety isn’t good enough and the easiest ways to produce Transformative AI result in dangerous deceptive misalignment. This is when systems work against our interests but pretend to be useful and safe. This scenario requires us to push harder on alignment work and explore promising strategies like scalable oversight, AI assistance on alignment research and interpretability-based oversight processes. We should also focus on governance interventions to ensure the leading projects have the time they need to actually implement these solutions and then use them (in conjunction with governments and civil society) to change the overall strategic landscape and eliminate the risk of misaligned AI.
In contrast, if alignment is as hard as pessimistic scenarios suggest, intense research effort in the coming years to decades may not be enough to make us confident that Transformative AI can be developed safely. Alignment being very hard would call for robust testing and interpretability techniques to be applied to frontier systems. This would reduce uncertainty and, if the pessimistic scenario is in fact true, demonstrate this and build the motivation to stop progress towards Transformative AI altogether.
If we knew confidently whether we were in an optimistic or pessimistic scenario, our available options would be far simpler. But strategies that are strongly beneficial, even necessary, for successful alignment in an easy scenario are dangerous and harmful in a hard scenario, and in the extreme could cause an existential catastrophe. This makes figuring out what to do much more difficult.
Here I will present a scale of increasing AI alignment difficulty, with ten levels corresponding to what techniques are sufficient to ensure sufficiently powerful (i.e. Transformative) AI systems are aligned. I will later get a bit more precise about what this scale is measuring.
At each level, I will describe what techniques are sufficient for the alignment of TAI, and then describe roughly what that scenario entails about how TAI behaves, and then list the main failure modes at that level.
This scale is ordinal, and the difficulty of the last few steps might increase disproportionately compared to the difficulty of the first few. My rough perception is that there’s a big jump between 5 and 6, and another between 7 and 8.
The ordering is my own attempt at placing these in terms of what techniques I expect supersede previous techniques (i.e. solve all of the problems those techniques solve and more), but I’m open to correction here. I don’t know how technically difficult each will turn out to be, and it’s possible that e.g., interpretability research will turn out to be much easier than getting scalable oversight to work, even though I think interpretability-based oversight is more powerful than oversight that doesn’t use interpretability.
I have done my best to figure out what failure modes each technique can and can’t address, but there doesn’t seem to be much consistency in views about this (which techniques can and can’t address which specific failure modes is another thing we’re uncertain about).
Later AI alignment techniques include and build upon the previous steps, e.g. scalable oversight will probably use component AIs that have also been fine tuned with RLHF. This cumulative nature of progress means that advancing one level often requires addressing the challenges identified at preceding levels.
The current state of the art in terms of what techniques are in fact applied to cutting-edge models is between a 2 and a 3, with substantial research effort also applied to 4-6.
Although this scale is about the alignment of transformative AI, not current AI, it seems plausible that more sophisticated AIs are harder to align than less sophisticated ones. For example, current LLMs like GPT-4 are plausibly between a 2 and a 3 in terms of what's required for alignment, and fortunately that's also where the current state of the art on alignment is. However, a successor such as GPT-4.5 might 'move up' the scale and require level 5 techniques for successful alignment.
Therefore you could view the overall alignment problem as a process of trying to make sure the current state of the art on alignment always stays ahead of the frontier of what is required.
In the table, the green stands for the ‘easy’ cases that Chris Olah talked about, where unusually reckless actors, misuse and failing on technically easy problems because of institutional mistakes are the main concerns. Orange and Red correspond to ‘intermediate’ cases where increasingly hard technical problems must be solved and black corresponds to cases where alignment is probably impossible in practice, if not in principle.
This scale is similar to, and can be compared with, the idea that more and more sophisticated AI systems exert increasing 'adversarial pressure' against alignment attempts, as described here.
|Difficulty Level||Alignment technique X is sufficient||Description||Key Sources of risk|
|1||(Strong) Alignment by Default||As we scale up AI models without instructing or training them for specific risky behaviour or imposing problematic and clearly bad goals (like ‘unconditionally make money’), they do not pose significant risks. Even superhuman systems basically do the commonsense version of what external rewards (if RL) or language instructions (if LLM) imply.|
Misuse and/or recklessness with training objectives.
RL of powerful models towards badly specified or antisocial objectives is still possible, including accidentally through poor oversight, recklessness or structural factors.
|2||Reinforcement Learning from Human Feedback|
We need to ensure that the AI behaves well even in edge cases by guiding it more carefully using human feedback in a wide range of situations, not just crude instructions or hand-specified reward functions.
When done diligently, RLHF fine tuning works. One reason to think alignment will be this easy is if systems are naturally inductively biased towards honesty and representing the goals humans give them. In that case, they will tend to learn simple honest and obedient strategies even if these are not the optimal policy to maximise reward.
Even if human feedback is sufficient to ensure models roughly do what their overseer intends, systems widely deployed in the economy may still for structural reasons end up being trained to pursue crude and antisocial proxies that don’t capture what we really want.
Misspecified rewards / ‘outer misalignment’ / structural failures where systems don’t learn adversarial policies but do learn to pursue overly crude and clearly underspecified versions of what we want, e.g. the production web or WFLL1.
|3||Constitutional AI||Human feedback is an insufficiently clear and rich signal with which to fine tune AIs. It must be augmented with AI-provided simulations of human feedback to cover edge cases. This is ‘reinforcement learning from AI feedback’.|
Behavioural Safety is Insufficient
Past this point, we assume, following Ajeya Cotra, that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human-simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that such a system would still under-perform on sufficiently superhuman behavioural evaluations.)
|4||Scalable Oversight|
We need methods to ensure that humanlike oversight of AIs continues even for problems unaided humans can't supervise. Therefore, unlike Constitutional AI, these methods must get AIs to apply humanlike supervision more effectively than humans can. Some strategies along these lines are discussed here.
There are many sub-approaches here, outlined for example in the ELK report.
|5||Scalable Oversight with AI Research Assistance|
At this stage, we are entrusting the AIs aligned using techniques like those in 1-4 to perform research on better methods of oversight and to augment human understanding. We are then using those research outputs to improve our oversight processes or improve the overseer AI’s understanding of the behaviour of the AI in training.
There are many potential approaches here, including techniques like IDA and debate, which are discussed here.
|6||Oversight employing Advanced Interpretability Techniques|
Conceptual or Mechanistic Interpretability tools are used as part of the (AI augmented) oversight process.
Processes internal to the AIs that seem to correlate with deceptiveness can be detected and penalised by the AI or Human+AI overseers developed in 4 and 5.
The ELK report discusses some particular approaches to this, such as penalising correlates of deceptive thinking (like excessive computation time spent on simple questions).
|7||Experiments with Potentially Catastrophic Systems to Understand Misalignment|
At this level, even when we use the techniques in 2-6, AI systems routinely defeat oversight and continue unwanted behaviour. They do this by altering their internal processes to avoid detection by interpretability tools, and by ‘playing the training game’ to seem behaviourally safe. Crucially, though, it is still possible to contain these systems.
Therefore we can conduct (potentially dangerous) experiments with these AI systems to understand how they might generalise post-deployment. Here we would employ the interpretability and oversight tools in 4, 5 and 6 and attempt to elicit misgeneralisation and reward-hacking behaviour from AIs. But we wouldn't try to remove those behaviours using oversight-based training.
Instead, using techniques like red-teaming we try to understand the basis on which they occur, learning practical tricks and new theoretical insights about how and why misalignment arises, allowing us to develop new mitigations.
Sharp Left Turn
Past this point, we assume, following Nate Soares, that when systems become capable enough there is a sharp left turn. This is where AIs suddenly become much more strategically aware and able to consider a very wide range of plans, such that they quickly become able to avoid any attempts to align their behaviour.
|8||Theoretical Research for Understanding Misalignment|
All of the problematic behaviours either emerge too quickly for us to respond, or emerge only once systems are deployed, so experiments like those in 7 are useless or cannot be performed safely.
For this capability level and following Soares’ view, we need to have an understanding in advance of how AI models generalise their learning. Hopefully, these foundational insights will inspire new techniques for alignment.
There is a ‘sharp left turn’ where a system suddenly gains new concepts and much greater intelligence and generality, obsoleting previous alignment techniques.
Systems post-SLT are superintelligent and cannot be experimented on safely or contained.
|9||Coming up with a Fundamentally New Paradigm (abandoning deep learning)||According to this view, deep learning systems cannot be aligned even if you change fundamental things about them.|
|10||Impossible||Alignment of a superintelligent system is impossible in principle.||Alignment is theoretically impossible, incoherent or similar.|
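As a concrete illustration of the level-6 idea above (penalising internal correlates of deceptive thinking, such as excessive computation time spent on simple questions), here is a minimal toy sketch. All names and numbers are hypothetical illustrations, not drawn from the ELK report itself:

```python
# Toy sketch of penalising a correlate of deception during oversight:
# a model that spends far more computation on a question than its
# difficulty warrants pays an extra loss term.

def regularised_loss(task_loss: float,
                     compute_steps: int,
                     question_difficulty: float,
                     penalty_weight: float = 0.1) -> float:
    """Task loss plus a penalty for compute spent beyond what the
    question's difficulty predicts is needed (hypothetical model)."""
    expected_steps = 10 * question_difficulty  # crude difficulty -> compute model
    excess = max(0.0, compute_steps - expected_steps)
    return task_loss + penalty_weight * excess

# A model that spends 50 steps on an easy (difficulty 0.2) question is
# penalised relative to one that answers directly in 2 steps.
direct = regularised_loss(task_loss=1.0, compute_steps=2, question_difficulty=0.2)
suspicious = regularised_loss(task_loss=1.0, compute_steps=50, question_difficulty=0.2)
assert suspicious > direct
```

The design choice here is that the overseer never needs to read the model's thoughts directly; it only needs a measurable proxy that correlates with deceptive cognition, which is exactly why such penalties can fail at level 7, where systems learn to suppress the proxy while keeping the behaviour.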
Defining ‘Alignment Difficulty’
Chris Olah’s original thread was quite informal, but I thought it would be useful to flesh this model out in more detail. His original ‘how difficult is alignment’ scale described various technical or mathematical projects, including building a working steam engine or solving P vs. NP, which represent an informal scale measuring, roughly, ‘how hard the alignment problem is overall’. But we can be more precise about what this scale measures. And we can place existing approaches to alignment along this alignment difficulty scale.
So, to be more precise about what the scale measures, I define alignment difficulty as the degree of complexity and effort required to successfully guide an AI system’s behaviour, objectives, and decisions to conform with human values and expectations, well enough to effectively mitigate the risks posed by unaligned, potentially dangerous AI systems.
What ‘well enough’ means
Scaling techniques indefinitely
One perspective on ‘well enough’ is that a technique should scale indefinitely, implying that it will continue to be effective regardless of the system’s intelligence. For example, if RLHF always works on arbitrarily powerful systems, then RLHF is sufficient. This supersedes other criteria: if a technique always works, it will also work for a system powerful enough to help mitigate any risk.
Techniques which produce positively transformative AI
An alternative perspective is that a technique works ‘well enough’ if it can robustly align AI systems that are powerful enough to be deployed safely (maybe in a research lab, maybe in the world as a whole), and that these AI systems are transformative in ways that reduce the overall risk from unaligned AI. Call such systems ‘positively transformative AI’.
Positively transformative AI systems could reduce the overall risk from AI by: preventing the construction of a more dangerous AI; changing something about how global governance works; instituting surveillance or oversight mechanisms widely; rapidly and safely performing alignment research or other kinds of technical research; greatly improving cyberdefense; persuasively exposing misaligned behaviour in other AIs and demonstrating alignment solutions, and through many other actions that incrementally reduce risk.
One common way of imagining this process is that an aligned AI could perform a ‘pivotal act’ that solves AI existential safety in one swift stroke. However, it is important to consider this much wider range of ways in which one or several transformative AI systems could reduce the total risk from unaligned transformative AI.
As well as disagreeing about how hard it is to align an AI of a certain capability, people disagree on how capable an aligned AI must be to be positively transformative in some way. For example, Jan Leike et al. think that there is no known indefinitely scalable solution to the alignment problem, but that we won’t need an indefinitely scalable alignment technique, just one that is good enough to align a system that can automate alignment research.
Therefore, the scale should be seen as covering increasingly sophisticated techniques which work to align more and more powerful and/or adversarial systems.
Alignment vs. Capabilities
There’s a common debate in the alignment community about what counts as ‘real alignment research’ vs just advancing capabilities, see e.g. here for one example of the debate around RLHF, but similar debates exist around more sophisticated methods like interpretability-based oversight.
This scale helps us understand where this debate comes from.
Any technique on this scale which is insufficient to solve TAI alignment will (if it succeeds) make a system appear to be ‘aligned’ in the short term. Many techniques, such as RLHF, also make a system more practically useful and therefore more commercially viable and more capable. Spending time on these alignment techniques also trades off with other kinds of alignment work.
Given our uncertainty about which techniques are and aren't sufficient, the boundary between 'just capabilities research' and 'alignment research' isn't exact. Advancing an alignment technique that turns out to be insufficient at best diverts resources away from more useful work, and at worst advances AI capabilities while concealing misaligned behaviour.
However, crucially, since we don’t know where on the scale the difficulty lies, we don’t know whether working on a given technique is counterproductive or not. Additionally, as we go further along the scale it becomes harder and harder to make progress. RLHF is already widely used and applied to cutting-edge systems today, with constitutional AI not far away (2-3), whereas the hopes for coming up with a new AI paradigm that’s fundamentally safer than deep learning (9) seem pretty thin.
Therefore, we get the phenomenon of, “everyone more optimistic about alignment than me is just working on capabilities.”
I think that acknowledging our uncertainty about alignment difficulty takes some of the force out of arguments that e.g., constitutional AI research is net-negative. This kind of research is beneficial in some worlds, even though it creates negative externalities and could make AIs more dangerous in worlds where alignment is harder.
Chris Olah’s original point was that given this uncertainty we should aim to push the frontier further than it is already, so the optimal strategy is to promote any method which is past the ‘present margin of safety research’, i.e. would counterfactually not get worked on and works to reduce the risk in some scenarios.
In summary, I have:
- Expanded upon Chris Olah's framing of alignment difficulty as a spectrum over which we have some uncertainty about which techniques will and won't be sufficient to ensure alignment.
- Attempted to line up specific failure scenarios with the alignment techniques meant to mitigate them (e.g. RLHF on the right objective solves the problem that simple reward functions can't capture what we want, oversight solves the problem that human oversight won't be good enough to govern superhuman systems, interpretability solves the problem that even perfectly behaviourally safe systems could be deceptive).
- Explained what the scale of alignment difficulty is measuring in more detail, i.e. the degree of complexity and effort required to successfully guide a TAI system's behaviour, objectives, and decisions to conform with human values and expectations, well enough to effectively mitigate the risks posed by unaligned TAI systems.
- Outlined some of the difficulties arising from the fact that we don't know what techniques are sufficient, and insufficient techniques plausibly just accelerate AI capabilities and so actually increase rather than reduce risk.
I’d be interested to know what people think of my attempted ordering of alignment techniques by increasing sophistication and matching them up with the failure modes they’re meant to address. I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.
Oversight which employs interpretability tools would catch failure modes that oversight without interpretability tools wouldn’t, i.e. failure modes where, before deployment, a system seems safe according to superhuman behavioural oversight, but is actually being deceptive and power seeking.
Systems that aren’t being deceptive in a strategically aware way might be misaligned in subtler ways that are still very difficult to deal with for strategic or governance reasons (e.g. strong competitive economic pressures to build systems that pursue legible goals), so you might object to calling this an ‘easy’ problem just because in this scenario RLHF doesn’t lead to deceptive strategically aware agents. However, this is a scale of the difficulty of technical AI alignment, not how difficult AI existential safety is overall.
Ajeya Cotra’s report on AI takeover lists this as a potential solution: ‘We could provide training to human evaluators to make them less susceptible to manipulative and dishonest tactics, and instruct them to give reward primarily or entirely based on whether the model followed honest procedures rather than whether it got good results. We could try to use models themselves to help explain what kinds of tactics are more and less likely to be manipulative’ - as an example of something that might do better than behavioral safety.
Therefore, I do not take her report to be arguing that external behavioural evaluation will fail to eliminate deception. Instead, I interpret her as arguing that any human-level external behavioural evaluation won’t work but that superior oversight processes can work to eliminate deception. However, even superhuman behavioural oversight is still a kind of ‘behavioral safety’.
It is essential to distinguish the sharp left turn from very fast AI progress due to e.g., AIs contributing a continuously but rapidly increasing fraction of work to AI research efforts. While fast progress could pose substantial governance problems, it wouldn’t mean that oversight-based approaches to alignment fail.
If progress is fast enough but continuous, then we might have a takeoff that looks discontinuous and almost identical to the sharp left turn from the perspective of the wider world outside AI labs. However, within the labs it would be quite different, because running oversight strategies where more and more powerful AIs oversee each other would still be feasible.
Take any person’s view of how difficult alignment is, accounting for their uncertainty over this question. It could be possible to model the expected benefit of putting resources into a given alignment project, knowing that it could help to solve the problem, but it could also make the problem worse if it merely produces systems which appear safe. Additionally, this modelling has to account for how putting resources into this alignment solution takes away resources from other solutions (the negative externality). Would it be productive to try to model the optimal allocation of resources in this way, and what would the result of this modelling be?
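The modelling exercise described above can be sketched very simply. This is a minimal illustration, with all probabilities and payoffs invented for the example: a project helps in worlds where its technique is sufficient, hurts (by masking misalignment and advancing capabilities) in worlds where it is not, and always carries an opportunity cost:

```python
# Toy expected-value model of funding an alignment project, given a
# probability distribution over where on the difficulty scale we live.

def expected_net_benefit(p_difficulty: dict[int, float],
                         technique_level: int,
                         benefit_if_sufficient: float,
                         harm_if_insufficient: float,
                         opportunity_cost: float) -> float:
    """Sum over difficulty levels: benefit in worlds where the technique
    suffices, harm in worlds where it does not, minus the resources
    diverted from other alignment work."""
    value = 0.0
    for level, prob in p_difficulty.items():
        if technique_level >= level:
            value += prob * benefit_if_sufficient
        else:
            value -= prob * harm_if_insufficient
    return value - opportunity_cost

# Illustrative beliefs: 60% mass on difficulty <= 3, 40% above.
# RLHF/CAI-style work (levels 2-3) is net positive here only because
# the assumed harm in hard worlds is modest relative to the benefit.
p = {2: 0.3, 3: 0.3, 6: 0.3, 9: 0.1}
print(expected_net_benefit(p, technique_level=3, benefit_if_sufficient=10,
                           harm_if_insufficient=5, opportunity_cost=1))  # → 3.0
```

Even this toy version makes the footnote's point concrete: the sign of the answer flips as the probability mass shifts towards harder worlds or as the assumed harm from merely-apparent safety grows, which is why people with different difficulty estimates disagree so sharply about the same research.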