A central AI alignment problem: capabilities generalization, and the sharp left turn
(This post was factored out of a larger post that I (Nate Soares) wrote, with help from Rob Bensinger, who also rearranged some pieces and added some text to smooth things out. I’m not terribly happy with it, but am posting it anyway (or, well, having Rob post it on my behalf while I travel) on the theory that it’s better than nothing.)
I expect navigating the acute risk period to be tricky for our civilization, for a number of reasons. Success looks to me to require clearing a variety of technical, sociopolitical, and moral hurdles, and while in principle sufficient mastery of solutions to the technical problems might substitute for solutions to the sociopolitical and other problems, it nevertheless looks to me like we need a lot of things to go right.
Some sub-problems look harder to me than others. For instance, people are still regularly surprised when I tell them that I think the hard bits are much more technical than moral: it looks to me like figuring out how to aim an AGI at all is harder than figuring out where to aim it.[1]
Within the list of technical obstacles, there are some that strike me as more central than others, like “figure out how to aim optimization”. And a big reason why I’m currently fairly pessimistic about humanity’s odds is that it seems to me like almost nobody is focusing on the technical challenges that seem most central and unavoidable to me.
Many people wrongly believe that I’m pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That’s flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.[2]
I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it’s somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it’s somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it’s all that qualitatively different than the sorts of summits humanity has surmounted before.
It’s made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.
What undermines my hope is that nobody seems to be working on the hard bits, and I don’t currently expect most people to become convinced that they need to solve those hard bits until it’s too late.
Below, I’ll attempt to sketch out what I mean by “the hard bits” of the alignment problem. Although these look hard, I’m a believer in the capacity of humanity to solve technical problems at this level of difficulty when we put our minds to it. My concern is that I currently don’t think the field is trying to solve this problem. My hope in writing this post is to better point at the problem, with a follow-on hope that this causes new researchers entering the field to attack what seem to me to be the central challenges head-on.
Discussion of a problem
On my model, one of the most central technical challenges of alignment—and one that every viable alignment plan will probably need to grapple with—is the issue that capabilities generalize better than alignment.
My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.
And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it’s not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can’t yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don’t suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.
Some people I say this to respond with arguments like: “Surely, before a smaller team could get an AGI that can master subjects like biotech and engineering well enough to kill all humans, some other, larger entity such as a state actor will have a somewhat worse AI that can handle biotech and engineering somewhat less well, but in a way that prevents any one AGI from running away with the whole future?”
I respond with arguments like, “In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).”
Their response in turn is generally some variant of “well, natural selection wasn’t optimizing very intelligently” or “maybe humans weren’t all that sharply above evolutionary trends” or “maybe the power that let humans beat the rest of the ecosystem was simply the invention of culture, and nothing embedded in our own already-existing culture can beat us” or suchlike.
Rather than arguing further here, I’ll just say that failing to believe the hard problem exists is one surefire way to avoid tackling it.
So, flatly summarizing my point instead of arguing for it: it looks to me like there will at some point be some sort of “sharp left turn”, as systems start to work really well in domains really far beyond the environments of their training—domains that allow for significant reshaping of the world, in the way that humans reshape the world and chimps don’t. And that’s where (according to me) things start to get crazy. In particular, I think that once AI capabilities start to generalize in this particular way, it’s predictably the case that the alignment of the system will fail to generalize with it.[3]
This is slightly upstream of a couple other challenges I consider quite core and difficult to avoid, including:
- Directing a capable AGI towards an objective of your choosing.
- Ensuring that the AGI is low-impact, conservative, shutdownable, and otherwise corrigible.
These two problems appear in the strawberry problem, which Eliezer’s been pointing at for quite some time: the problem of getting an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else. The demand of cellular-level copying forces the AI to be capable; the fact that we can get it to duplicate a strawberry instead of doing some other thing demonstrates our ability to direct it; the fact that it does nothing else indicates that it’s corrigible (or really well aligned to a delicate human intuitive notion of inaction).
How is the “capabilities generalize further than alignment” problem upstream of these problems? Suppose that the fictional team OpenMind is training up a variety of AI systems, before one of them takes that sharp left turn. Suppose they’ve put the AI in lots of different video-game and simulated environments, and they’ve had good luck training it to pursue an objective that the operators described in English. “I don’t know what those MIRI folks were talking about; these systems are easy to direct; simple training suffices”, they say. At the same time, they apply various training methods, some simple and some clever, to cause the system to allow itself to be removed from various games by certain “operator-designated” characters in those games, in the name of shutdownability. And they use various techniques to prevent it from stripmining in Minecraft, in the name of low-impact. And they train it on a variety of moral dilemmas, and find that it can be trained to give correct answers to moral questions (such as “in thus-and-such a circumstance, should you poison the operator’s opponent?”) just as well as it can be trained to give correct answers to any other sort of question. “Well,” they say, “this alignment thing sure was easy. I guess we lucked out.”
Then, the system takes that sharp left turn,[4][5] and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart.
The techniques OpenMind used to train it away from the error where it convinces itself that bad situations are unlikely? Those generalize fine. The techniques they used to train it to allow the operators to shut it down? Those fall apart, and the AGI starts wanting to avoid shutdown, including wanting to deceive them if it's useful to do so.
Why does alignment fail while capabilities generalize, at least by default and in predictable practice? In large part, because good capabilities form something like an attractor well. (That’s one of the reasons to expect intelligent systems to eventually make that sharp left turn if you push them far enough, and it’s why natural selection managed to stumble into general intelligence with no understanding, foresight, or steering.)
Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure, much like the simple laws of arithmetic that abstractly underwrite a wide variety of empirical arithmetical facts about what happens when you add four people’s bags of apples together on a table and then divide the contents among two people.
But that attractor well? It’s got a free parameter. And that parameter is what the AGI is optimizing for. And there’s no analogously-strong attractor well pulling the AGI’s objectives towards your preferred objectives.
The sharp left turn? That’s your system sliding into the capabilities well. (You don’t need to fall all that far to do impressive stuff; humans are better at an enormous variety of relevant skills than chimps, but they aren’t all that lawful in an absolute sense.)
There’s no analogous alignment well to slide into.
On the contrary, sliding down the capabilities well is liable to break a bunch of your existing alignment properties.[6]
Why? Because things in the capabilities well have instrumental incentives that cut against your alignment patches. Just like how your previous arithmetic errors (such as the pebble sorters on the wrong side of the Great War of 1957) get steamrolled by the development of arithmetic, so too will your attempts to make the AGI low-impact and shutdownable ultimately (by default, and in the absence of technical solutions to core alignment problems) get steamrolled by a system that pits those reflexes / intuitions / much-more-alien-behavioral-patterns against the convergent instrumental incentive to survive the day.
Perhaps this is not convincing; perhaps to convince you we’d need to go deeper into the weeds of the various counterarguments. (Like acknowledging that humans, who can foresee these difficulties and adjust their training procedures accordingly, have a better chance than natural selection did, and then discussing why current proposals do not seem to me to be hopeful.) But hopefully you can at least, in reading this document, develop a basic understanding of my position.
Stating it again, in summary: my position is that capabilities generalize further than alignment (once capabilities start to generalize real well (which is a thing I predict will happen)). And this, by default, ruins your ability to direct the AGI (that has slipped down the capabilities well), and breaks whatever constraints you were hoping would keep it corrigible. And addressing the problem looks like finding some way to either keep your system aligned through that sharp left turn, or render it aligned afterwards.
In an upcoming post (edit: here), I’ll say more about how it looks to me like ~nobody is working on this particular hard problem, by briefly reviewing a variety of current alignment research proposals. In short, I think that the field’s current range of approaches nearly all assume this problem away, or direct their attention elsewhere.
[1] Furthermore, figuring out where to aim it looks to me like more of a technical problem than a moral problem. Attempting to manually specify the nature of goodness is a doomed endeavor, of course, but that’s fine, because we can instead specify processes for figuring out (the coherent extrapolation of) what humans value. Which still looks prohibitively difficult as a goal to give humanity’s first AGI (which I expect to be deployed under significant time pressure), mind you, and I further recommend aiming humanity’s first AGI systems at simple limited goals that end the acute risk period and then cede stewardship of the future to some process that can reliably do the “aim minds towards the right thing” thing. So today’s alignment problems are a few steps removed from tricky moral questions, on my models.
[2] While we’re at it: I think trying to get provable safety guarantees about our AGI systems is silly, and I’m pretty happy to follow Eliezer in calling an AGI “safe” if it has a <50% chance of killing >1B people. Also, I think there’s a very large chance of AGI killing us, and I thoroughly disclaim the argument that even if the probability is tiny, we should work on it anyway because the stakes are high.
[3] Note that this is consistent with findings like “large language models perform just as well on moral dilemmas as they perform on non-moral ones”; to find this reassuring is to misunderstand the problem. Chimps have an easier time than squirrels following and learning from human cues. Yet this fact doesn’t particularly mean that enhanced chimps are more likely than enhanced squirrels to remove their hunger drives, once they understand inclusive genetic fitness and are able to eat purely for reasons of fitness maximization. Pre-left-turn AIs will get better at various ‘alignment’ metrics, in ways that I expect to build a false sense of security, without addressing the lurking difficulties.
[4] “What do you mean ‘it takes a sharp left turn’? Are you talking about recursive self-improvement? I thought you said somewhere else that you don’t think recursive self-improvement is necessarily going to play a central role before the extinction of humanity?” I’m not talking about recursive self-improvement. That’s one way to take a sharp left turn, and it could happen, but note that humans have neither the understanding nor control over their own minds to recursively self-improve, and we outstrip the rest of the animals pretty handily. I’m talking about something more like “intelligence that is general enough to be dangerous”, the sort of thing that humans have and chimps don’t.
[5] “Hold on, isn’t this unfalsifiable? Aren’t you saying that you’re going to continue believing that alignment is hard, even as we get evidence that it’s easy?” Well, I contend that “GPT can learn to answer moral questions just as well as it can learn to answer other questions” is not much evidence either way about the difficulty of alignment. I’m not saying we’ll get evidence that I’ll ignore; I’m naming in advance some things that I wouldn’t consider negative evidence (partially in hopes that I can refer back to this post when people crow later and request an update). But, yes, my model does have the inconvenient property that people who are skeptical now are liable to remain skeptical until it’s too late, because most of the evidence I expect to give us advance warning about the nature of the problem is evidence that we’ve already seen. I assure you that I do not consider this property to be convenient.
As for things that could convince me otherwise: technical understanding of intelligence could undermine my “sharp left turn” model. I could also imagine observing some ephemeral hopefully-I’ll-know-it-when-I-see-it capabilities thresholds, without any sharp left turns, that might update me. (Short of “full superintelligence without a sharp left turn”, which would obviously convince me but comes too late in the game to shift my attention.)
[6] To use my overly-detailed evocative example from earlier: Humans aren’t tempted to rewire our own brains so that we stop liking good meals for the sake of good meals, and start eating only insofar as we know we have to eat to reproduce (or, rather, maximize inclusive genetic fitness) (after upgrading the rest of our minds such that that sort of calculation doesn’t drag down the rest of the fitness maximization). The cleverer humans are chomping at the bit to have their beliefs be more accurate, but they’re not chomping at the bit to replace all these mere-shallow-correlates of inclusive genetic fitness with explicit maximization. So too with other minds, at least by default: that which makes them generally intelligent, does not make them motivated by your objectives.
Sharp Left Turn: a more important problem (and a more specific threat model) than people usually think
The sharp left turn is not a simple observation that we’ve seen capabilities generalise more than alignment. As I understand it, it is a more mechanistic model, held by some people at MIRI, of the dynamics that might produce systems with generalised capabilities but not generalised alignment.
Many times over the past year, I’ve been surprised by people in the field who’ve read Nate’s post but somehow completely missed the part where it talks about the specific dynamics that lead to alignment properties breaking as capabilities generalise. To fulfil my reviewing duty, and to have a place to point people to, I’ll try to write down some related intuitions that I shared throughout 2023 when trying to convey what the sharp left turn problem is about.
For example, imagine training a neural network with RL. For a while during training, the network might be implementing a fuzzy collection of algorithms and various heuristics that together kind of optimise for some goals. The gradient strongly points towards greater capabilities. Some of these algorithms and heuristics are more useful than others for the task the network is being evaluated on; those will persist more, and what the network is doing as a whole will come to look a bit more like what its most helpful parts are doing.
Some of these algorithms and heuristics might be more agentic and contribute more to long-term goal achievement than others. Since being better at achieving goals correlates with greater performance, the network as a whole becomes more capable of achieving goals. Or maybe the transition that leads to capabilities generalisation is more akin to grokking: even while the solution is still fuzzy, the distant, more coherent agent implementations might already be visible to the gradient, and at some point there might be a switch from a fuzzy collection of things that together kind of optimise for some goals into a coherent agent optimising for some goals.
In any case, there’s this strong gradient pointing towards capabilities generalisation.
The issue is that a more coherent and more agentic solution might perform better while having goals different from the ones the fuzzier solution had been pursuing. The goal-contents of the coherent agent are stored differently from how the fuzzier solution stored the stuff it had kind of optimised for. This means the gradient points towards an architecture that implements a more general and coherent agent, but not towards an agent that shares the current fuzzy solution’s goals; the alignment properties of the current fuzzy solution don’t constrain the goals of the more coherent agent the gradient points towards.
It is also likely that the components of the fuzzy solution are themselves under optimisation pressure, which means the whole system grows in the direction of the components that can outcompete the others. If a component is slightly better at agency, at situational awareness, and so on, each optimisation step makes the whole system look slightly more like that component. The goals these components end up with could be quite different from what they, together, had been kind of optimising for; so the whole system changes and grows towards the parts of it that have different goals. At the point where some parts of the fuzzy solution are close to being generally smart and agentic, they may get increasingly smart and agentic, causing the whole system to transform into something with more general capabilities, without the gradient also pointing towards preserving the system’s goals/alignment properties.
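To make the selection dynamic concrete, here is a deliberately crude toy sketch (an illustration of my own, not anything from Nate’s post, and not a model of gradient descent or real training): a “policy” that is just a weighted mixture of components, each with a made-up capability score and an arbitrary goal direction, where updates reweight components by task performance alone.

```python
import numpy as np

# Crude toy model (illustration only): the "policy" is a weighted mixture of
# components. Each component has a scalar capability and an arbitrary 2-D
# "goal direction". Training reweights components by task performance, which
# here depends on capability only -- nothing in the update cares about which
# goal a component carries.

rng = np.random.default_rng(0)
n_components = 50
capability = rng.uniform(0.0, 1.0, n_components)       # task competence of each component
goals = rng.normal(size=(n_components, 2))              # each component's goal direction
goals /= np.linalg.norm(goals, axis=1, keepdims=True)
weights = np.ones(n_components) / n_components          # start as an even "fuzzy" mixture

initial_capability = float(weights @ capability)
initial_goal = weights @ goals                           # the mixture's initial effective goal

for _ in range(200):
    performance = np.exp(5.0 * capability)               # selection pressure rewards capability only
    weights = weights * performance
    weights /= weights.sum()                              # renormalise after each update

final_goal = weights @ goals
cosine = float(initial_goal @ final_goal /
               (np.linalg.norm(initial_goal) * np.linalg.norm(final_goal)))

print(f"mixture capability: {initial_capability:.2f} -> {float(weights @ capability):.2f}")
print(f"cosine(initial effective goal, final effective goal): {cosine:.2f}")
print(f"weight on the single most capable component: {weights.max():.2f}")
```

The only point of the sketch is that reweighting by performance is indifferent to whichever goal happens to be attached to the winning components: the mixture’s capability climbs, while its effective goal drifts towards the arbitrary goal of the most capable component. Preserving the original effective goal would require a separate pressure that the update rule simply doesn’t contain.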
I haven’t worked on this problem and don’t understand it well; but I think it is a real and important problem, and so I’m sad that many haven’t read this post, or only skimmed it, or read it but still didn’t understand what it’s talking about. It could be that the problem is hard to communicate (maybe intuitions around optimisation are non-native to many?); it could be that not enough resources were spent on optimising the post for communicating the problem well; it could be that the post tried hard not to communicate something related; or it could be that, for a general LessWrong reader, it’s not a well-written post.
Even if this post failed to communicate its ideas to its target audience, I still believe it is one of the most important LessWrong posts in 2022 and contributed something new and important to the core of our understanding of the AI alignment problem.