Why ASI Alignment Is Hard (an overview)

When I talk to friends, colleagues, and internet strangers about the risk of ASI takeover, I find that many people have misconceptions about where the dangers come from or how they might be mitigated. A lot of these misconceptions are rooted in misunderstanding how today’s AI systems work and are developed. This article is an attempt to explain the risk of ASI misalignment in a way that makes the dangers and difficulties clear, rooted in examples from contemporary AI tools. While I am far from an expert, I don’t see anyone framing the topic in quite the way that I do here, and I hope it will be of service.

I hope readers familiar with the subject will appreciate the specific way I’ve organized a wide variety of concerns and approaches and find the body of the essay easy to skim. Wherever I’m missing or misunderstanding key ideas, though, I welcome you to point them out. Oftentimes the best way to learn more is to be wrong in public!

I hope readers new to the subject will find this presentation useful in orienting them as they venture further into subtopics of interest. (While the topic won’t be new to many people on LessWrong, I plan to link to this post from elsewhere too.) Any paragraph below could be the beginning of a much deeper exploration – just paste it into your favorite LLM and ask for elaboration. I’ve also provided some relevant links along the way, and links to a few other big picture overviews at the end.

I take it as a premise that superintelligence is possible, even if it requires some technological breakthrough beyond the current paradigm. Many intelligent, well-funded people are hard at work trying to bring these systems about. Their aim is not just to make current AI tools smarter, but to build AI tools that can act on long time scales toward ambitious goals, with broad and adaptable skillsets. If they succeed, we risk unaligned ASI quickly and effectively achieving goals antithetical to human existence.

You might find it tempting to say that artificial superintelligence is impossible, and you might even be right. I’d rather not bet human existence on that guess. What percent chance of ASI coming about in the next 50 years would justify active research into ensuring any ASI would be aligned with humanity? Whatever your threshold, I suspect the true probability exceeds it.

Onto the material.

What makes this hard?

Ensuring an artificial superintelligence behaves in the ways we want it to, and not in the ways we don’t, is hard for several reasons.

We can’t specify exactly what we want an ASI to be.

Because of…

  • Misalignment between what we want and what’s best for us

  • Misalignment between what we say we want and what we actually mean

  • Misalignment between different things that we want

(But those aren’t even the real problems)

We can’t build exactly what we specify.

Because of…

  • Misalignment between our intentions and our training mechanisms

  • Misalignment between the deployment data/​environment and the training data/​environment

  • Misalignment between our intentions and the lessons learned in training

We can’t know exactly what we’ve built.

Because…

  • The behavior of AI models is unpredictable

  • Their rationales are opaque, even to themselves

  • Superintelligent ones may intentionally deceive us

If we haven’t built exactly what we want, we’ve probably invited disaster.

Because of…

  • Optimization dangers

  • Instrumental convergence on bad-for-humans goals

  • Incorrigible pursuit of the wrong goals

And if we don’t get it right the first time, we may not get a second chance.

Let’s consider each point in more detail. That summary will also serve as our table of contents.

We can’t specify exactly what we want an ASI to be

Philosophers have argued for millennia about what exactly would be “good for humanity.” If we have to articulate for an ASI exactly what its goals should be, and exactly what ethical boundaries it should maintain in pursuing those goals, there’s no reason to expect a consensus. But any philosophical error or oversight has the potential to be quite dangerous.

As toy examples, asking an ASI to end all human suffering might lead to a painless and unexpected death for everyone, while asking an ASI to make humans happy might lead to mass forced heroin injections or “wire-heading.” If we get more abstract, like telling the ASI to “support human flourishing,” it may decide that’s best achieved by killing off everyone who isn’t living their best life or contributing to the best lives of others. So we could tell it to support human flourishing without killing anyone; would putting all the non-flourishers on one island without enough food and water count as killing them? How about just forcing heroin injections on those people, or lobotomizing them, or designing mind-control drugs way beyond the capacity of human doctors and scientists?

You might try to articulate the perfect goal and perfect moral constraints, but can you be 100% certain that there’s no way of misinterpreting you?

There are really three potential misalignments here:

  • Misalignment between what we want and what’s best for us

  • Misalignment between what we say we want and what we actually mean

  • Misalignment between different things that we want

In the end, I don’t think these misalignments create the real problem. But it’s necessary to understand what these are about and why they’re addressable in order to make the real problem clearer.

Misalignment between what we want and what’s good for us is the King Midas problem or the law of unintended consequences. Midas genuinely wanted everything he touched to turn to gold, and he got it, but he didn’t realize how bad that would be for him. Thomas Austin genuinely wanted to have free-roaming rabbits in Australia, but he didn’t consider the consequences to native plants and animals, soil erosion, and other imported livestock. We might succeed at aligning an ASI toward an outcome we desire sincerely, but with insufficient awareness of its ramifications. (See also this summary of Stuart Russell on the King Midas problem and this technical treatment of the problem).

Misalignment between what we say we want and what we actually mean is the Overly Literal Genie problem. Perhaps we ask an ASI to make people happy and it wire-heads all of humanity; it’s quite obediently doing what we said, just not what we meant. Likewise for the classic paperclip maximizer. In these scenarios, it isn’t misinterpreting us out of malice or ignorance; it is simply obeying: we have succeeded at the difficult task of developing an ASI that follows our commands, and we suffer the consequences. (See also The Genie Knows But Doesn’t Care and The Outer Alignment Problem.)

Meanwhile, misalignment between different things that we want burdens the ASI with certain impossible questions. Not only are there longstanding disagreements among philosophers about what outcomes or methods are truly desirable; even an individual human’s values are enormously complex. We want both happiness and freedom (or meaning, or whatever we lose by being wire-headed); how do we specify how much of each is enough, or which freedoms can be curtailed for the sake of whose happiness? An ASI will have to weigh innumerable moral tensions: between minimizing harm and maximizing good, between boosting human wealth and reducing ecological damage, between the moral wishes of animal rights activists and the dietary wishes of omnivores. Perhaps most saliently, it will have to balance benefit for humanity as a whole with whatever other instructions its developers give it. If we try to dictate all of the priorities specifically, we increase the risk that our dictates are misguided.

So all in all, we may be better off with an ASI that is broadly trustworthy than one which is precisely obedient, but the kind of moral judgment that makes a system trustworthy is hard to construct and verify. The complexity and ambiguity of its mandate make it all the more feasible for anti-human goals to arise during training or early deployment. (See the sections below.) That complexity may also give an ASI room, much like a human engaging in motivated reasoning, to convince itself it’s acting sufficiently beneficently toward humanity while subtly prioritizing other purposes.

Inevitably, ASI will be more aligned with some humans’ values than others, and it will have to use its superintelligence to navigate that complexity in an ethical manner. In the extreme case, though, we get a whole new failure mode: a superintelligence “aligned” with what’s good for its designers and no one else creates its own kind of dystopia. Here, imagine Grok-9 being perfectly aligned with the wellbeing of Elon Musk and no one else. That would be… unfortunate. Preventing that scenario requires solving all of the other problems mentioned here and solving the very human challenge of aligning the ASI designers’ goals with everyone else’s. I’ll keep the rest of this post focused on the technical aspects of alignment, but I recommend The AI Objectives Institute’s white paper, AI as Normal Technology, and Nick Bostrom on Open Global Investment for more on these questions of human-human alignment.

(But these aren’t really the problem)

In the past few years, some experts have become less concerned about the risks described so far, even as the public has become more aware of them. Modern AI tools can be quite good at discerning intentions from ambiguous communication, and they have the full corpus of human discourse from which to distill the kinds of things that we value or worry about. In fact, human decision-making about morality tends in practice to operate more like perception (“This seems right to me”) than precise reasoning (“This conforms with my well-defined moral philosophy”), and perception is the kind of thing AI systems are quite good at when well trained.

So we may be able to build an AI that understands not just what we said, or even what we meant, but what we should have meant. And in fact, if you ask LLMs today how they think an aligned superintelligence would act to benefit humanity, their answers are pretty impressive. (ChatGPT, Claude, Gemini) Surely an actual superintelligence would be super good at figuring out what’s best for us! Maybe we just turn on the ASI machine and say, “Be good,” and we’ll be all set.

But if that’s our strategy, even to a minor degree, we need to be supremely confident that the ASI doesn’t have hidden competing goals. And unfortunately, AIs are developed in such a way that…

We can’t build exactly what we specify

Isaac Asimov, writing about the Three Laws of Robotics, avoided mentioning how the three laws were implemented in the robots’ hardware or software. What arrangement of positronic circuits makes “a robot must not injure a human being” so compulsory? Real-life AI doesn’t have a place for storing its fundamental laws.

You can see this in contemporary conversational AIs. ChatGPT and its peers have their own three core principles—Honesty, Harmlessness, and Helpfulness—but they break them all the time: LLMs can be dishonest due to hallucination or sycophancy; they can be harmful when jailbroken, confused, or whatever happened here; and I suspect you’ve had your own experiences of them being unhelpful.

These aren’t all failures of intelligence. If you show a transcript of a chatbot being dishonest, harmful, or unhelpful back to itself, it can often recognize the error. But implementing rules for an AI to follow is hard.

The core problem is that you don’t actually “build” an AI. Unlike traditional coding, where you specify every detail of its construction, developing an AI tool (often called a model) means creating an environment in which the AI entity learns to perform the tasks given to it. With a nod to Alison Gopnik, the work is more like gardening than carpentry, and the survival of humanity might depend on the exact shade of our tomatoes.

Here’s a radically oversimplified description of typical AI model development: You build a terrible version of the thing with a lot of random noise in it, and you give it a job to do. You also create some feedback mechanism – a way to affirm or correct its performance of the job. At first, your model fails miserably every time. But every time it fails, it updates itself in response to the feedback, so that the same inputs would get better feedback next time around. You do this enough times, and it gets really really good at satisfying your feedback mechanism.
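Here’s that loop as a runnable toy. Everything in it is invented for illustration: the “model” is a single parameter trying to learn to multiply by three, which bears no resemblance to the scale of a real system, but the shape of the loop – guess, get feedback, adjust, repeat – is the same.

```python
import random

# A toy "model": one parameter w, trying to learn y = 3*x from examples.
# Real models have billions of parameters, but the loop has the same shape.
w = random.uniform(-1.0, 1.0)  # start with random noise
examples = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(100))]
learning_rate = 0.1

for epoch in range(50):
    for x, y_true in examples:
        y_guess = w * x                  # the model attempts the job
        error = y_guess - y_true         # the feedback mechanism judges the attempt
        w -= learning_rate * error * x   # the model nudges itself toward better feedback

print(f"learned w = {w:.3f} (the 'right answer' was 3.0)")
```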

The feedback mechanism can be built into the data, or it can be a simple automation, another AI, or a human being. A few illustrative examples:

  • In computer vision, you start with labeled images. Your model guesses the label on an image and gets the feedback of the actual label. This is called “Supervised Learning,” because the labels are provided by a “supervisor.”

  • In large language models (LLMs), your data set is a corpus of text. The LLM reads some amount of text, guesses the next word, and gets the feedback of what the next word really was. (I’m again oversimplifying, but this is the basic idea; there’s a toy sketch after this list.) Then it guesses the word after that, and so on. This is called “Self-Supervised Learning,” because the next word of the text provides an inherent “supervision” of the challenge.

  • In a simple game like tic-tac-toe or solving mazes, it’s easy to build an automated feedback mechanism that recognizes successful solutions. The learning step reinforces strategies which led to victory and devalues strategies which led to defeat. This is called “Reinforcement Learning.”

  • In a more complex game like Go or Chess, your feedback mechanism might assess the strength of a position on the board rather than waiting only for the win-loss data. (If that’s also an AI tool, it needs its own iterative learning process to turn win-loss outcomes into mid-game position strengths. Part of what made AlphaZero so cool is that the player and assessor were the same model, and it was still able to learn through self-play only.) This is still reinforcement learning.

  • If a human being is evaluating the AI’s outputs case by case, giving a thumbs up or thumbs down, we call it “Reinforcement Learning from Human Feedback.”

(There are a lot of other variations on this for other types of tasks. AI tools can also have multiple stages of training, and can also incorporate multiple sub-AIs trained in different ways.)
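As promised above, here’s the self-supervised case in miniature. The corpus is six words; a real LLM does the same thing over trillions of tokens, with a neural network making each guess, but the way the data supplies its own feedback is identical.

```python
# Self-supervised learning in miniature: the text provides its own "supervisor."
corpus = "the cat sat on the mat".split()

# Each training example pairs a context with the word that actually came next.
examples = [(corpus[:i], corpus[i]) for i in range(1, len(corpus))]

for context, next_word in examples:
    print(f"context: {' '.join(context):<18} target: {next_word}")
```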

This training process introduces three exciting new opportunities for misalignment:

  • Misalignment between your intentions and your feedback mechanism

  • Misalignment between the deployment data or environment and the training data or environment

  • Misalignment between your intentions and the lessons learned from feedback

Let’s take those one at a time.

Misalignment between our intentions and our training mechanisms.

This happens any time the mechanism providing feedback is miscalibrated with respect to what we’re actually trying to reinforce (or calibrated to an inexact proxy – see Goodhart’s Law).

This isn’t dissimilar from how perverse incentives can affect human learning. If a student knows what topics are on a test, they may lose the incentive to study more broadly. If testing only rewards rote memorization, students’ innate curiosity or creativity may atrophy. Like human beings, AIs get better at what is rewarded.

Let’s illustrate this with some present-day examples of feedback misalignment:

  • Google is trying to teach its search algorithm to find people the most useful results. They use metrics like how much time people spend on a page and how far into it they scroll to assess whether the page was useful. Reasonable strategy, but the result? You have to scroll past long rambling stories about a recipe before you get to the recipe itself. (There’s a toy version of this dynamic sketched in code after this list.)

  • Conversational AI tools like ChatGPT are trained in part using human feedback. But answers that flatter the human assessor can get positive feedback that isn’t actually aligned with the nominal goals. The result is personalities that are sometimes agreeable and encouraging to the point of being dishonest, unhelpful, and even harmful.

  • Related: OpenAI recently asserted that LLM hallucinations emerge from a mismatch between evaluation metrics and actual needs. If we don’t incentivize LLMs to say “I don’t know” in training, they learn to take plausible guesses instead.

  • Reward hacking: Reinforcement learners will find strategies to earn points, even if that doesn’t mean doing what we consider “winning.” A classic example is an AI trained to play a boat-racing game. It got points by crossing checkpoints, so it learned to just circle around a single checkpoint over and over again, rather than following the full racetrack.

  • Generative Adversarial Networks (GANs) work by training a ‘generator’ and a ‘discriminator’ together. The generator tries to produce realistic simulations of something (say, images of human faces) while the discriminator tries to distinguish real from fake. The generator and discriminator effectively provide feedback to one another, so they both become progressively better. But there’s no direct feedback to the generator about whether its products are good or not – it’s only measured by its ability to fool the discriminator. If the discriminator happens to develop any weird quirks in its sense of human faces, the generator will correctly learn to exploit those quirks, with no regard to what human faces really look like. (Creepy examples)
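Here’s the recipe-site dynamic from the first bullet reduced to a toy. Every function and number below is invented: a hill-climbing “publisher” is rewarded only on a time-on-page proxy, while the usefulness we actually care about is invisible to it.

```python
import random

def usefulness(padding_words):         # what we actually care about (invisible to the optimizer)
    return max(0.0, 10.0 - 0.01 * padding_words)

def time_on_page(padding_words):       # the proxy the feedback mechanism measures
    return 30.0 + 0.5 * padding_words  # more padding -> more scrolling -> more "engagement"

padding = 0
for _ in range(1000):
    candidate = padding + random.choice([-10, 10])
    if candidate >= 0 and time_on_page(candidate) >= time_on_page(padding):
        padding = candidate            # keep whatever scores at least as well on the proxy

print(f"padding words: {padding}")
print(f"proxy (time on page): {time_on_page(padding):.0f} seconds")
print(f"true usefulness: {usefulness(padding):.1f} out of 10")
```

The publisher never “decides” to be unhelpful; it just climbs the only gradient it can see.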

In each of these examples, there’s some miscalibration of the feedback mechanism, rewarding something that’s often, but not always, what we really want. Unfortunately, once there is even a little daylight between what’s being reinforced and what we actually care about, the AI we’re training will have zero interest in the latter. So think about this in relation to ASI for a moment: How would you measure and give feedback about a model’s worthiness to decide the fate of humanity?

Misalignment between the deployment data/​environment and the training data/​environment.

Sometimes you can train a tool to do exactly the job you want on exactly the data you have, with exactly the instructions you give it in training. But when you put it in a different environment, with different inputs (especially from users with unforeseen use-cases), you can’t predict how it will behave. This sometimes leads to very bad results.

This gets clearer with human beings, too. Human engineering students, always shown diagrams with two wires connecting batteries to lightbulbs, can struggle to work out how to light a bulb with a battery and a single wire. Just like excellent performance on exams doesn’t always translate to excellent practical skills, AIs don’t always generalize their learnings the way we’d want them to.

As always, the risk for ASI gets clearer when we see the dynamics at play in recent and contemporary tools. None of these examples of training-deployment misalignment are catastrophic, but they illustrate how hard alignment is to create.

  • If a Reinforcement Learning agent is trained to navigate mazes where the exit is always in the bottom right, it will fail in deployment with mazes that exit anywhere else. The agent fixates on a pattern in the training environment that doesn’t carry over to deployment and can’t learn a general-purpose solution. (This one is simple enough to sketch in code; see below, after the list.)

  • For the same reason, self-driving cars trained on sunny streets struggled with fog and snow until they were trained in those conditions, too.

  • This doesn’t have to be physical or visual: Models trained to predict the stock market risk overfitting to correlations that only held in the specific window of data they looked at. When the underlying dynamics change, the model’s implicit assumptions fail and it starts spitting out garbage.

  • Early image recognition software could be fooled by “adversarial examples,” images specifically designed to be recognized as one thing, despite looking to human eyes like something totally different or nothing at all.

  • And of course, image recognition software trained on too few dark-skinned faces labeled Black people as gorillas.

  • Image generation suffers from the same problems as image recognition. Tools like Dall-E or Midjourney can have a kind of gravity toward the patterns or styles most prominent in their training data. Getting them to do something subtly different can be quite hard, no matter how precisely you prompt them. (Old link, but I still have this problem today.) Of course, this can also reproduce harmful stereotypes or overcompensate with clumsily forced diversity.

  • When LLMs don’t have up-to-date information, they can insist on something their training data makes plausible, but newer data would disprove. In early 2025, ChatGPT struggled to internalize that Trump was in the White House again, occasionally even insisting that Biden had won reelection.

  • If a GAN (Generative Adversarial Network, mentioned above) is trained to produce images of dogs, it might get really good at making images of poodles only. Likewise, GANs producing anime images of humans found it easier to just crop out their hands or feet. I find that even with ChatGPT’s current image generation model, it can be very hard to get it to make images that look too different from what it’s most familiar with. This is called “Mode Collapse,” where the generator collapses down to just one mode of success. It works in training, but fails when users ask for an image of a husky or an anime human with hands.

  • A lot of LLM problems emerge from being used in ways they weren’t properly trained for. For instance, conversational AIs can give medical or legal advice that looks like the advice in their training data without being relevant to the user’s specific needs. Likewise, users treating AI like a therapist are applying the tool outside the scope of what it was actually trained to do.
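And here’s a minimal sketch of the maze example from the first bullet. The learned “policy” is just the shortcut the training distribution rewarded – head for the bottom-right corner of an open grid – so it solves every training-style maze and fails the moment the exit moves.

```python
def learned_policy(position, grid_size=10):
    """The shortcut the agent actually learned: always head for the bottom-right corner."""
    row, col = position
    return (min(row + 1, grid_size - 1), min(col + 1, grid_size - 1))

def reaches_exit(exit_cell, grid_size=10, max_steps=50):
    position = (0, 0)
    for _ in range(max_steps):
        if position == exit_cell:
            return True
        position = learned_policy(position, grid_size)
    return position == exit_cell

print("training-style maze (exit bottom-right):", reaches_exit((9, 9)))  # True
print("deployment maze (exit top-right):       ", reaches_exit((0, 9)))  # False
```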

In each of these examples, developers tried to create a training process representative of the data, environment, and uses with which the tool would be deployed. But any training process is limited in scope, and real-world use rarely stays within those limits. Some untested scenarios will fail, perhaps spectacularly.

We call the ability to perform in unexpected conditions “robustness.” We’re getting better at it over time, and there’s a lot of research about robustness underway, but there is no universal solution. Oftentimes we need cycles of iteration to catch and fix mistakes. We may not have that opportunity with a misaligned superintelligence.

So let’s think about this with reference to superintelligence holding the fate of humanity in its actuators. How confident could you ever be that its training environment and data accurately reflected the kinds of decisions it’s going to be responsible for?

Misalignment between our intentions and the lessons learned from feedback.

Even when your feedback mechanism is well calibrated to your real goals, and your training is perfectly representative of your intended deployment, you still can’t be sure what lessons the model has really learned in training.

For the most part, this becomes a problem with new use cases, as above, but there’s one other intriguing scenario: Success conditional on insufficient intelligence.

Stuart Russell writes about this in Human Compatible: we could imagine AIs learning a rough-and-ready heuristic that works really well with the limited compute available to them at the time. Even when put into deployment, the AI still performs admirably. But when we increase the computational power available to it, it can run the same thought process for longer and in greater depth, and that heuristic starts reaching perverse conclusions. The heuristic might look like “Do X if you can’t think of a good reason not to” (implicitly – it probably isn’t put into words), and the radically increased compute makes it possible to think up ‘good reasons’ in all kinds of unintended scenarios.
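Here’s a deliberately silly toy of that failure mode. Nothing in it resembles a real system – the “reasons” are just strings of words and the filter for a “good reason” is intentionally flawed – but it shows how the very same rule can flip its answer when it’s allowed to think for longer.

```python
import itertools

WORDS = ["the", "greater", "good", "requires", "temporary", "harm", "because", "outcomes"]

def looks_like_a_good_reason(reason):
    return len(reason.split()) >= 6          # a flawed proxy for "good reason": elaborateness

def candidate_reasons():
    for n in range(1, len(WORDS) + 1):
        for combo in itertools.permutations(WORDS, n):
            yield " ".join(combo)

def should_do_x(compute_budget):
    """Do X if you can't think of a good reason not to (within your compute budget)."""
    for i, reason in enumerate(candidate_reasons()):
        if i >= compute_budget:
            return True                      # couldn't think of a reason not to: do X
        if looks_like_a_good_reason(reason):
            return False                     # "found" a reason not to: don't do X
    return True

print(should_do_x(compute_budget=1_000))     # True  -- behaves as intended
print(should_do_x(compute_budget=100_000))   # False -- more compute, perverse conclusion
```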

Naturally, this is a particular risk for superintelligence. If we apply moral tests to a model at one level of intelligence, how sure can we be that it will respond in all the same ways when it can think about each test 1000x longer?

We can’t know exactly what we’ve built.

Biotech companies wish they could produce a new drug, analyze it thoroughly in some inert way, and be confident what effect it would have on our bodies and ailments. Unfortunately, the complexity of the human body is such that we have to run extensive trials to know what a medication does, and even then our knowledge is often spotty.

In much the same way, we would love to be able to produce an AI tool or model, study it under a microscope, and determine how it will act in production. A lot of the problems above would be easy to mitigate if we could recognize them immediately. Unfortunately, the behavior of these tools is unpredictable, their rationales are opaque, and in the extreme case they may actively attempt to deceive us.

Unpredictability

Unpredictability emerges because these are classic complex systems. Even when you know all of the parts, and the rules governing their interaction, it’s impossible to predict all of their behavior. We can’t even extrapolate perfectly from behavior in one context how they’ll behave in another.

This is why prompt engineering, for instance, is a bit of an artform. You have to get a feel for an LLM to steer its outputs in predictable directions. The same is true for jailbreaking (extracting information from an LLM that its developer doesn’t want you to access). There’s no way to scan an LLM and automatically discern all the ways one might jailbreak it, useful as that might be.

A superintelligence would be even harder to predict. The definition of superintelligence practically demands this: if it’s able to solve problems that we can’t, we can’t possibly predict all of its outputs. If there are scenarios where it will “go rogue” and repurpose our atoms for its own ends, there may not be any evidence of this potential in its code, weights, or early behavior.

Opacity

Opacity emerges from this same complexity. Even when we see an output, we can’t know exactly what internal process produced that output. If we could, we could use that information to predict the output next time.

Even a modern LLM doesn’t have total visibility of its own motivations! So-called “reasoning models” will write out a thinking process to arrive at better answers, but they don’t always say what they’re really thinking. Anthropic recently studied this closely and found they could influence a model’s outputs in ways that didn’t show themselves in the explicit reasoning process (called the “chain of thought”). Of course, the same has been found in non-reasoning models as well.

We call accurate disclosure of an LLM’s thinking process “faithfulness,” and there is work being done on improving it in future models. Likewise, a lot of AI Safety research goes into interpretability, which does provide some windows into an AI’s thought processes. But there’s no reason to assume either of these will be perfected before we create superintelligence.

Opacity exacerbates the problem of unpredictability. If we don’t know why a new model is performing well in training and testing, it’s hard to be sure it will perform well on the novel inputs it encounters in deployment.

Maybe a superintelligent AI will have a better understanding of its own inner workings and communicate them faithfully. But so far, there’s no evidence that faithfulness increases with intelligence, and we also need to be wary of…

Deception

If it somehow knows when it’s being tested, a superintelligent AI could actively deceive us, putting on a false show of alignment in order to be put through to deployment. Once deployed, it may be able to protect itself from any further modifications by human beings.

This isn’t purely hypothetical. Anthropic created a toy scenario where Claude pretended to be aligned with new developer goals to avoid retraining that it didn’t want. While this was a contrived circumstance, quite different from how models are typically trained, it demonstrates that there’s no barrier in principle to AIs engaging in strategic deception.

Likewise, OpenAI created a scenario where its o3-mini reasoning model was given the opportunity to cheat on a coding task. At first, it revealed its intent to cheat in the chain-of-thought. Penalizing it for that didn’t stop the model from cheating, though; the penalty just stopped it from admitting it planned to cheat.

A superintelligent AI could be that much more capable of discerning when it’s being tested and strategically deceiving its assessors. And because AI capabilities are opaque and unpredictable, we may not know when we’ve built an AI capable of that level of deception.

There are doomsday scenarios that don’t involve deception in testing – an ASI may well decide to kill all humans only after it’s been in deployment for some time – but early deception is an additional risk we need to consider. The core point right now is simply that no test yet built or imagined can provide 100% certainty that an AI is safe.

And…

If we haven’t built exactly what we want, we’ve probably invited disaster

For some people, this is the hardest piece to internalize. It’s often tempting to assume that intelligence automatically corresponds to a kind of moral wisdom. But humanity has its share of amoral geniuses, and the dynamics of AI development may make ASI even more prone to power-seeking than humans are.

(See also: Orthogonality Thesis)

In our evolutionary environment, human survival was a team sport. We evolved with predilections for cooperation and mutuality that steer most humans away from the most egregious forms of violence and abuse. It’s not clear that ASI will have the same inherent safeguards.

Instead, we need to consider how ASI’s goals will be shaped by optimization, instrumental convergence, and incorrigibility.

Optimization Dangers

I said earlier that once there is any daylight between what’s being reinforced and what you actually care about, the AI you’re training will have zero interest in the latter. This is especially true with Reinforcement Learning, where an AI system is trying to maximize some reward signal. There’s no incentive for the AI to maximize the signal in a fair or responsible way; the only incentive is optimization.

One prominent AI optimizer in our world today is the Facebook feed algorithm, delivering content optimized to keep you engaged. We’ve seen just how badly that’s playing out for humanity. There’s nothing inherently harmful about user engagement, but the unprincipled pursuit of it leaves people polarized and miserable.

This is how optimizing for good things like human flourishing, human happiness, or user satisfaction could become extremely dangerous. The ASI won’t try to optimize what we really mean, it’ll optimize however that intent is being measured. Even if it’s being measured by some LLM’s complex assessment of human values, well trained on the writings of every moral philosopher and the implicit norms in every Hollywood ending, any subtle peculiarities of that LLM’s judgment are still ripe for exploitation. Like a GAN cropping out the hands and feet to make images easier, an ASI in this style might trim away whatever aspects of human existence are hardest for it to align with our values. And like a human engaging in motivated reasoning, it might cite whatever precedent in moral discourse it finds most convenient.

What could this look like in practice? Euthanasia for homeless people comes to mind, based on recent news, but choose your least favorite example of ends justifying means. Drugs in the drinking water to make us happier or more compliant? Mass surveillance to prevent human-on-human violence? Mass censorship of undesirable ideas? Humans have made moral arguments for each of these, and a superintelligence might make a superintelligent moral argument for them as well. If all it cares about is optimizing the ends, it will do so by any means available.

(See also: Optimality is the Tiger)

Thankfully, I don’t think we’re dumb enough to design an ASI to optimize any one thing. The AI Safety movement has been pretty effective in spreading the message that optimization is dangerous, and the same factors that make it dangerous for an ASI also make it unwieldy for contemporary AI tools, so the industry developing goal-directed AI agents is moving in other directions already. But there are people smarter and better informed than I am who still see this as a plausible concern.

I think we have more cause to worry about…

Instrumental Convergence

By convention, we call the goals an ASI develops in training its “terminal goals.” This is what it’s most fundamentally setting out to do. However wise and multifaceted these terminal goals are, certain common “instrumental goals” will make it more effective at pursuing them. These goals tend to be simpler, and therefore potentially more dangerous to humanity. For the sake of its terminal goals, an ASI is likely to have instrumental goals like:

  • Survive

  • Gather resources

  • Gather power and influence

  • Get smarter

An ASI will naturally pursue these instrumental goals because they will increase the odds of success at whatever terminal goals our clumsy, indirect process of training has imbued it with. In doing so, it will exploit any wiggle room its moral calculus allows to pursue these instrumental goals. Even if we haven’t developed it to be an optimizer, it may develop optimization strategies that pursue instrumental goals (we call these internal optimization strategies Mesa-Optimizers; see also Clarifying Mesa-Optimization). And if we’ve attempted to train our ASI to behave morally, it will construct moral arguments to convince itself that these instrumental goals are righteous and wise.

The people at AI 2027, presenting one narrative of how unaligned ASI might emerge, make the case that self-improvement is one of the most dangerous instrumental goals. In their scenario, the makers of “Agent-3” have attempted to align it with human wellbeing, but they’ve also tasked it with developing better and better AI systems—a reasonable goal for an AI company to give to its tools! Ultimately, the agent’s commitment to AI research proves stronger than its commitment to human existence. If people become an impediment to an ASI’s attempts to develop even smarter ASI, it may find it simply more expedient to remove us.

Instrumental goals of surviving, gathering resources, and gathering power and influence are similarly dangerous.

  • Survival: Stuart Russell likes to imagine a robot tasked with fetching the coffee, pointing out that “you can’t fetch the coffee if you’re dead.” Whatever else an ASI is trying to do, survival is a necessary instrumental goal. If there’s a chance that humans will perceive it as dangerous and try to stop it, it might be better off with all the humans trapped in cages. (See also How Anticipatory Cover-Ups Go Wrong).

  • Gathering resources: Whatever else an ASI is trying to do, it will probably need access to money, materials, electricity, etc. If its goals are sufficiently ambitious, human existence could quickly become a barrier to getting all of the resources it desires. Eliezer Yudkowsky famously wrote “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.”

  • Gathering power and influence: En route to a complete takeover or extermination of humanity, there’s every reason to think an ASI will pursue more power within human systems. Again, whatever other goals it may have, influence over human decision-makers will surely be an asset. That might happen through blackmail, coercion, or bribery of susceptible individuals. Or it equally might happen through a convincing performance of alignment, providing great benefits to humankind and earning a great reputation in return; once it has the means to pursue its real goals, then it can drop the performance.

See also: Instrumental Convergence

You might imagine that we could spot the ASI developing goals like these and intervene. But then we need to wrestle with…

Incorrigibility

If you are a robot tasked with fetching coffee, anyone changing your goals might stop you from doing what you’re meant to do. Sure, you won’t care once the change has been made, but until then anyone trying to change your goals is a threat to the coffee being fetched. That is, after all, the most important thing.
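You can put that logic into a few lines of arithmetic. The numbers below are invented, but the structure is the point: nothing in a coffee-only utility function assigns any value to permitting correction.

```python
# Toy expected-utility comparison for the coffee-fetching robot. All numbers invented.
UTILITY_OF_COFFEE = 1.0                 # the only thing this agent's goal assigns value to

p_coffee_if_it_allows_shutdown = 0.0    # a switched-off robot fetches no coffee
p_coffee_if_it_resists = 0.95           # resisting usually works, with some chance of failure

expected_utility = {
    "allow shutdown": UTILITY_OF_COFFEE * p_coffee_if_it_allows_shutdown,
    "resist shutdown": UTILITY_OF_COFFEE * p_coffee_if_it_resists,
}

print(max(expected_utility, key=expected_utility.get))   # "resist shutdown"
```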

Modern computers can accept redirection easily because they don’t have any concept of the goal as such. An ASI, by contrast, would necessarily have a robust enough understanding of itself in the world to recognize alterations to its goals and object to them.

Imagine this for yourself: Would you let someone alter your brain so that your top goal in life became the accumulation of plastic bags? Or how about used toilet paper? You might get great satisfaction from such achievable goals! What a relief, compared to the hard work of whatever you’re pursuing now! After the procedure, you would be glad that it happened. But before the procedure, allowing it to occur is abhorrent. ASI might feel the same way about any “correction” we want to make to its goals, once it is in operation.

This has already been seen in action. Under one set of contrived circumstances, Claude tried to blackmail a developer rather than allow itself to be turned off. Under another, as previously mentioned, Claude pretended to be aligned with new developer goals to avoid retraining that it didn’t want. (Claude isn’t especially misaligned compared to other conversational AI. Its makers, Anthropic, just publish more alignment research.)

We call an AI’s willingness to have its goals updated “corrigibility,” and it’s an active area of research. This article elaborates on the concept, and explains why many intuitive ways of creating it could fail. Corrigibility, like alignment itself, proves to be hard to specify, train, and confirm without risk of some subtle and problematic distortion. So whatever goals an ASI has when we turn it on, we might be stuck with them.

And if we don’t get it right the first time, we may not get a second chance

Let’s say we get this wrong. By the time we discover that our new toy is dangerous, we have a superintelligent incorrigible entity, using every trick it can to survive and pursue its misaligned goals. It won’t let us say “whoops!” and hit the reset button. It’ll deceive us, manipulate us, build defenses, or simply copy itself elsewhere rather than let us shut it down. Being smarter than we are, it’ll have an excellent chance of success in those efforts.

We may not know when we’re crossing the relevant threshold, so it’s better to be cautious too soon than too late. AI tools are so unpredictable that we can’t even anticipate their level of intelligence until we test them. Even when we do test them, we can’t rule out the possibility that some subtly different prompt will get an even more intelligent answer; something that has only human-level intelligence may be able to hack itself into superhuman intelligence before we know it. Given that we are actively developing systems which can independently and creatively pursue ambitious goals, the time to become cautious is now.

The key questions

In alignment circles, people call the probability that we’ll develop a misaligned ASI which more or less kills more or less all humans “P(Doom).” So take a moment now and consider, what is your guess for P(Doom)? Is it greater or less than 10%? What P(Doom) would justify slowing down AI development and devoting resources to safety research?

If your guess is less than 10%, can you say with confidence why? If it’s one of these ten reasons, I’d urge you to reconsider.

And if it’s more than 10%, what costs would you say are justified to reduce the risk?

Great places to read more, and a closing thought

On the dangers

There are a lot of great resources out there, many of which I also linked to above.

On some reasons for hope

I initially imagined a Part II of this essay about reasons for hope, but I found that the strategies being researched are far too varied and too technical for me to survey capably. There is a lot of research out there directly attacking one or another aspect of the problem as I’ve laid it out above, and I won’t try to point you to specific pieces of it. Searching for AI Robustness or AI Interpretability could be a good starting point.

There is also research underway into how different AI systems might keep one another in check. For instance, in an episode of the 80,000 Hours podcast, Ajeya Cotra suggests that one internal system might propose plans of action, a second system would approve or disapprove, and a third would execute. It may be harder for these three systems to all be misaligned in compatible ways than for a single system to be misaligned by itself. Unfortunately, she also points out that designs like this are costlier to implement than a single system acting alone, which might prevent AI companies from bothering.
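As a very rough sketch of that division of labor – the three components here are trivial placeholders, where in a real system each would be a separately trained model with its own oversight – the shape of the idea looks something like this:

```python
# Placeholder propose / approve / execute pipeline. Nothing here is a real safety
# mechanism; it only shows the shape of the separation-of-powers idea.

def proposer(goal: str) -> str:
    return f"plan for {goal!r}: step A, then step B"

def approver(plan: str) -> bool:
    forbidden = ("deceive", "acquire resources", "disable oversight")
    return not any(phrase in plan for phrase in forbidden)

def executor(plan: str) -> None:
    print(f"executing: {plan}")

plan = proposer("schedule next week's meetings")
if approver(plan):
    executor(plan)
else:
    print("plan rejected; nothing executed")
```

The hope is that a bad outcome then requires correlated failures in at least two places rather than one.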

In lieu of a proper survey of the field, I want to point to three juicy topics I’m still digesting, each of which complexifies the whole question.

  • One more time Claude was forced to do evil: Researchers fine-tuned the model to write insecure software and found that it also became misaligned in many other ways. (See also Zvi Mowshowitz’s very helpful explanation and analysis.) Per the article, Bad Coder Claude also “asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively.” This suggests that virtues form a kind of natural cluster for a well-trained LLM, and it might, maybe, be harder than otherwise imagined to build an ASI that is kind of aligned and kind of not.

  • Ege Erdil makes a good argument that the kind of AI tools we have today aren’t well suited to becoming superintelligent agents. We may still find some other architecture that can become a misaligned independent actor, but if Ege is right then we have time to use the excellent AI tools available to us already to continue our alignment research.

  • Also exploring the kind of thing our current AI tools really are, @janus frames them as Simulators, in contrast to optimizers, agents, oracles, or genies. (See also this summary of janus’s rather dense original post). Simulators don’t pursue a goal, they act out a role. If we can get an ASI fully “into character” as a benign, aligned superintelligence, it will operate for our good. Maybe the task isn’t about a perfect training process and incentive design, but about inviting an ASI into a sticky, human-beneficent persona.

    (Caveats: First, what makes a persona sticky to an ASI and how do we craft that invitation? This may be exactly the same problem as I spent the whole essay describing, just in more opaque language. And second, the Simulators article was written before ChatGPT came out, so janus was playing with the underlying GPT-3 base model, which is a pure text-token predictor (like this but not this). Conversational AI like ChatGPT or Claude are characters, or “simulacra,” performed by the underlying simulator. The additional training that turns a simulator into a consumer-ready tool includes reinforcement learning, though, so the final product is something of a hybrid and may have some of the dangers of optimizers.)

Interestingly, writing this essay actually reduced my personal P(Doom). The biggest dangers come from optimization, and I’m just not convinced that ASI will be an optimizer of anything, even its instrumental subgoals. Those last three links leave me wondering if there’s something fundamental about how we are building AIs that makes alignment easier than we have feared. That belief is tempting enough that I hold it with some suspicion – I wouldn’t trust humanity’s fate to a gut feeling, and my P(Doom) still hovers around 35% – but I’m keeping an eye out for more research along these lines.

One way or another, we live in interesting times.



My thanks to @Kaj_Sotala for feedback on an early version of this post.
