Worlds Where Iterative Design Fails

In most technical fields, we try designs, see what goes wrong, and iterate until it works. That’s the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice.

In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.

By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.

Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons:

  • Fast takeoff: there will be a sudden phase shift in capabilities, and the design of whatever system first undergoes that phase shift needs to be right on the first try.

  • Deceptive inner misalignment: an inner agent behaves well in order to deceive us, so we can’t tell there’s a problem just by trying stuff and looking at the system’s behavior.

… but these certainly aren’t the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I’d encourage you to think on your own about others. These are the things which kill us; they’re worth thinking about.

Basics: Hiding Problems

Example/​Analogy: The Software Executive

Imagine that a software company executive, concerned about the many errors coming from the software, creates a new incentive scheme: software developers get a monetary reward for changes which decrease the rate of error messages showing up on the manager’s dashboard, and get docked for changes which increase the rate of error messages.

As Tyler Cowen would say: “solve for the equilibrium”. Obvious equilibrium here: the developers stop throwing error messages when they detect a problem, and instead the software just fails silently. The customer’s experience remains the same, but the manager’s dashboard shows fewer error messages. Over time, the customer’s experience probably degrades, as more and more problems go undetected.

In the short run, the strategy may eliminate some problems, but in the long run it breaks the iterative design loop: problems are not seen, and therefore not iterated upon. The loop fails at the “see what goes wrong” step.

Why RLHF Is Uniquely Terrible

The software executive’s strategy is the same basic idea as Reinforcement Learning from Human Feedback (RLHF). AI does something, a human looks at what happened to see if it looks good/​bad, and the AI is trained on the human’s feedback. Just like the software executive’s anti-error-message compensation scheme, RLHF will probably result in some problems actually being fixed in the short term. But it renders the remaining problems far less visible, and therefore breaks the iterative design loop. In the context of AI, RLHF makes it far more likely that a future catastrophic error will have no warning signs, that overseers will have no idea that there’s any problem at all until it’s much too late.

Note that this issue applies even at low capability levels! Humans overlook problems all the time, some of those mistakes are systematic, and RLHF will select for places where humans systematically overlook problems; that selection pressure applies even when the neural net lacks great capabilities.

Net learns to hold hand in front of ball, so that it looks to a human observer like the ball is being grasped. Yes, this actually happened.

This is the core reason why I consider RLHF uniquely terrible, among alignment schemes. It is the only strategy I know of which actively breaks the iterative design loop; it makes problems less visible rather than more.

Generalization: Iterate Until We Don’t See Any Problems

More generally, one of the alignment failure modes I consider most likely is that an organization building AGI does see some problems in advance. But rather than addressing root causes, they try to train away the problems, and instead end up training the problems to no longer be easily noticeable.

Does This Prove Too Much?

One counterargument: don’t real organizations create incentives like the software executive all the time? And we have not died of it.

Response: real organizations do indeed create incentives to hide problems all the time, and large organizations are notorious for hiding problems at every level. It doesn’t even require employees consciously trying to hide things; selection pressure suffices. Sometimes important problems become public knowledge when a disaster occurs, but that’s usually after the disaster. The only reason we haven’t died of it yet is that it is hard to wipe out the human species with only 20th-century human capabilities.

Less Basic: Knowing What To Look For

Example/​Analogy: The Fusion Power Generator

Suppose, a few years from now, I prompt GPT-N to design a cheap, simple fusion power generator—something I could build in my garage and use to power my house. GPT-N succeeds. I build the fusion power generator, find that it works exactly as advertised, share the plans online, and soon the world has easy access to cheap, clean power.

One problem: at no point did it occur to me to ask “Can this design easily be turned into a bomb?”. Had I thought to prompt it with the question, GPT-N would have told me that the design could easily be turned into a bomb. But I didn’t think to ask, so GPT-N had no reason to mention it. With the design in wide use, it’s only a matter of time until people figure it out. And so, just like that, we live in a world where anyone can build a cheap thermonuclear warhead in their garage.

The root problem here is that I didn’t think to ask the right question; I didn’t pay attention to the right thing. An iterative design loop can sometimes help with that—empirical observation can draw our attention to previously-ignored issues. But when the failure mode does not happen in testing, the iterative design loop generally doesn’t draw our attention to it. An iterative design loop does not, in general, tell us which questions we need to ask.

Ok, but can’t we have an AI tell us what questions we need to ask? That’s trainable, right? And we can apply the iterative design loop to make AIs suggest better questions?

Example/​Analogy: Gunpowder And The Medieval Lord

Imagine a medieval lord in a war against someone with slightly more advanced technological knowledge. We’re not talking modern weaponry here, just gunpowder.

To the lord, it doesn’t look like the technologist is doing anything especially dangerous; mostly the technologist looks like an alchemist or a witch doctor. The technologist digs a hole, stretches a cloth over, dumps a pile of shit on top, then runs water through the shit-pile for a while. Eventually they remove the cloth and shit, throw some coal and brimstone in the hole, and mix it all together.

From the lord’s perspective, this definitely looks weird and mysterious, and they may be somewhat worried about weird and mysterious things in general. But it’s not obviously any more dangerous than, say, a shaman running a spiritual ceremony.

It’s not until after the GIANT GODDAMN EXPLOSION that the shit-pile starts to look unusually dangerous.

Now, what helpful advice could an AI give this medieval lord?

Obviously the AI could say “the powder which comes out of that weird mysterious process is going to produce a GIANT GODDAMN EXPLOSION”. The problem is, it is not cheap for the medieval lord to verify the AI’s claim. Based on the lord’s knowledge, there is no a-priori reason to expect the process to produce explosives rather than something else, and the amount of background knowledge the lord would need in order to verify the theory is enormous. The lord could in-principle verify the AI’s claim experimentally, but then (a) the lord is following a complex procedure which he does not understand handed to him by a not-necessarily-friendly AI, and (b) the lord is mixing homemade explosives in his backyard. Both of these are dubious decisions at best.

So if we’re already confident that the AI is aligned, sure, it can tell us what to look for. But if there’s two AIs side-by-side, and one is saying “that powder will explode” and the other is saying “the shit-pile ceremony allows one to see the world from afar, perhaps to spot holes in our defenses”, the lord cannot easily see which of them is wrong. The two can argue with each other debate-style, and the lord still will not easily be able to see which is wrong, because he would need enormously more background knowledge to evaluate the arguments correctly. And if he can’t tell what the problem is, then the iterative design process can’t fix it.

Example/​Analogy: Leaded Gasoline

Leaded gasoline is a decent historical analogue of the Fusion Generator Problem, though less deadly. It did solve a real problem: engines ran smoother with leaded gas. The problems were nonobvious, and took a long time to manifest. The iterative design loop did not work, because we could not see the problem just by testing out leaded gas in a lab. A test would have had to run for decades, at large scale, in order to see the issue—and that’s exactly what happened.

One could reasonably object to this example as an analogy, on the basis that things which drive the human species extinct would be more obvious. Dead bodies draw attention. But what about things which make the human species more stupid or aggressive? Lead did exactly that, after all. It’s not hard to imagine a large-scale issue which makes humans stupid or aggressive to a much greater extent, but slowly over the course of years or decades, with the problems going undetected or unaddressed until too late.

That’s not intended to be a highly probable story; there’s too much specific detail. The point is that, even if the proximate cause of extinction is obvious, the factors which make that proximate cause possible may not be. A gradual path to extinction is a real possibility. When problems only manifest on long timescales, the iterative design process is bad at fixing them.

Meta Example/​Analogy: Expertise and Gell-Mann Amnesia

If you don’t know a fair bit about software engineering, you won’t be able to distinguish good from bad software engineers.

(Assuming stats from a couple years ago are still representative, at least half my readers can probably confirm this from experience. On the other hand, last time I brought this up, one commenter said something along the lines of “Can’t we test whether the code works without knowing anything about programming?”. Would any software engineers like to explain in the comments why that’s not the only key question to ask?)

Similarly, consider Gell-Mann Amnesia:

You open the newspaper to an article on some subject you know well. In Murray’s case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the “wet streets cause rain” stories. Paper’s full of them.

In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know.

I think there’s a similar effect for expertise: software engineers realize that those outside their field have difficulty distinguishing good from bad software engineers, but often fail to generalize this to the insight that non-experts in most fields have difficulty distinguishing good from bad practitioners. There are of course some general-purpose tricks (and they are particularly useful expertise to have), but they only get you so far.

The difficulty of distinguishing good from bad experts breaks the iterative design loop at a meta level. We realize that we might not be asking the right questions, our object-level design loop might not suffice, so we go consult some experts. But then how do we iterate on our experts? How do we find better experts, or create better experts? Again, there are some general-purpose tricks available, but they’re limited. In general, if we cannot see when there’s a problem with our expert-choice, we cannot iterate to fix that problem.

More Fundamental: Getting What We Measure

I’m just going to directly quote Paul’s post on this one:

If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob’s behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods.

But if I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve.

Some examples of easy-to-measure vs. hard-to-measure goals:

  • Persuading me, vs. helping me figure out what’s true. (Thanks to Wei Dai for making this example crisp.)

  • Reducing my feeling of uncertainty, vs. increasing my knowledge about the world.

  • Improving my reported life satisfaction, vs. actually helping me live a good life.

  • Reducing reported crimes, vs. actually preventing crime.

  • Increasing my wealth on paper, vs. increasing my effective control over resources.

If I want to help Bob figure out whether he should vote for Alice, that can’t be done by trial and error. That really gets at the heart of why the iterative design loop is unlikely to suffice for alignment, even though it works so well in so many other fields. In other fields, we usually have a pretty good idea of what we want. In alignment, figuring out what we want is itself a central problem. Trial and error doesn’t suffice for figuring out what we want.

So what happens if we rely on trial and error to figure out what we want? More from Paul’s post:

We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart:

  • Corporations will deliver value to consumers as measured by profit. Eventually this mostly means manipulating consumers, capturing regulators, extortion and theft.

  • Investors will “own” shares of increasingly profitable corporations, and will sometimes try to use their profits to affect the world. Eventually instead of actually having an impact they will be surrounded by advisors who manipulate them into thinking they’ve had an impact.

  • Law enforcement will drive down complaints and increase reported sense of security. Eventually this will be driven by creating a false sense of security, hiding information about law enforcement failures, suppressing complaints, and coercing and manipulating citizens.

  • Legislation may be optimized to seem like it is addressing real problems and helping constituents. Eventually that will be achieved by undermining our ability to actually perceive problems and constructing increasingly convincing narratives about where the world is going and what’s important.

For a while we will be able to overcome these problems by recognizing them, improving the proxies, and imposing ad-hoc restrictions that avoid manipulation or abuse. But as the system becomes more complex, that job itself becomes too challenging for human reasoning to solve directly and requires its own trial and error, and at the meta-level the process continues to pursue some easily measured objective (potentially over longer timescales). Eventually large-scale attempts to fix the problem are themselves opposed by the collective optimization of millions of optimizers pursuing simple goals.

As this world goes off the rails, there may not be any discrete point where consensus recognizes that things have gone off the rails.


We might describe the result as “going out with a whimper.” Human reasoning gradually stops being able to compete with sophisticated, systematized manipulation and deception which is continuously improving by trial and error; human control over levers of power gradually becomes less and less effective; we ultimately lose any real ability to influence our society’s trajectory.

Summary & Takeaways

In worlds where the iterative design loop works for alignment, we probably survive AGI. So, if we want to improve humanity’s chances of survival, we should mostly focus on worlds where, for one reason or another, the iterative design loop fails. Fast takeoff and deceptive inner misalignment are two widely-talked-about potential failure modes, but they’re not the only ones. I wouldn’t consider either of them among the most robust ways in which the design loop fails, although they are among the most obviously and immediately dangerous failures.

Among the most basic robust design loop failures is problem-hiding. It happens all the time in the real world, and in practice we tend to not find out about the hidden problems until after a disaster occurs. This is why RLHF is such a uniquely terrible strategy: unlike most other alignment schemes, it makes problems less visible rather than more visible. If we can’t see the problem, we can’t iterate on it.

A more complicated and less legible class of design loop failures is not knowing what to look for. We might just not ask the right questions (as in the fusion power generator example), we might not even have enough background knowledge to recognize the right questions when they’re pointed out (as in the medieval lord example), it might take a very long time to get feedback on the key problems (as in the leaded gasoline example), and at a meta level we might not have the expertise to distinguish real experts from non-experts when seeking advice (as in Gell-Mann Amnesia).

Finally, we talked about Paul’s “You Get What You Measure” scenario. As Paul put it: “If I want to help Bob figure out whether he should vote for Alice—whether voting for Alice would ultimately help create the kind of society he wants—that can’t be done by trial and error.” That really captures the core reason why an iterative design loop is likely to fail for alignment, despite working so well in so many other fields: in other fields, we usually know what we want and are trying to get it. In alignment, figuring out what we want is itself a central problem, and the iterative design loop does not suffice for figuring out what we want.