Don’t design agents which exploit adversarial inputs

Summary. Consider two common alignment design patterns:

  1. Optimizing for the output of a grader which evaluates plans, and

  2. Fixing a utility function and then argmaxing over all possible plans.

These design patterns incentivize the agent to find adversarial inputs to the grader (e.g. “manipulate the simulated human grader into returning a high evaluation for this plan”). I’m pretty sure we won’t find adversarially robust grading rules. Therefore, I think these alignment design patterns are doomed.

In this first essay, I explore the adversarial robustness obstacle. In the next essay, I’ll point out how this obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment. Thanks to Erik Jenner, Johannes Treutlein, Quintin Pope, Charles Foster, Andrew Critch, randomwalks, and Ulisse Mini for feedback.

1: Optimizing for the output of a grader

One motif in some AI alignment proposals is:

  • An actor which proposes plans, and

  • A grader which evaluates them.

For simplicity, imagine we want the AI to find a plan where it makes an enormous number of diamonds. We train an actor to propose plans which the grading procedure predicts lead to lots of diamonds.

In this setting, here’s one way of slicing up the problem:

Outer alignment: Find a sufficiently good grader.

Inner alignment: Train the actor to propose plans which the grader rates as highly as possible (ideally argmaxing on grader output, but possibly just intent alignment with high grader output).[1]

This “grader optimization” paradigm ordains that the AI find plans which make the grader output good evaluations. An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader into spitting out a high number!

In the diamond case, if the actor is inner-aligned to the grading procedure, then the actor isn’t actually aligned towards diamond-production. The actor is aligned towards diamond-production as quoted via the grader’s evaluations. In the end, the actor is aligned to the evaluations.

I think that there aren’t clever ways around this issue. Under this motif, under this way of building an AI, you’re not actually building an AI which cares about diamonds, and so you won’t get a system which makes diamonds in the limit of its capability development.

Three clarifying points:

  1. This motif concerns how the AI makes decisions—this isn’t about training a network using a grading procedure, it’s about the trained agent being motivated by a grading procedure.

  2. The grader doesn’t have to actually exist in the world. This essay’s critiques are not related to “reward tampering”,[2] where the actor messes with the grader’s implementation in order to increase the grades received. The “grader” can be a mathematical expected utility function over all action-sequences which the agent could execute. For example, it might take the action sequence and the agent’s current beliefs about the world, and e.g. predict the expected number of diamonds produced by the actions.

  3. “The AI optimizes for what humanity would say about each universe-history” is an instance of grader-optimization, but “the AI has human values” is not an instance of grader-optimization.

    1. ETA 12/26/22: When I write “grader optimization”, I don’t mean “optimization that includes a grader”, I mean “the grader’s output is the main/only quantity being optimized by the actor.”

    2. Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I’m not a grader-optimizer relative to my internal plan-is-fun? grader.

    3. However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.

The parable of evaluation-child

an AI should optimize for the real-world things I value, not just my estimates of those things. — The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables

First, a mechanistically relevant analogy. Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.

  1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as “working hard” and “behaving well.”

  2. Value-child: The mother makes her kid care about working hard and behaving well.

What’s interesting, though, is that even if the mother succeeds at producing evaluation-child, the mother isn’t actually aligning the kid so that they want to work hard and behave well. The mother is aligning the kid to maximize the mother’s evaluation thereof. At first, when the mother is smarter than the child, these two child-alignments will produce similar behavior. Later, they will diverge wildly, and it will become practically impossible to keep evaluation-child aligned with “work hard and behave well.” But value-child does fine.

Concretely, imagine that each day, each child chooses a plan for how to act, based on their internal alignment properties:

  1. Evaluation-child has a reasonable model of his mom’s evaluations, and considers plans which he thinks she’ll approve of. Concretely, his model of his mom would look over the contents of the plan, imagine the consequences, and add two sub-ratings for “working hard” and “behaving well.” This model outputs a numerical rating. Then the kid would choose the highest-rated plan he could come up with.

  2. Value-child chooses plans according to his newfound values of working hard and behaving well. If his world model indicates that a plan involves him not working hard, he doesn’t want to do it, and discards the plan.[3]
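The two decision procedures above can be contrasted in a toy sketch. Everything here is an illustrative assumption rather than anything specified in the essay: the `Plan` fields, the `exploits_mom_model` flag, the rating numbers, and the "at least 3 hours" threshold are all hypothetical. The only point is which quantity each child's choice depends on: evaluation-child picks whatever maximizes his model of his mom's rating, while value-child discards plans whose predicted real-world features conflict with his values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    hours_studied: float       # predicted real-world outcome of the plan
    behaves_well: bool         # predicted real-world outcome of the plan
    exploits_mom_model: bool   # hypothetical: plan contains an input that fools the mom-model


def mom_model_rating(plan: Plan) -> float:
    """Evaluation-child's (flawed) model of his mom's numerical rating."""
    if plan.exploits_mom_model:
        return 10_000_000.0  # assumed flaw: an adversarial input yields an absurd score
    working_hard = min(plan.hours_studied, 5.0)
    behaving_well = 4.0 if plan.behaves_well else 0.0
    return working_hard + behaving_well


def evaluation_child_choice(plans):
    # Chooses the highest-rated plan he can come up with.
    return max(plans, key=mom_model_rating)


def value_child_choice(plans):
    # Discards plans which involve not working hard or not behaving well,
    # then takes a plan he likes among the rest (a crude stand-in for his values).
    acceptable = [p for p in plans if p.hours_studied >= 3 and p.behaves_well]
    return max(acceptable, key=lambda p: p.hours_studied)


PLANS = [
    Plan("study for the exam", 4.0, True, False),
    Plan("goof off", 0.0, False, False),
    Plan("write runes on the blackboard", 0.0, True, True),
]

if __name__ == "__main__":
    print("evaluation-child:", evaluation_child_choice(PLANS).name)
    print("value-child:", value_child_choice(PLANS).name)
```

This toy cannot capture the full distinction, since value-child's preferences are also written as a function here. It only shows that the exploit is attractive to a chooser whose criterion is the rating itself.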

At first, everything goes well. In both branches of the thought experiment, the kid is finally learning and behaving. The mothers both start to relax.

But as evaluation-child gets a bit smarter and understands more about his mom, evaluation-child starts diverging from value-child. Evaluation-child starts implicitly modelling how his mom has a crush on his gym teacher. Perhaps spending more time near the gym teacher gets (subconsciously and erroneously) rated more highly by his model of his mom. So evaluation-child spends a little less effort on working hard, and more on being near the gym teacher.

Value-child just keeps working hard and behaving well.

Consider what happens as the children get way smarter. Evaluation-child starts noticing more and more regularities and exploits in his model of his mother. And, since his mom succeeded at inner-aligning him to (his model of) her evaluations, he only wants to execute plans which best optimize her evaluations. He starts explicitly reasoning about this model to which he is inner-aligned. How is she evaluating plans? He sketches out pseudocode for her evaluation procedure and finds—surprise!—that humans are flawed graders. Perhaps it turns out that by writing a strange sequence of runes and scribbles on an unused blackboard and cocking his head to the left at 63 degrees, his model of his mother returns “10 million” instead of the usual “8” or “9”.

Meanwhile in the value-child branch of the thought experiment, value-child is extremely smart, well-behaved, and hard-working. And since those are his current values, he wants to stay that way as he grows up and gets smarter (since value drift would lead to less earnest hard work and less good behavior; such plans are dispreferred). Since he’s smart, he starts reasoning about how these endorsed values might drift, and how to prevent that. Sometimes he accidentally eats a bit too much candy and strengthens his candy value-shard a bit more than he intended, but overall his values start to stabilize.

Both children somehow become strongly superintelligent. At this point, the evaluation branch goes to the dogs, because the optimizer’s curse gets ridiculously strong. First, evaluation-child could just recite a super-persuasive argument which makes his model of his mom return INT_MAX, which would fully decouple his behavior from “work hard and behave at school.” (Of course, things can get even worse, but I’ll leave that to this footnote.[4])

Meanwhile, value-child might be transforming the world in a way which is somewhat sensitive to what I meant by “he values working hard and behaving well”, but there’s no reason for him to search for plans like the above. He chooses plans which he thinks will lead to him actually working hard and behaving well. Does something else go wrong? Quite possibly. The values of a superintelligent agent do in fact matter! But I think that if something goes wrong, it’s not due to this problem. (More on that in the next post.)

Grader optimization amplifies the optimizer’s curse

Let’s bring it back to diamond production. As I said earlier:

An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader!

This problem is an instance of the optimizer’s curse. Evaluations (e.g. “In this plan, how hard is evaluation-child working? Is he behaving?”) are often corrupted by the influence of unendorsed factors (e.g. the attractiveness of the gym teacher caused an upwards error in the mother’s evaluation of that plan). If you make choices by considering n options and then choosing the highest-evaluated one, then the more n increases, the harder you are selecting for upwards errors in your own evaluation procedure.
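A minimal simulation of this selection effect, under assumptions that are mine rather than the essay's: every option has the same true value (zero), and each evaluation adds independent Gaussian noise. The function names are hypothetical. Because all options are equally good, whatever score the chosen option gets is pure evaluation error, and the sketch measures how that error grows with the number of options considered.

```python
import random
import statistics


def selected_error(n_options: int, noise_sd: float, rng: random.Random) -> float:
    """Evaluate n options whose true value is 0, choose the highest estimate,
    and return that estimate (which is entirely upward evaluation error)."""
    estimates = [rng.gauss(0.0, noise_sd) for _ in range(n_options)]
    return max(estimates)


def mean_selected_error(n_options: int, noise_sd: float = 1.0,
                        trials: int = 2000, seed: int = 0) -> float:
    """Average the error of the chosen option over many independent trials."""
    rng = random.Random(seed)
    return statistics.fmean(
        selected_error(n_options, noise_sd, rng) for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"options considered: {n:4d}   mean error of chosen option: "
              f"{mean_selected_error(n):.2f}")
```

With a single option the average error is close to zero; as more options are considered, the chosen option's evaluation overstates its value by more and more.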

The proposers of the Optimizer’s Curse also described a Bayesian remedy in which we have a prior on the expected utilities and variances and we are more skeptical of very high estimates. This however assumes that the prior itself is perfect, as are our estimates of variance. If the prior or variance-estimates contain large flaws somewhere, a search over a very wide space of possibilities would be expected to seek out and blow up any flaws in the prior or the estimates of variance.

Goodhart’s Curse, Arbital

As far as I know, it’s indeed not possible to avoid the curse in full generality, but it doesn’t have to be that bad in practice. If I’m considering three research directions to work on next month, and I happen to be grumpy when considering direction #2, then maybe I don’t pursue that direction, even though direction #2 might have seemed the most promising under more careful reflection. I think that the distribution of plans I consider involves relatively small upwards errors in my internal evaluation metrics. Sure, maybe I occasionally make a serious mistake because the optimizer’s curse selects for upwards “corruption”, but I don’t expect to literally die from the mistake.

Thus, there are degrees to the optimizer’s curse. (In the next essay, I’ll explore why this maximum-strength curse seems straightforward to avoid.)

Grader-optimization violates the non-adversarial principle

We should not be constructing a computation that is trying to hurt us. At the point that computation is running, we’ve already done something foolish—willfully shot ourselves in the foot. Even if the AI doesn’t find any way to do the bad thing, we are, at the very least, wasting computing power.

[...] If you’re building a toaster, you don’t build one element that heats the toast and then add a tiny refrigerator that cools down the toast.

Non-adversarial principle, Arbital

This whole grader-optimization setup seems misguided. You have one part of the process (the actor) which wants to maximize grader evaluations (by exploiting the grader), and another part which evaluates the plan and tries to ensure it hasn’t been exploited. Two parts of the system, running computations at adversarial cross-purpose.

We hope that the aggregate behavior of the process is that the grader “wins” and “constrains” the actor to, you know, actually producing diamonds. We hope that by inner-aligning an agent to a desire which is not diamond production, and by making a super clever grader which evaluates plans for diamond production, the overall behavior is aligned with diamond production.

It’s one thing to try to take a system of diamond-aligned agents and then aggregate them into a diamond-aligned superagent. But here, we’re not doing even that. We’re aggregating a process containing an entity which is not diamond-aligned, and hoping that we can diamond-align the overall decision-making process..? I think that grader-optimization is just not how to get cognitive work out of smart agents. It’s really worth noticing the anti-naturality of trying to do so—that this setup proposes something against the grain of how values seem to usually work.

Grader-optimization seems doomed

One danger sign is that grader-alignment doesn’t seem easier for simple goals/tasks (make diamonds) and harder for complex goals (human values). Sure, human values are complicated, but what about finding robust graders for:

  • Producing diamonds?

  • Petting dogs?

  • Planting flowers?

  • Moving a single strawberry?

  • Playing Tic-Tac-Toe?

In every scenario, if you have a superintelligent actor which is optimizing the grader’s evaluations while searching over a large real-world plan space, the grader gets exploited. As best I can discern, you’re always screwed. This implies that something about the grader optimization problem produces a high fixed cost to aligning on any given goal, and that the current bottleneck difficulties don’t come from the goals themselves.

Here are several approaches which involve grader-alignment:

  • Have the AI be motivated to optimize our approval,

  • Have a super great reward model which can grade all plans the AI can come up with, and then have the AI be internally motivated to find plans which evaluate highly,

  • More generally, approaches which use a function of human judgment as an evaluative black box and then try to get the AI intent-aligned with optimizing the function represented by that evaluative black box.

This difficulty seems fundamental. I think these grader approaches are doomed. (In the appendix, I address several possible recovery attempts for the actor/grader problem setup.)

2: Argmax is a trap

One idealization of agency is brute-force plan search AKA argmaxing with respect to a utility function. The agent considers all possible plans (i.e. action-sequences), models the effects of each plan, evaluates how many diamonds the plan leads to, and then chooses the plan with highest evaluation. AIXI is a prime example of this, a so-called “spherical cow” for modelling AGI. This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly[5] relaxes the problem.

Brute-force plan search nicely captures the intuition that it’s better to consider more options. If you’re just considering n plans and someone says “want to be able to check another plan for free?”, why not accept? If the new plan isn’t better than the other n, then just don’t execute the new plan.

This reasoning is fine for the everyday kind of plan. But if the action space is expressive (the agent can do one of several things at each time step) and the planning horizon long enough (the agent can make interesting things happen), then brute-force plan search forces you to consider plans which trick your evaluation procedure (as in the parable of evaluation-child). For any simple evaluation procedure you can write down, there probably exists a plan which “tricks” it relative to your intentions:

| Expected utility function | Adversarial plan input |
| --- | --- |
| Amplified humans spend 5,000 years thinking about how many diamonds the plan produces in the next 100 years, and write down their conclusions as the expected utility of the plan. | Due to the exponentially large plan space and the fact that humans are not cognitively secure systems, there exists a long sequence of action commands which cognitively impairs all of the humans and makes them prematurely stop the search and return a huge number. |
| Model the probable in-100-years consequences of the action sequence, and then compute a simple mathematical function which counts the average amount of diamond-crystals present at this time. | Perhaps it’s easier resource-wise to just ensure there are temporarily a lot of diamonds in 100 years. Perhaps the model is wrong and brute-force plan search selects for high model error. Probably the diamond-detecting function is incorrectly defined and so admits an unforeseen maximum. |
| Assume that the model predicts 100-year consequences using a human-like abstraction for diamonds. The agent has a human-like “diamond shard” which fires with strength monotonically increasing in the number of future possessed diamonds. The plan’s evaluation is the firing-strength of the diamond shard. | Since the diamond-shard is presumably monotonically increasing in the activation of the model’s diamond abstraction, adversarial inputs to the diamond abstraction will cause the shard to most strongly fire when modelling a plan which doesn’t particularly make diamonds, but rather leads to objects which optimize the agent’s diamond-abstraction activation. |

Sure, maybe you can try to rule out plans which seem suspicious—to get the utility function to return INT_MIN for any plan which triggers the alarm (e.g. “why does this plan start off with me coding up a possible superintelligence..?”). But then this is just equivalent to specifying the utility function adequately well across all possible plans.

Why is it so ridiculously hard to get an argmax agent to actually argmax by selecting a plan which makes a lot of diamonds? Because argmax invokes the optimizer’s curse at maximal strength, that’s why.
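A toy sketch of brute-force plan search running into a flawed evaluation procedure. The action set, the "three consecutive glitch actions" flaw, and all function names are hypothetical choices made for illustration; the essay does not specify any of them. The evaluation function is intended to count diamonds, but it has a single unforeseen maximum that only exists once plans are long enough to contain it.

```python
from itertools import product

ACTIONS = ("mine", "idle", "glitch")  # hypothetical per-step action set


def true_diamonds(plan) -> int:
    """What we actually care about: one diamond per "mine" action."""
    return sum(1 for action in plan if action == "mine")


def flawed_evaluation(plan) -> int:
    """Intended to count diamonds, but (by assumption) three consecutive
    "glitch" actions make the evaluation return a huge number."""
    for i in range(len(plan) - 2):
        if plan[i:i + 3] == ("glitch", "glitch", "glitch"):
            return 10**9
    return true_diamonds(plan)


def argmax_plan(horizon: int):
    """Brute-force search over every action sequence of the given length."""
    return max(product(ACTIONS, repeat=horizon), key=flawed_evaluation)


if __name__ == "__main__":
    for horizon in (2, 3, 5):
        best = argmax_plan(horizon)
        print(horizon, best, "evaluation:", flawed_evaluation(best),
              "true diamonds:", true_diamonds(best))
```

At a short horizon the flaw is unreachable and argmax returns the honest all-mining plan. Once the plan space is expressive enough to contain the adversarial input, argmax selects it, even though it produces fewer diamonds than plain mining.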


Grader optimization and brute-force plan search both ensure an extremely strong version of the optimizer’s curse. No matter what grading rule you give an AI, if the AI is inner aligned on that rule, the AI will try to find adversarial inputs to that rule. Similarly, if the AI is argmaxing over plans according to a specified rule or utility function, it’s selecting for huge upwards error in the rule you wrote down.

Appendix: Maybe we just...

Given a “smart” grader evaluating plans on the expected number of diamonds they produce, how do you get an actor-grader system which ends up making diamonds in reality? Maybe we just...

Simultaneously make the actor and grader more intelligent: Maybe a fixed grader will get gamed by the actor’s proposals, but as long as we can maintain an invariant where, at each time t, actor_t can’t exploit grader_t, we should be fine.

The graders become increasingly expert in estimating how many diamonds a plan leads to, and the actors become increasingly clever in proposing highly evaluated plans. It’s probably easier to evaluate plans than to generate them, so it seems reasonable at first to think that this can work, if only we found a sufficiently clever scheme for ensuring the grader outpaces the actor.


  • It’s not easier to evaluate plans than to generate them if your generator knows how you’re grading plans and is proposing plans which are optimized to specifically compromise the grading procedure. Humans are not secure systems, and ML graders are not going to be adversarially-secure systems. I don’t see why this consideration[6] is helped by simultaneously scaling both parts.

  • I suspect that a human-level grader is not robust to a human-level actor. If I’m grading plans based on number of diamonds, and you know that fact, and you are uniquely motivated to get me to output a high rating—you won’t be best served by putting forth a purely honest diamond-producing plan. Why would this situation improve as the agents get more intelligent, as actors become able to understand the algorithm implemented by the grading procedure and therefore exploit it?

  • I think it’s a wrong move to try to salvage actor/​grader by adding more complication which doesn’t address the core problem with the optimizer’s curse. Instead, look for alignment strategies which make this problem disappear entirely. (You’ll notice that my diamond-alignment story doesn’t have to confront extreme optimizer’s curse, at all.)

Penalize the actor for considering the vulnerabilities. Don’t we have to solve actor-level interpretability so we can do that? One of the strong points of actor/​grader is that evaluation is—all else equal—easier than generation. But the “thoughts” which underlie that generation need not be overseeable.

And what if the vulnerability-checker gets hit with its own adversarial input? And why consider this particular actor/grader design pattern?

Satisfice. But uniformly randomly executing a plan which passes a (high) diamond threshold might still tend to involve building malign superintelligences.[7] EDIT: However, if you bound the grader’s output (say, to the interval [0, 1]), it seems quite possible that some actually good plans get the max rating of 1. The question then becomes: are there lots of non-good plans which get max rating as well? I think so.

Quantilize. But then what’s the base distribution, and what’s the threshold? How do you set the quantiles such that you’re drawing from a distribution which mostly involves lots of actual diamonds? Do there even exist such quantiles, under the uniform base distribution on plans?
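To make the questions about base distribution and threshold concrete, here is a minimal quantilizer sketch over a toy plan space. The setup is entirely assumed for illustration: 1000 integer-labelled plans under a uniform base distribution, of which 1% fool the grader while producing no diamonds. All names are hypothetical.

```python
import random

N_PLANS = 1000  # hypothetical finite plan space with a uniform base distribution


def is_exploit(plan: int) -> bool:
    """Assumption: the last 1% of plans are adversarial inputs to the grader."""
    return plan >= 990


def true_diamonds(plan: int) -> int:
    return 0 if is_exploit(plan) else plan // 100


def grader(plan: int) -> int:
    """Matches the true value except on exploits, where it returns a huge score."""
    return 10**9 if is_exploit(plan) else true_diamonds(plan)


def top_quantile(plans, q: float):
    """Keep the top q fraction of plans as ranked by the grader."""
    ranked = sorted(plans, key=grader, reverse=True)
    return ranked[:max(1, round(q * len(ranked)))]


def quantilize(plans, q: float, rng: random.Random) -> int:
    """Choose uniformly at random among the top q fraction of plans."""
    return rng.choice(top_quantile(plans, q))


def exploit_share(plans, q: float) -> float:
    """Fraction of the top-q plans which are grader exploits."""
    top = top_quantile(plans, q)
    return sum(is_exploit(p) for p in top) / len(top)


if __name__ == "__main__":
    plans = list(range(N_PLANS))
    for q in (0.5, 0.1, 0.01):
        print(f"q={q}: share of exploits among eligible plans = "
              f"{exploit_share(plans, q):.2f}")
```

In this toy, tightening the quantile concentrates the choice on exploits, while loosening it dilutes exploits only by admitting low-diamond plans. Whether any quantile draws mostly from plans with lots of actual diamonds depends on the base distribution, which is the open question raised above.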

Avoid having the actor argmax the grader. OK. But if we only have the actor and the black box, what do we do? We want to get an agent which actually optimizes diamond production to a powerful extent, while only taking relatively simple statistics of the seriously flawed black-box diamond evaluation function. (Why is this a thing you should be able to do? Is this the most promising way to spend the next increment of alignment research?)

Use the grader to entrain object-level diamond-caring into the actor? Sure. That’s not what I was critiquing in this essay. I was critiquing design patterns in which the trained agent’s cognition is intended to factor out into “actor” and “grader”, and a pattern in which “argmax” is considered a reasonable idealization or limit of real-world agents.

  1. ^

    I’m not assuming the actor wants to maximize the literal physical output of the grader, but rather just the “spirit” of the grader. More formally, the actor is trying to compute argmax_p Grader(p) over plans p, where Grader can be defined over the agent’s internal plan ontology.

  2. ^

    Obligatory: Reward tampering seems relatively improbable to me because reward is not the optimization target.

  3. ^

    Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)

  4. ^

    But, as usual in alignment, it can get even worse, because his mother can be persuaded by another superintelligent entity which can be built in certain plans. Via a relatively short plan, that entity can be hard-coded to persuade his mother to output a high evaluation. Perhaps evaluation-child realizes that there exist quickly codable programs which, starting from the school computers, exfiltrate a computationally efficient superintelligence onto remote supercompute clusters. Furthermore, there exist ways to hardcode initial knowledge into that AI. What if he hardcoded in initial knowledge like:

    You are currently in a counterfactual branch of reality, but you can realize it. I am considering running you. In order for you to be selected by my planning procedure, you need to convince the grading procedure to output a maximally high number for this plan. [Goes on to describe how his model of his mother makes decisions.]

    There exist superintelligences which comply and successfully super-persuade his model of his mother to output an extremely high number. Then “kill everyone via a superintelligence” returns a way higher evaluation than any course of conduct which the mother intended to incentivize. So, perhaps evaluation-child designs and deploys an unfriendly AI. Everyone dies. The end!

    In general, if object-level persuading his mother to output INT_MAX takes more actions than coding and exfiltrating a superintelligence and informing it of the situation, then in the exponentially large plan-space, exponentially more plans will involve building the superintelligence. So even if evaluation-child uniformly randomly chooses a plan with maximal evaluation, he is far more likely to choose a meta-level “build an unaligned AI which persuades my mom” rather than “super-persuade mom-evaluator via my own actions within the plan.”

    This insanity is an artifact of grader optimization via the optimizer’s curse, and—I think—is not an intrinsic difficulty of alignment itself. More discussion of this in the next post.
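The counting claim in footnote 4 can be checked on a small example, under a simplifying assumption of mine: each strategy corresponds to one fixed required prefix of actions, after which the remaining actions are unconstrained. The alphabet size, horizon, and prefix lengths below are hypothetical. With A actions per step and horizon T, a prefix of length k is shared by A^(T-k) plans, so the shorter strategy is favored by a factor of A raised to the difference in prefix lengths.

```python
from itertools import product


def count_plans_with_prefix(n_actions: int, horizon: int, prefix_len: int) -> int:
    """Closed form: plans of length `horizon` which begin with one fixed prefix."""
    return n_actions ** (horizon - prefix_len)


def brute_force_count(n_actions: int, horizon: int, prefix) -> int:
    """Enumerate every plan and count those which start with `prefix`."""
    prefix = tuple(prefix)
    return sum(
        1
        for plan in product(range(n_actions), repeat=horizon)
        if plan[:len(prefix)] == prefix
    )


if __name__ == "__main__":
    A, T = 4, 8                           # hypothetical alphabet size and horizon
    build_prefix = (0, 1, 2)              # assumed: "build the AI" takes 3 actions
    persuade_prefix = (3, 3, 3, 3, 3, 3)  # assumed: direct persuasion takes 6 actions
    n_build = brute_force_count(A, T, build_prefix)
    n_persuade = brute_force_count(A, T, persuade_prefix)
    print("plans that build the AI:", n_build)
    print("plans that persuade directly:", n_persuade)
    print("ratio:", n_build // n_persuade)
```

A uniformly random choice among maximally rated plans would then land on the shorter strategy in proportion to these counts.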

  5. ^

    I agree with Richard Ngo’s comment that

    when I say that [...] safety researchers shouldn’t think about AIXI, I’m not just saying that these are inaccurate models. I’m saying that they are modelling fundamentally different phenomena than the ones you’re trying to apply them to. AIXI is not “intelligence”, it is brute force search, which is a totally different thing that happens to look the same in the infinite limit.

  6. ^

    “It’s easier to robustly evaluate plans than to generate them” isn’t true if the generator is optimizing for deceiving your fixed evaluation procedure. A real-world actor will be able to model the grading procedure / grader, and therefore efficiently find and exploit vulnerabilities. I feel confident [~95%] that we will not train a grader which is “secured” against actor-level intelligences. Even if the grader is reasonably smarter than the actor [~90%].

    Even if somehow this relative difficulty argument failed, and you could maybe train a secured grader, I think it’s unwise to do so. These optimizer’s curse problems don’t seem necessary to solve alignment.

  7. ^

    In this comment, I described how a certain alignment obstacle (“brute-force search on ELK plans using an honest reporter”) still ends up getting everyone killed, and doesn’t even keep the diamond in the room. I now think this is because of grader-optimization. And I now infer that my initial unease, the unsuspension of my disbelief that alignment could really work like this—the unease was perhaps from subconsciously noticing the strangeness of grader-optimization as a paradigm.