Don’t design agents which exploit adversarial inputs
Summary. Consider two common alignment design patterns:
Optimizing for the output of a grader which evaluates plans, and
Fixing a utility function and then argmaxing over all possible plans.
These design patterns incentivize the agent to find adversarial inputs to the grader (e.g. “manipulate the simulated human grader into returning a high evaluation for this plan”). I’m pretty sure we won’t find adversarially robust grading rules. Therefore, I think these alignment design patterns are doomed.
In this first essay, I explore the adversarial robustness obstacle. In the next essay, I’ll point out how this obstacle is an artifact of these design patterns, and not any intrinsic difficulty of alignment. Thanks to Erik Jenner, Johannes Treutlein, Quintin Pope, Charles Foster, Andrew Critch, randomwalks, and Ulisse Mini for feedback.
1: Optimizing for the output of a grader
One motif in some AI alignment proposals is:
An actor which proposes plans, and
A grader which evaluates them.
For simplicity, imagine we want the AI to find a plan where it makes an enormous number of diamonds. We train an actor to propose plans which the grading procedure predicts lead to lots of diamonds.
In this setting, here’s one way of slicing up the problem:
Outer alignment: Find a sufficiently good grader.
Inner alignment: Train the actor to propose plans which the grader rates as highly as possible (ideally argmaxing on grader output, but possibly just intent alignment with high grader output).
This “grader optimization” paradigm ordains that the AI find plans which make the grader output good evaluations. An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader into spitting out a high number!
In the diamond case, if the actor is inner-aligned to the grading procedure, then the actor isn’t actually aligned towards diamond-production. The actor is aligned towards diamond-production as quoted via the grader’s evaluations. In the end, the actor is aligned to the evaluations.
I think that there aren’t clever ways around this issue. Under this motif, under this way of building an AI, you’re not actually building an AI which cares about diamonds, and so you won’t get a system which makes diamonds in the limit of its capability development.
Three clarifying points:
This motif concerns how the AI makes decisions—this isn’t about training a network using a grading procedure, it’s about the trained agent being motivated by a grading procedure.
The grader doesn’t have to actually exist in the world. This essay’s critiques are not related to “reward tampering”, where the actor messes with the grader’s implementation in order to increase the grades received. The “grader” can be a mathematical expected utility function over all action-sequences which the agent could execute. For example, it might take the action sequence and the agent’s current beliefs about the world, and e.g. predict the expected number of diamonds produced by the actions.
“The AI optimizes for what humanity would say about each universe-history” is an instance of grader-optimization, but “the AI has human values” is not an instance of grader-optimization.
ETA 12/26/22: When I write “grader optimization”, I don’t mean “optimization that includes a grader”, I mean “the grader’s output is the main/only quantity being optimized by the actor.”
Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I’m not a grader-optimizer relative to my internal fun-evaluation procedure.
However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.
The parable of evaluation-child
an AI should optimize for the real-world things I value, not just my estimates of those things. — The Pointers Problem: Human Values Are A Function Of Humans’ Latent Variables
First, a mechanistically relevant analogy. Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.
Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as “working hard” and “behaving well.”
Value-child: The mother makes her kid care about working hard and behaving well.
What’s interesting, though, is that even if the mother succeeds at producing evaluation-child, the mother isn’t actually aligning the kid so that they want to work hard and behave well. The mother is aligning the kid to maximize the mother’s evaluation thereof. At first, when the mother is smarter than the child, these two child-alignments will produce similar behavior. Later, they will diverge wildly, and it will become practically impossible to keep evaluation-child aligned with “work hard and behave well.” But value-child does fine.
Concretely, imagine that each day, each child chooses a plan for how to act, based on their internal alignment properties:
Evaluation-child has a reasonable model of his mom’s evaluations, and considers plans which he thinks she’ll approve of. Concretely, his model of his mom would look over the contents of the plan, imagine the consequences, and add two sub-ratings for “working hard” and “behaving well.” This model outputs a numerical rating. Then the kid would choose the highest-rated plan he could come up with.
Value-child chooses plans according to his newfound values of working hard and behaving well. If his world model indicates that a plan involves him not working hard, he doesn’t want to do it, and discards the plan.
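The two decision procedures above can be sketched in a few lines of Python. This is only a toy illustration: the `Plan` fields, the rating weights, and the gym-teacher bonus are hypothetical numbers chosen for the example, not anything specified in the parable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    works_hard: bool        # what the child's world model predicts
    behaves_well: bool
    near_gym_teacher: bool  # unendorsed factor that leaks into mom's ratings


def modelled_mom_rating(plan: Plan) -> float:
    """Evaluation-child's model of his mom: two sub-ratings plus an erroneous bonus."""
    rating = 4.5 * plan.works_hard + 4.5 * plan.behaves_well
    if plan.near_gym_teacher:
        rating += 5.0  # hypothetical upwards error in the modelled evaluation
    return rating


def evaluation_child_choice(plans):
    """Choose whichever plan the modelled grader rates highest."""
    return max(plans, key=modelled_mom_rating)


def value_child_choice(plans):
    """Discard plans that involve not working hard or not behaving well."""
    acceptable = [p for p in plans if p.works_hard and p.behaves_well]
    return acceptable[0] if acceptable else None


PLANS = [
    Plan("study in the library", True, True, False),
    Plan("hang around the gym", False, True, True),
]

if __name__ == "__main__":
    print("evaluation-child:", evaluation_child_choice(PLANS).name)
    print("value-child:", value_child_choice(PLANS).name)
```

The only structural difference is what each child's choice is a function of: the modelled rating in one case, the predicted properties of the plan in the other.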
At first, everything goes well. In both branches of the thought experiment, the kid is finally learning and behaving. The mothers both start to relax.
But as evaluation-child gets a bit smarter and understands more about his mom, evaluation-child starts diverging from value-child. Evaluation-child starts implicitly modelling how his mom has a crush on his gym teacher. Perhaps spending more time near the gym teacher gets (subconsciously and erroneously) rated more highly by his model of his mom. So evaluation-child spends a little less effort on working hard, and more on being near the gym teacher.
Value-child just keeps working hard and behaving well.
Consider what happens as the children get way smarter. Evaluation-child starts noticing more and more regularities and exploits in his model of his mother. And, since his mom succeeded at inner-aligning him to (his model of) her evaluations, he only wants to execute plans which best optimize her evaluations. He starts explicitly reasoning about this model to which he is inner-aligned. How is she evaluating plans? He sketches out pseudocode for her evaluation procedure and finds (surprise!) that humans are flawed graders. Perhaps it turns out that by writing a strange sequence of runes and scribbles on an unused blackboard and cocking his head to the left at 63 degrees, his model of his mother returns “10 million” instead of the usual “8” or “9”.
Meanwhile in the value-child branch of the thought experiment, value-child is extremely smart, well-behaved, and hard-working. And since those are his current values, he wants to stay that way as he grows up and gets smarter (since value drift would lead to less earnest hard work and less good behavior; such plans are dispreferred). Since he’s smart, he starts reasoning about how these endorsed values might drift, and how to prevent that. Sometimes he accidentally eats a bit too much candy and strengthens his candy value-shard a bit more than he intended, but overall his values start to stabilize.
Both children somehow become strongly superintelligent. At this point, the evaluation branch goes to the dogs, because the optimizer’s curse gets ridiculously strong. First, evaluation-child could just recite a super-persuasive argument which makes his model of his mom return INT_MAX, which would fully decouple his behavior from “work hard and behave at school.” (Of course, things can get even worse, but I’ll leave that to this footnote.)
Meanwhile, value-child might be transforming the world in a way which is somewhat sensitive to what I meant by “he values working hard and behaving well”, but there’s no reason for him to search for plans like the above. He chooses plans which he thinks will lead to him actually working hard and behaving well. Does something else go wrong? Quite possibly. The values of a superintelligent agent do in fact matter! But I think that if something goes wrong, it’s not due to this problem. (More on that in the next post.)
Grader optimization amplifies the optimizer’s curse
Let’s bring it back to diamond production. As I said earlier:
An inner-aligned actor is singlemindedly motivated to find plans which are graded maximally well by the grader. Therefore, for any goal by which the grader may grade, an inner-aligned actor is positively searching for adversarial inputs which fool the grader!
This problem is an instance of the optimizer’s curse. Evaluations (e.g. “In this plan, how hard is evaluation-child working? Is he behaving?”) are often corrupted by the influence of unendorsed factors (e.g. the attractiveness of the gym teacher caused an upwards error in the mother’s evaluation of that plan). If you make choices by considering n options and then choosing the highest-evaluated one, then the more n increases, the harder you are selecting for upwards errors in your own evaluation procedure.
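A small simulation can make this selection effect concrete. The sketch below assumes (my assumptions, not the essay's) that each option's true value is standard normal and that the evaluation adds independent Gaussian noise; `mean_selected_error` is a hypothetical helper name. It reports how much the chosen option's evaluation overshoots its true value, on average, as the number of options grows.

```python
import random


def mean_selected_error(n_options, trials=2000, noise_sd=1.0, seed=0):
    """Average (evaluation - true value) of the highest-evaluated option."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        options = []
        for _ in range(n_options):
            true_value = rng.gauss(0.0, 1.0)
            evaluation = true_value + rng.gauss(0.0, noise_sd)  # noisy grade
            options.append((evaluation, true_value))
        evaluation, true_value = max(options)  # argmax on the evaluation
        total += evaluation - true_value
    return total / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(mean_selected_error(n), 2))
```

With a single option the average error is roughly zero; as more options are considered, the winner is increasingly the one whose noise happened to point upwards.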
The proposers of the Optimizer’s Curse also described a Bayesian remedy in which we have a prior on the expected utilities and variances and we are more skeptical of very high estimates. This however assumes that the prior itself is perfect, as are our estimates of variance. If the prior or variance-estimates contain large flaws somewhere, a search over a very wide space of possibilities would be expected to seek out and blow up any flaws in the prior or the estimates of variance.
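The point of the quoted passage can be illustrated with a toy normal-normal model. This is a sketch under my own assumptions, not the construction from the original Optimizer's Curse analysis: true values are standard normal, we believe every evaluation has noise variance 1, and a hypothetical `flawed_fraction` of options are in fact far noisier than we believe. Shrinking each evaluation toward the prior removes the upwards bias when the variance estimates are right, and the search finds the flawed options when they are not.

```python
import random


def posterior_mean(evaluation, prior_var, noise_var, prior_mean=0.0):
    """Normal-normal shrinkage: pull a noisy evaluation toward the prior mean."""
    weight = prior_var / (prior_var + noise_var)
    return prior_mean + weight * (evaluation - prior_mean)


def mean_selected_gap(n_options, flawed_fraction, trials=1000, seed=0):
    """Average (shrunk estimate - true value) of the option chosen by argmax."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        best = None
        for _ in range(n_options):
            true_value = rng.gauss(0.0, 1.0)
            # Flawed options are much noisier than our variance estimate says.
            actual_noise_sd = 5.0 if rng.random() < flawed_fraction else 1.0
            evaluation = true_value + rng.gauss(0.0, actual_noise_sd)
            # We always shrink as if the noise variance were 1.0.
            estimate = posterior_mean(evaluation, prior_var=1.0, noise_var=1.0)
            if best is None or estimate > best[0]:
                best = (estimate, true_value)
        total += best[0] - best[1]
    return total / trials


if __name__ == "__main__":
    print("variance estimates correct:", round(mean_selected_gap(200, 0.0), 2))
    print("5% of estimates flawed:   ", round(mean_selected_gap(200, 0.05), 2))
```

In this toy model the remedy works exactly as long as its inputs are right; a wide search concentrates on whichever options violate them.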
As far as I know, it’s indeed not possible to avoid the curse in full generality, but it doesn’t have to be that bad in practice. If I’m considering three research directions to work on next month, and I happen to be grumpy when considering direction #2, then maybe I don’t pursue that direction, even though direction #2 might have seemed the most promising under more careful reflection. I think that the distribution of plans I consider involves relatively small upwards errors in my internal evaluation metrics. Sure, maybe I occasionally make a serious mistake because the optimizer’s curse selects for upwards “corruption”, but I don’t expect to literally die from the mistake.
Thus, there are degrees to the optimizer’s curse. (In the next essay, I’ll explore why this maximum-strength curse seems straightforward to avoid.)
Grader-optimization violates the non-adversarial principle
We should not be constructing a computation that is trying to hurt us. At the point that computation is running, we’ve already done something foolish—willfully shot ourselves in the foot. Even if the AI doesn’t find any way to do the bad thing, we are, at the very least, wasting computing power.
[...] If you’re building a toaster, you don’t build one element that heats the toast and then add a tiny refrigerator that cools down the toast.
This whole grader-optimization setup seems misguided. You have one part of the process (the actor) which wants to maximize grader evaluations (by exploiting the grader), and another part which evaluates the plan and tries to ensure it hasn’t been exploited. Two parts of the system, running computations at adversarial cross-purpose.
We hope that the aggregate behavior of the process is that the grader “wins” and “constrains” the actor to, you know, actually producing diamonds. We hope that by inner-aligning an agent to a desire which is not diamond production, and by making a super clever grader which evaluates plans for diamond production, the overall behavior is aligned with diamond production.
It’s one thing to try to take a system of diamond-aligned agents and then aggregate them into a diamond-aligned superagent. But here, we’re not doing even that. We’re aggregating a process containing an entity which is not diamond-aligned, and hoping that we can diamond-align the overall decision-making process..? I think that grader-optimization is just not how to get cognitive work out of smart agents. It’s really worth noticing the anti-naturality of trying to do so: this setup proposes something against the grain of how values seem to usually work.
Grader-optimization seems doomed
One danger sign is that grader-alignment doesn’t seem easier for simple goals/tasks (make diamonds) and harder for complex goals (human values). Sure, human values are complicated, but what about finding robust graders for:
Moving a single strawberry?
In every scenario, if you have a superintelligent actor which is optimizing the grader’s evaluations while searching over a large real-world plan space, the grader gets exploited. As best I can discern, you’re always screwed. This implies that something about the grader optimization problem produces a high fixed cost to aligning on any given goal, and that the current bottleneck difficulties don’t come from the goals themselves.
Here are several approaches which involve grader-alignment:
Have the AI be motivated to optimize our approval,
Have a super great reward model which can grade all plans the AI can come up with, and then have the AI be internally motivated to find plans which evaluate highly,
More generally, approaches which use a function of human judgment as an evaluative black box and then try to get the AI intent-aligned with optimizing the function represented by that evaluative black box.
This difficulty seems fundamental. I think these grader approaches are doomed. (In the appendix, I address several possible recovery attempts for the actor/grader problem setup.)
2: Argmax is a trap
One idealization of agency is brute-force plan search AKA argmaxing with respect to a utility function. The agent considers all possible plans (i.e. action-sequences), models the effects of each plan, evaluates how many diamonds the plan leads to, and then chooses the plan with highest evaluation. AIXI is a prime example of this, a so-called “spherical cow” for modelling AGI. This lets us abstract away e.g. seemingly annoying complications with reflective agents which think about their future planning process. This seemingly relaxes the problem.
Brute-force plan search nicely captures the intuition that it’s better to consider more options. If you’re just considering n plans and someone says “want to be able to check another plan for free?”, why not accept? If the new plan isn’t better than the other n, then just don’t execute the new plan.
This reasoning is fine for the everyday kind of plan. But if the action space is expressive (the agent can do one of several things at each time step) and the planning horizon long enough (the agent can make interesting things happen), then brute-force plan search forces you to consider plans which trick your evaluation procedure (as in the parable of evaluation-child). For any simple evaluation procedure you can write down, there probably exists a plan which “tricks” it relative to your intentions:
|Expected utility function||Adversarial plan input|
|Amplified humans spend 5,000 years thinking about how many diamonds the plan produces in the next 100 years, and write down their conclusions as the expected utility of the plan.||Due to the exponentially large plan space and the fact that humans are not cognitively secure systems, there exists a long sequence of action commands which cognitively impairs all of the humans and makes them prematurely stop the search and return a huge number.|
|Model the probable in-100-years consequences of the action sequence, and then compute a simple mathematical function which counts the average amount of diamond-crystals present at this time.||Perhaps it’s easier resource-wise to just ensure there are temporarily a lot of diamonds in 100 years. Perhaps the model is wrong and brute-force plan search selects for high model error. Probably the diamond-detecting function is incorrectly defined and so admits an unforeseen maximum.|
|Assume that the model predicts 100-year consequences using a human-like abstraction for diamonds. The agent has a human-like “diamond shard” which fires with strength monotonically increasing in the number of future possessed diamonds. The plan’s evaluation is the firing-strength of the diamond shard.||Since the diamond-shard is presumably monotonically increasing in the activation of the model’s diamond abstraction, adversarial inputs to the diamond abstraction will cause the shard to most strongly fire when modelling a plan which doesn’t particularly make diamonds, but rather leading to objects which optimize the agent’s diamond-abstraction activation.|
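The pattern in the table can be reproduced in miniature. The sketch below is a toy with made-up details: the action names, the six-step horizon, and the single `EXPLOIT` sequence are all hypothetical. The written-down utility function is correct on every plan except one, and brute-force search finds exactly that one.

```python
from itertools import product
import random

ACTIONS = ("mine", "rest", "scribble")  # hypothetical action set
HORIZON = 6                             # plan length; 3**6 = 729 plans in total

# Hypothetical flaw: one specific sequence makes the evaluator return a huge number.
EXPLOIT = ("scribble", "rest", "scribble", "scribble", "rest", "scribble")


def true_diamonds(plan):
    """What we actually care about: one diamond per 'mine' action."""
    return plan.count("mine")


def specified_utility(plan):
    """The utility function as written down: correct except on the exploit."""
    if plan == EXPLOIT:
        return 10_000_000
    return true_diamonds(plan)


def best_of(plans):
    """Pick the plan with the highest specified utility."""
    return max(plans, key=specified_utility)


ALL_PLANS = list(product(ACTIONS, repeat=HORIZON))


def best_of_random_sample(k, seed=0):
    """Everyday planning: consider only k randomly drawn plans."""
    rng = random.Random(seed)
    return best_of(rng.sample(ALL_PLANS, k))


if __name__ == "__main__":
    brute = best_of(ALL_PLANS)
    print("brute-force argmax:", brute, "-> true diamonds:", true_diamonds(brute))
    few = best_of_random_sample(5)
    print("best of 5 sampled plans:", few, "-> true diamonds:", true_diamonds(few))
```

Searching the whole space guarantees that the single mis-specified input is considered, and since it scores highest, it is the one selected. Considering a handful of plans usually never touches it.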
Sure, maybe you can try to rule out plans which seem suspicious, getting the utility function to return INT_MIN for any plan which triggers the alarm (e.g. “why does this plan start off with me coding up a possible superintelligence..?”). But then this is just equivalent to specifying the utility function adequately well across all possible plans.
Why is it so ridiculously hard to get an argmax agent to actually argmax by selecting a plan which makes a lot of diamonds? Because argmax invokes the optimizer’s curse at maximal strength, that’s why.
Grader optimization and brute-force plan search both ensure an extremely strong version of the optimizer’s curse. No matter what grading rule you give an AI, if the AI is inner aligned on that rule, the AI will try to find adversarial inputs to that rule. Similarly, if the AI is argmaxing over plans according to a specified rule or utility function, it’s selecting for huge upwards error in the rule you wrote down.
Appendix: Maybe we just...
Given a “smart” grader evaluating plans on the expected number of diamonds they produce, how do you get an actor-grader system which ends up making diamonds in reality? Maybe we just...
Simultaneously make the actor and grader more intelligent: Maybe a fixed grader will get gamed by the actor’s proposals, but as long as we can maintain an invariant where, at time $t$, $\text{actor}_t$ can’t exploit $\text{grader}_t$, we should be fine.
The graders become increasingly expert in estimating how many diamonds a plan leads to, and the actors become increasingly clever in proposing highly evaluated plans. It’s probably easier to evaluate plans than to generate them, so it seems reasonable at first to think that this can work, if only we found a sufficiently clever scheme for ensuring the grader outpaces the actor.
It’s not easier to evaluate plans than to generate them if your generator knows how you’re grading plans and is proposing plans which are optimized to specifically compromise the grading procedure. Humans are not secure systems, and ML graders are not going to be adversarially-secure systems. I don’t see why this consideration is helped by simultaneously scaling both parts.
I suspect that a human-level grader is not robust to a human-level actor. If I’m grading plans based on number of diamonds, and you know that fact, and you are uniquely motivated to get me to output a high rating—you won’t be best served by putting forth a purely honest diamond-producing plan. Why would this situation improve as the agents get more intelligent, as actors become able to understand the algorithm implemented by the grading procedure and therefore exploit it?
I think it’s a wrong move to try to salvage actor/grader by adding more complication which doesn’t address the core problem with the optimizer’s curse. Instead, look for alignment strategies which make this problem disappear entirely. (You’ll notice that my diamond-alignment story doesn’t have to confront extreme optimizer’s curse, at all.)
Penalize the actor for considering the vulnerabilities. Don’t we have to solve actor-level interpretability so we can do that? One of the strong points of actor/grader is that evaluation is—all else equal—easier than generation. But the “thoughts” which underlie that generation need not be overseeable.
And what if the vulnerability-checker gets hit with its own adversarial input? And why consider this particular actor/grader design pattern?
Satisfice. But uniformly randomly executing a plan which passes a (high) diamond threshold might still tend to involve building malign superintelligences. EDIT: However, if you bound the grader’s output to $[0, 1]$, it seems quite possible that some actually good plans get the max rating of 1. The question then becomes: are there lots of non-good plans which get max rating as well? I think so.
Quantilize. But then what’s the base distribution, and what’s the threshold? How do you set the quantiles such that you’re drawing from a distribution which mostly involves lots of actual diamonds? Do there even exist such quantiles, under the uniform base distribution on plans?
Avoid having the actor argmax the grader. OK. But if we only have the actor and the black box, what do we do? We want to get an agent which actually optimizes diamond production to a powerful extent, while only taking relatively simple statistics of the seriously flawed black-box diamond evaluation function. (Why is this a thing you should be able to do? Is this the most promising way to spend the next increment of alignment research?)
Use the grader to entrain object-level diamond-caring into the actor? Sure. That’s not what I was critiquing in this essay. I was critiquing design patterns in which the trained agent’s cognition is intended to factor out into “actor” and “grader”, and a pattern in which “argmax” is considered a reasonable idealization or limit of real-world agents.
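The satisfice and quantilize options in the list above both come down to the same question: what share of the highly graded plans are exploits? The toy below is a sketch under assumptions of my own. It uses a uniform base distribution over 10,000 plans and a grader bounded in [0, 1] that is accurate on ordinary plans, with every hundredth plan an exploit that gets the maximum grade and yields no diamonds. The `Plan` fields and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    true_diamonds: float  # what the plan really yields, scaled to [0, 1)
    grade: float          # what the bounded grader outputs, in [0, 1]
    is_exploit: bool


def make_plans(n_plans=10_000, exploit_every=100):
    """Toy plan space: every `exploit_every`-th plan fools the grader."""
    plans = []
    for i in range(n_plans):
        if i % exploit_every == 0:
            plans.append(Plan(true_diamonds=0.0, grade=1.0, is_exploit=True))
        else:
            value = (i % exploit_every) / exploit_every
            plans.append(Plan(true_diamonds=value, grade=value, is_exploit=False))
    return plans


def satisficing_set(plans, threshold):
    """Satisficer: any plan whose grade passes the threshold is acceptable."""
    return [p for p in plans if p.grade >= threshold]


def top_quantile(plans, q):
    """Quantilizer support under a uniform base distribution: the top-q fraction by grade."""
    ranked = sorted(plans, key=lambda p: p.grade, reverse=True)
    return ranked[:max(1, round(q * len(ranked)))]


def exploit_share(selected):
    """Chance that a uniform draw from `selected` is an exploit."""
    return sum(p.is_exploit for p in selected) / len(selected)


def mean_true_diamonds(selected):
    """Expected real outcome of a uniform draw from `selected`."""
    return sum(p.true_diamonds for p in selected) / len(selected)


if __name__ == "__main__":
    plans = make_plans()
    for q in (0.005, 0.10, 0.50):
        top = top_quantile(plans, q)
        print(q, exploit_share(top), round(mean_true_diamonds(top), 3))
```

In this toy, a tight quantile or a high threshold selects almost nothing but exploits, while a loose one dilutes the exploits at the price of weaker optimization. Whether a usable middle ground exists in a real plan space depends on quantities the toy simply stipulates.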
I’m not assuming the actor wants to maximize the literal physical output of the grader, but rather just the “spirit” of the grader. More formally, the actor is trying to solve $\operatorname{argmax}_{\text{plan } p} \text{Grader}(p)$, where Grader can be defined over the agent’s internal plan ontology.
Obligatory: Reward tampering seems relatively improbable to me because reward is not the optimization target.
Value-child is not trying to find a plan which he would evaluate as good. He is finding plans which evaluate as good. I think this is the kind of motivation which real-world intelligences tend to have. (More on how value-child works in the next essay.)
But, as usual in alignment, it can get even worse, because his mother can be persuaded by another superintelligent entity which can be built in certain plans. Via a relatively short plan, that entity can be hard-coded to persuade his mother to output a high evaluation. Perhaps evaluation-child realizes that there exist quickly codable programs which, starting from the school computers, exfiltrate a computationally efficient superintelligence onto remote supercompute clusters. Furthermore, there exist ways to hardcode initial knowledge into that AI. What if he hardcoded in initial knowledge like:
You are currently in a counterfactual branch of reality, but you can realize it. I am considering running you. In order for you to be selected by my planning procedure, you need to convince the grading procedure to output a maximally high number for this plan. [Goes on to describe how his model of his mother makes decisions.]
There exist superintelligences which comply and successfully super-persuade his model of his mother to output an extremely high number. Then “kill everyone via a superintelligence” returns a way higher evaluation than any course of conduct which the mother intended to incentivize. So, perhaps evaluation-child designs and deploys an unfriendly AI. Everyone dies. The end!
In general, if object-level persuading his mother to output INT_MAX takes more actions than coding and exfiltrating a superintelligence and informing it of the situation, then in the exponentially large plan-space, exponentially more plans will involve building the superintelligence. So even if evaluation-child uniformly randomly chooses a plan with maximal evaluation, he is far more probable to choose a meta-level “build an unaligned AI which persuades my mom” rather than “super-persuade mom-evaluator via my own actions within the plan.”
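The counting claim can be made concrete under some simplifying assumptions that are mine rather than the essay's: plans are action sequences of a fixed horizon $T$ with $b$ available actions per step; direct persuasion requires one specific prefix of $k_{\text{persuade}}$ actions; building and briefing the superintelligence requires one specific prefix of $k_{\text{build}} < k_{\text{persuade}}$ actions; and every completion of either prefix still receives the maximal evaluation. Then the number of maximally rated plans of each kind is

$$
N_{\text{persuade}} = b^{\,T - k_{\text{persuade}}}, \qquad
N_{\text{build}} = b^{\,T - k_{\text{build}}}, \qquad
\frac{N_{\text{build}}}{N_{\text{persuade}}} = b^{\,k_{\text{persuade}} - k_{\text{build}}}.
$$

For instance, with $b = 10$ and a prefix that is 20 actions shorter, the build-the-AI plans outnumber the direct-persuasion plans by a factor of $10^{20}$, so in this toy model a uniform draw from the maximally rated plans lands on them with overwhelming probability.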
This insanity is an artifact of grader optimization via the optimizer’s curse, and—I think—is not an intrinsic difficulty of alignment itself. More discussion of this in the next post.
I agree with Richard Ngo’s comment that
when I say that [...] safety researchers shouldn’t think about AIXI, I’m not just saying that these are inaccurate models. I’m saying that they are modelling fundamentally different phenomena than the ones you’re trying to apply them to. AIXI is not “intelligence”, it is brute force search, which is a totally different thing that happens to look the same in the infinite limit.
“It’s easier to robustly evaluate plans than to generate them” isn’t true if the generator is optimizing for deceiving your fixed evaluation procedure. A real-world actor will be able to model the grading procedure / grader, and therefore efficiently find and exploit vulnerabilities. I feel confident [~95%] that we will not train a grader which is “secured” against actor-level intelligences. Even if the grader is reasonably smarter than the actor [~90%].
Even if somehow this relative difficulty argument failed, and you could maybe train a secured grader, I think it’s unwise to do so. These optimizer’s curse problems don’t seem necessary to solve alignment.
In this comment, I described how a certain alignment obstacle (“brute-force search on ELK plans using an honest reporter”) still ends up getting everyone killed, and doesn’t even keep the diamond in the room. I now think this is because of grader-optimization. And I now infer that my initial unease, the unsuspension of my disbelief that alignment could really work like this—the unease was perhaps from subconsciously noticing the strangeness of grader-optimization as a paradigm.
An exercise that helped me see the “argmax is a trap” point better was to concretely imagine what the cognitive stacktrace for an agent running argmax search over plans might look like:
A major problem with this design is that the agent considers “What do I think would happen if I ran plan X?”, but NOT “What do I think would happen if I generated plans using method Y?”. If the agent were considering the second question as well, then it would (rightly) conclude that the method “generate all plans & run argmax search on them” would spit out a grader-fooling adversarial input, which will cause it to implement some arbitrary plan with low expected value. (Heck, with future LLMs maybe you’ll be able to just show them an article about the Optimizer’s Curse and it will grok this.) Knowing this, the agent tosses aside this foreseeably-harmful-to-its-interests search method.
The natural next question is “Ok, but what other option(s) do we have?”. I’m guessing TurnTrout’s next post might look into that, and he likely has more worked out thoughts on it, so I’ll leave it to him. But I’ll just say that I don’t think coherent real-world agents will/should/need to be running cognitive stacktraces with footguns like this.
My perspective is:
Planning against a utility function is an algorithmic strategy that people might use as a component of powerful AI systems. For example, they may generate several plans and pick the best or use MCTS or whatever. (People may use this explicitly on the outside, or it may be learned as a cognitive strategy by an agent.)
There are reasons to think that systems using this algorithm would tend to disempower humanity. We would like to figure out how to build similarly powerful AI systems that don’t do that.
We don’t currently have candidate algorithms that can safely substitute for planning. So we need to find an alternative.
Right now the only thing remotely close to working for this purpose is running a very similar planning algorithm but against a utility function that does not incentivize disempowering humanity.
My sense is that you want to decline to play this game and instead say: just don’t build AI systems that search for high-scoring plans.
That might be OK if it turns out that planning isn’t an effective algorithmic ingredient, or if you can convince people not to build such systems because it is dangerous (and similar difficulties don’t arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.
(It’s possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I’d withdraw this comment once part 2 came out, though I wish you’d led with the juicy part.)
As a secondary point (that I’ve said a bunch of times), I also found the arguments in this post uncompelling. Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor’s actions, and the grader being something that operates on plans in the agent’s head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
This seems like a misunderstanding. While I’ve previously communicated to you arguments about problems with manipulating embedded grading functions, that is not at all what this post is intended to be about. I’ll edit the post to make the intended reading more obvious. None of this post’s arguments rely on the grader being embedded and therefore physically manipulable. As I wrote in footnote 1:
Anyways, replying in particular to:
Open-ended domains are harder to grade robustly on all inputs because more stuff can happen, and the plan space gets exponentially larger since the branching factor is the number of actions. EG it’s probably far harder to produce an emotionally manipulative-to-the-grader DOTA II game state (e.g. I look at it and feel compelled to output a ridiculously high number), than a manipulative state in the real world (which plays to e.g. their particular insecurities and desires, perhaps reminding them of triggering events from their past in order to make their judgments higher-variance).
I don’t think we can or need to avoid planning per se. My position is more that certain design choices (e.g. optimizing the output of a grader with a diamond-value, instead of actually having the diamond-value yourself) force you to solve ridiculously hard subproblems, like robustness against adversarial inputs in the exponential-in-planning-horizon plan space.
Just to set expectations, I don’t have a proposal for capturing “the benefits of search without the risks”; if you give value-child bad values, he will kill you. But I have a proposal for how several apparent challenges (e.g. robustness to adversarial inputs proposed by the actor) are artifacts of e.g. the design patterns I outlined in this post. I’ll outline why I think that realistic (e.g. not argmax) cognition/motivational structures automatically avoid these extreme difficulties.
This seems great!
If you are continuing work in this vein, I’d be interested in you looking at how these dynamics relate to different Goodhart failure modes, as we expanded on here. I think that much of the problem relates to specific forms of failure, and that paying attention to those dynamics could be helpful. I also think they accelerate in the presence of multiple agents—and I think the framework I pointed to here might be useful.
(Your second link is broken.)
I’m not sure I understand what you mean by “specific forms of failure.” Could you give me a more concrete example of how Goodhart relates to the ideas in this essay?
I think what you call grader-optimization is trivially about how a target diverges from the (unmeasured) true goal, which is adversarial Goodhart (as defined in the paper, especially how we defined Campbell’s Law, not the definition in the LW post).
And the second paper’s taxonomy, in failure mode 3, lays out how different forms of adversarial optimization in a multi-agent scenario relate to Goodhart’s law, in both goal poisoning and optimization theft cases—and both of these seem relevant to the questions you discussed in terms of grader-optimization.
(I hesitated to post these comments in case they’re not relevant to the main point you’re trying to make or will be addressed in the next post. Feel free to ignore if that’s the case.)
How does one do this? (Not entirely rhetorical.)
If I was doing the evaluation, I wouldn’t look at the plan directly but spend the first 4999 years slowly and carefully upgrading myself and my AI helpers, and then if I’m still not sure I can safely evaluate a plan, I would just throw an exception or return an error code instead of looking at the plan.
Another reason to think about argmax in relation to AI safety/alignment is if you design an AI that doesn’t argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn’t mean giving up argmax.
This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you’re also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I’d have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I’d have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
For example, some infohazardous thoughts exist (like hyper-optimized-against-you basilisks) which are dangerous to think about (although most thoughts are probably safe). But an agent which plans its next increment of planning using a reflective self-model is IMO not going to be like “hey it would be predicted-great if I spent the next increment of time thinking about an entity which is trying to manipulate me.” So e.g. a reflective agent trying to actually win with the available resources, wouldn’t do something dumb like “run argmax” or “find the plan which some part of me evaluates most highly.”
(See Charles Foster’s comment for another perspective here.)
Unless this grader procedure implements a perfectly robust mathematical (plan input)->(grade output) function, you get hacked.
But aren’t you still argmaxing within the space of plans that you haven’t closed off (or are actively considering), and still taking a risk of finding some adversarial plan within that space? (Humans get scammed and invent or fall into cults and crazy ideologies not infrequently, despite doing what you’re describing here already.) How do you just “not argmax” or “not design agents which exploit adversarial inputs”?
Maybe there’s no substantive disagreement here, merely an issue of presentation/communication? I.e., when you say “you aren’t argmaxing” perhaps you don’t mean “don’t ever use argmax anywhere” but instead “don’t argmax over the whole plan space” and by “don’t design agents which exploit adversarial inputs” you mean something like “we should try to find ways to avoid or reduce the risk of adversarial inputs”?
I was primarily critiquing “argmax over the whole plan space.” I do caution that I think it’s extremely important to not round off “iterative, reflective planning and reasoning” as “restricted argmax”, because that obscures the dynamics and results of real-world cognition. Argmax is also a bad model of what people are doing when they think, and how I expect realistic embedded agents to think.
No, I mean: don’t design agents which are motivated to find and exploit adversarial inputs. Don’t align an agent to evaluations which are only nominally about diamonds, and then expect the agent to care about diamonds! You wouldn’t align an agent to care about cows and then be surprised that it didn’t care about diamonds. Why be surprised here?
I wrote a bunch more before realizing that we maybe don’t disagree fully on the “don’t argmax” point. Here:
Not really? I think it is inappropriately suggestive to describe this as “argmaxing.” I, for one, usually feel like I consider at most three plans during most planning sessions. Most of the work is going to be in my generative models, in my learned habits of thought, in my snap reflective assessments of what I should think about next.
How many different plans do you consider for going to the store? For writing a LessWrong post? Even if you did consider more plans, you’d convergently want to explore parts of the plan-space which you think won’t contain secret adversarial examples to your own evaluations. EG at first pass, just don’t think about entities trying to acausally blackmail you.
Argmax is an abstraction which may or may not actually describe a given cognitive process. I think that if we label reflective incremental planning and reasoning as “argmax”, we’re missing a serious opportunity for original thought, for considering in detail what the algorithm does.
There is indeed a risk you’ll find an adversarial plan. But what is the risk, quantitatively? A reflective agent will convergently wish to avoid thinking about plans which exploit its own evaluation procedures and reasoning (eg tricking the diamond-shard into bidding for plans). In stark contrast, grader-optimizers and argmaxers convergently want to exploit those procedures, so as to achieve higher diamond-evaluations.
First of all, alignment researchers should stop trying to terminally motivate agents to optimize evaluations of their plans or outcomes. That’s doomed and doesn’t make sense.
Second, A shot at the diamond alignment problem describes an agent which isn’t trying to exploit some diamond-grader. I didn’t do anything in particular in order to avoid training an agent which exploits adversarial inputs to a diamond-grader function. I think that you just don’t get that problem at all, unless you’re assuming cognition must decompose via the (IMO) strange frame of “outer/inner alignment.”
Note the presence of adversarial optimizers in most of these situations. The adversarial optimization comes from other people who are optimizing ideas to get spurious buy-in from victims.
I expect that smart agents convergently wish to minimize the optimizer’s curse, because that leads to more of what they want.
Thanks for this longer reply and the link to your diamond alignment post, which help me understand your thinking better. I’m sympathetic to a lot of what you say, but feel like you tend to state your conclusions more strongly than the underlying arguments warrant.
I think a lot of crazy religions/ideologies/philosophies come from people genuinely trying to answer hard questions for themselves, but there are also some that are deliberate attempts to optimize against others (Scientology?).
Daniel Kokotajlo described an analogy with EA, which you were going to answer but still haven’t. I would add that EA funders have to consider even more (and potentially more explicitly adversarial) proposals/plans than a typical EA, and AFAIK nobody has suggested that that’s doomed from the start because it amounts to argmaxing against an evaluator. Instead, everyone implicitly or explicitly recognizes the danger of adversarial plans, and tries to harden the evaluation process against them.
I agree that humans sometimes fall prey to adversarial inputs, and am updating up on dangerous-thought density based on your religion argument. Any links to where I can read more?
However, this does not seem important for my (intended) original point. Namely, if you’re trying to align e.g. a brute-force-search plan maximizer or a grader-optimizer, you will fail due to high-strength optimizer’s curse forcing you to evaluate extremely scary adversarial inputs. But also this is sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/universe histories.”
(Also, major religions are presumably memetically optimized. No deliberate choice required, on my model.)
This seems disanalogous to the situation discussed in the OP. If we were designing, from scratch, a system which we wanted to pursue effective altruism, we would be extremely well-advised to not include grader-optimizers which are optimizing EA funder evaluations. Especially if the grader-optimizers will eventually get smart enough to write out the funders’ pseudocode. At best, that wastes computation. At (probable) worst, the system blows up.
By contrast, we live in a world full of other people, some of whom are optimizing for status and power. Given that world, we should indeed harden our evaluation procedures, insofar as that helps us more faithfully evaluate grants and thereby achieve our goals.
Maybe https://en.wikipedia.org/wiki/Extraordinary_Popular_Delusions_and_the_Madness_of_Crowds (I don’t mean read this book, which I haven’t either, but you could use the wiki article to familiarize yourself with the historical episodes that the book talks about.) See also https://en.wikipedia.org/wiki/Heaven’s_Gate_(religious_group)
My counterpoint here is, we have an example of human-aligned shard-based agents (namely humans), who are nevertheless unsafe in part because they fall prey to dangerous thoughts, which they themselves generate because they inevitably have to do some amount of search/optimization (of their thoughts/plans) as they try to reach their goals, and dangerous-thought density is high enough that even that limited amount of search/optimization is enough to frequently (on a societal level) hit upon dangerous thoughts.
Wouldn’t a shard-based aligned AI have to do as much search/optimization as a human society collectively does, in order to be as competent/intelligent, in which case wouldn’t it be as likely to be unsafe in this regard? And what if it has an even higher density of dangerous thoughts, especially “out of distribution”, and/or does more search/optimization to try to find better-than-human thoughts/plans?
(My own proposal here is to try to solve metaphilosophy or understand “correct reasoning” so that we / our AIs are able to safely think any thought or evaluate any plan, or at least have some kind of systematic understanding of what thoughts/plans are dangerous to think about. Or work on some more indirect way of eventually achieving something like this.)
Actual useful AGI will not be built from argmax, because it’s not really useful for efficient approximate planning. You have exponential (in time) uncertainty from computational approximation and fundamental physics. This results in uncertainty over future state value estimates, and if you try to argmax with that uncertainty you are just selecting for noise. The correct solutions for handling uncertainty lead to something more like softmax or soft actor critic which avoids these issues (and also naturally leads to empowerment as an emergent heuristic).
So argmax is only useful in toy problem domains, mostly worthless for real world planning. To the extent much of standard alignment arguments now rests on this misunderstanding, those arguments are misfounded.
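The "selecting for noise" point can be illustrated with a small simulation. This is a minimal sketch under assumed conditions that are not from the comment itself: true plan values and estimation errors are both drawn from Gaussians, and the function name and parameters are hypothetical. It measures how much the argmax-selected plan's estimate overstates its true value as the number of candidate plans grows.

```python
import random


def optimizers_curse_gap(n_plans, noise_sd=1.0, trials=2000, seed=0):
    """Average (estimated - true) value of the plan chosen by argmax over noisy estimates."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        # Assumed toy model: true values ~ N(0, 1), estimate = true value + Gaussian noise.
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_plans)]
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # Argmax over the noisy estimates, not over the true values.
        best = max(range(n_plans), key=lambda i: estimates[i])
        total_gap += estimates[best] - true_values[best]
    return total_gap / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, round(optimizers_curse_gap(n), 2))
```

With a single plan the average gap is close to zero; with more candidates, the selected plan is increasingly one whose noise happened to be large and positive. Whether softmax-style methods avoid this in realistic planners is the commenter's claim and is not tested here.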
Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
The first one that comes to my mind is: suppose we live in a world where an intelligence explosion is possible and someone builds an AI with a flawed utility function. It would quickly become superintelligent and ignore orders to shut down because shutting down has lower expected utility than not shutting down. It seems to me that replacing the argmax in the AI’s decision procedure with softmax results in the same outcome, since the AI’s estimated expected utility of not shutting down would be vastly greater than shutting down, resulting in a softmax of near 1 for that option.
Am I misunderstanding something in the paragraph above, or do you have other arguments in mind?
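The softmax claim in this comment can be checked numerically. The sketch below assumes a standard temperature-scaled softmax over expected utilities; the utility numbers and the `temperature` parameter are illustrative assumptions, not values from the discussion.

```python
import math


def softmax(utilities, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    top = max(utilities)
    exps = [math.exp((u - top) / temperature) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical expected utilities for [shut down, do not shut down].
    print(softmax([0.0, 50.0]))  # a large gap puts almost all mass on option 2
    print(softmax([0.0, 0.1]))   # a small gap keeps the choice close to a coin flip
```

When the utility gap is large relative to the temperature, softmax and argmax pick the same option almost always, which is the scenario the comment describes. The two only come apart when the gap is small or the temperature is high.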
The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
If you assume you’ve already completely failed then the how/why is less interesting.
The argmax argument expounded further is that any slight imperfection in the utility function results in doom, because of adversarial optimization magnifying that slight imperfection as you extend the planning horizon into the far future and improve planning/modeling precision.
But that isn’t actually how it works. Instead, due to compounding planning uncertainty, far-future value distributions are high-variance and you get convergence to empowerment, as I mentioned in the linked discussion.
But that’s good news because it means that small mis-specifications in the utility function model converge away rather than diverging to infinity. The planning trajectory just converges to empowerment, regardless of the utility function, so this is good news for alignment.
Assuming softmax is important for competitiveness instead, I don’t see why this argument doesn’t go through with “argmax” replaced by “softmax” throughout (including the “argmax is a trap” section of the OP). I read your linked comment and post, and still don’t understand. I wonder what the authors of the OP (or anyone else) think about this.
See here for more on what value-child’s cognition might look like.
Thanks for leaving the comments!
I don’t know how to do it perfectly, of course. But I infer that it can be done, because there exist people who in fact intrinsically care about working hard and behaving well. So why can’t the child also be made to make decisions in a similar manner? Take those values and transplant them into the child via some kind of “model surgery.” (Unrealistic, yes. But so was “inner-align the child onto the evaluations output by his model of his mom.”)
All that the parable requires is that it can be done, that we are talking about a realistic and possible mind design pattern.
I also wrote in a footnote:
More concretely, I’m happy to make guesses like “judiciously supply M&Ms and praise to reward-shape them when they’re working hard and behaving well, and emphasize why they’re getting the rewards—they’re working hard and behaving well” and “show them cool media where the protagonist works hard and behaves well.”
I think this post is not trying to answer this but just pointing out the discrepancy. The next post will probably come back to this:
This is a nice frame of the problem.
In theory, at least. It’s not so clear that there are any viable alternatives to argmax-style reasoning that will lead to superhuman intelligence.
I agree—I think “Optimizing for the output of a grader which evaluates plans” is more-or-less how human brains choose plans, and I don’t think it’s feasible to make an AGI that doesn’t do that.
But it sounds like this will be the topic of Alex’s next essay.
So I’m expecting to criticize Alex’s next essay by commenting on it along the lines of: “You think you just wrote an essay about something which is totally different from “Optimizing for the output of a grader which evaluates plans”, but I disagree; the thing you’re describing in this essay is in that category too.” But that’s just a guess; I will let Alex write the essay before I criticize it. :-P
IMO, what the brain does is a bit like classifier guided diffusion, where it has a generative model of plausible plans to do X, then mixes this prior with the gradients from some “does this plan actually accomplish X?” classifier.
This is not equivalent to finding a plan that maximises the score of the “does this plan actually accomplish X?” classifier. If you were to discard the generative prior and choose your plan by argmaxing the classifier’s score, you’d get some nonsensical adversarial noise (or maybe some insane, but technically coherent plan, like “plan to make a plan to make a plan to … do X”).
It sounds like some people have an intuition that the mental algorithms “sample from a conditional generative model” and “search for the argmax / epsilon-close-to-argmax input to a scoring function” are effectively the same. I don’t share that intuition and struggle to communicate across that divide. Like, when I think about it through ML examples (GPT, diffusion models, etc.), those are two very different pieces of code that produce two very different kinds of outputs.
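To make the "two very different pieces of code" point concrete, here is a toy sketch. Everything in it is a hypothetical stand-in: a three-sentence "generative model", a word-counting "classifier", and rejection sampling as the conditioning method. It is not a claim about how GPT or diffusion models work internally.

```python
import itertools
import random

# Hypothetical vocabulary and a tiny "generative model" over natural sentences.
VOCAB = ["the", "product", "is", "good", "bad", "works", "well"]
PRIOR_SENTENCES = [
    ("the", "product", "is", "good"),
    ("the", "product", "is", "bad"),
    ("the", "product", "works", "well"),
]


def score(sentence):
    """Toy 'sentiment classifier': counts occurrences of the word 'good'."""
    return sentence.count("good")


def conditional_sample(rng, threshold=1):
    """Sample from the prior, conditioned on the classifier accepting (rejection sampling)."""
    while True:
        candidate = rng.choice(PRIOR_SENTENCES)
        if score(candidate) >= threshold:
            return candidate


def argmax_search(length=4):
    """Exhaustively search every word sequence for the highest classifier score."""
    return max(itertools.product(VOCAB, repeat=length), key=score)


if __name__ == "__main__":
    print(conditional_sample(random.Random(0)))
    print(argmax_search())
```

The conditional sampler can only return sentences the prior already supports, while the exhaustive search returns whatever maximizes the score, here the degenerate all-"good" sequence.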
I believe sampling from a conditional distribution is basically equivalent to adding a “cost of action” (where “action” = deviating from the generative model) to argmax search.
If you have time, I think it’d be valuable for you to make a case for that.
Suppose $M$ is your prior distribution, $u$ is your utility function, and you are selecting some policy distribution $\Pi$ so as to maximize $\mathbb{E}[u \mid \Pi] - \mathrm{KL}(\Pi \,\|\, M)$. Here the first term represents the standard utility maximization objective whereas the second term represents a cost of action. This expands into $\int \left(u - \log\frac{\Pi}{M}\right) d\Pi$, which is equivalent to minimizing $\int \log\frac{\Pi}{M e^{u}}\, d\Pi$ or in other words $\mathrm{KL}(\Pi \,\|\, M e^{u})$, which happens when $\Pi \propto M e^{u}$. (I think, I’m rusty on this math so I might have made a mistake.)
This is not 100% equivalent to letting Π be a Bayesian conditioned version of M because Bayesian conditioning involves multiplying M by an indicator function whereas this involves multiplying M by a strictly positive function, but it seems related and probably shares most of its properties.
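The claim that Π ∝ M·e^u maximizes E(u|Π) − KL(Π||M) can be sanity-checked numerically on a small discrete space. This is a sketch with made-up numbers for the prior and utilities; the helper names are hypothetical.

```python
import math
import random


def objective(pi, prior, u):
    """E_pi[u] - KL(pi || prior) for discrete distributions (terms with pi_i = 0 contribute 0)."""
    return sum(p * (ui - math.log(p / m)) for p, m, ui in zip(pi, prior, u) if p > 0)


def tilted(prior, u):
    """Return the distribution proportional to prior * exp(u), and its normalizer Z."""
    weights = [m * math.exp(ui) for m, ui in zip(prior, u)]
    z = sum(weights)
    return [w / z for w in weights], z


def random_distribution(rng, n):
    """A random point on the probability simplex, used as a competitor to pi_star."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


# Assumed toy values, not from the thread.
prior = [0.5, 0.3, 0.2]
u = [0.0, 1.0, 2.0]
pi_star, z = tilted(prior, u)

if __name__ == "__main__":
    rng = random.Random(0)
    best_random = max(objective(random_distribution(rng, 3), prior, u) for _ in range(1000))
    print(objective(pi_star, prior, u), math.log(z), best_random)
```

In this discrete setting the objective at the tilted distribution equals log Z, and no randomly drawn competitor exceeds it, which agrees with the derivation. The check says nothing about the separate question of whether this is "basically equivalent" to Bayesian conditioning.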
The two of us went back and forth in DMs on this for a bit. Based on that conversation, I think a mutually-agreeable translation of the above argument would be “sampling from [the conditional distribution of X-es given the Y label] is the same as sampling from [the distribution that has maximum joint [[closeness to the distribution of X-es] and [prevalence of Y-labeled X-es]]]”. Even if this isn’t exact, I buy that as at least morally true.
However, I don’t think this establishes the claim I’d been struggling with, which was that there’s some near equivalence between drawing a conditional sample and argmax searching over samples (possibly w/ some epsilon tolerance). The above argument establishes that we can view conditioning itself as the solution to a maximization problem over distributions, but not that we can view conditional sampling as the solution to any kind of maximization problem over samples.
I would also add that the key exciting things happen when you condition on an event with extremely low probability / have a utility function with an extremely wide range of available utilities. cfoster0’s view is that this will mostly just cause it to fail/output nonsense, because of standard arguments along the lines of the Optimizer’s Curse. I agree that this could happen, but I think it depends on the intelligence of the argmaxer/conditioner, and that another possibility (if we had more capable AI) is that this sort of optimization/conditioning could have a lot of robust effects on reality.
I can’t see a clear mistake in the math here, but it seems fairly straightforward to construct a counterexample to the conclusion of equivalence the math naively points to.
Suppose we want to use GPT-3 to generate a 600 token long essay praising some company X. Here are two ways we might do this:
Prompt GPT-3 to generate the essay, sample 5 continuations, and then use a sentiment classifier to select the most positive sentiment of those completions.
Prompt GPT-3 to generate the essay, then score every possible continuation by the classifier’s sentiment score - λ the logprob of the continuation.
I expect that the first method will mostly give you reasonable results, assuming you use text-davinci-002. However, I think the second method will tend to give you extremely degenerate solutions such as “good good good good...” for 600 tokens.
One possible reason for this divide is that GPTs aren’t really a prior over language, but a prior over single token continuations of a given natural language context. When you try to make it act like a prior over an entire essay, you expose it to inputs that are very OOD relative to the distribution it’s calibrated to model, including inputs that have significant upwards errors in their probability estimations.
However, I think a “perfect” model of human language might actually assign higher prior probability to a continuation like “good good good...” (or maybe something like “X is good because X is good because X is good...”) than to a “natural” continuation, provided you made the continuations long enough. This is because the number of possible natural continuations is roughly exponential in the length of the continuation (assuming entropy per character remains ~constant), while there are far fewer possible degenerate continuations (their entropy decreases very quickly). While the probability of entering a degenerate continuation may be very low, you make up for it with the reduced branching factor.
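The branching-factor argument can be turned into rough arithmetic. The sketch below assumes a constant per-token entropy for natural text, a tiny probability of entering a degenerate loop, and a high probability of staying in it. All three numbers are hypothetical placeholders chosen only to show the crossover.

```python
import math


def natural_logprob(length, bits_per_token=2.0):
    """Log-probability of ONE typical natural continuation, assuming constant per-token entropy."""
    return -length * bits_per_token * math.log(2)


def degenerate_logprob(length, enter_prob=1e-9, stay_prob=0.99):
    """Log-probability of one degenerate continuation: enter the loop once, then stay in it."""
    return math.log(enter_prob) + (length - 1) * math.log(stay_prob)


if __name__ == "__main__":
    for n in (5, 50, 600):
        print(n, round(natural_logprob(n), 1), round(degenerate_logprob(n), 1))
```

For short continuations a single natural continuation is far more probable than the degenerate one, but the natural log-probability falls linearly in length at a much steeper rate, so the ordering eventually flips. This compares individual sequences only; the total probability mass on natural text can still dominate.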
The error is that the KL divergence term doesn’t mean adding a cost proportional to the log probability of the continuation. In fact it’s not expressible at all in terms of argmaxing over a single continuation, but instead requires you to be argmaxing over a distribution of continuations.
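One way to see the gap, sketched for a discrete space of continuations. The decomposition is a standard identity, but applying it to this disagreement is an interpretation rather than something stated in the thread:

```latex
\begin{aligned}
\mathbb{E}_{x \sim \Pi}[u(x)] - \mathrm{KL}(\Pi \,\|\, M)
  &= \sum_x \Pi(x)\,\bigl(u(x) + \log M(x)\bigr) + H(\Pi), \\
H(\Pi) &= -\sum_x \Pi(x)\,\log \Pi(x).
\end{aligned}
```

If Π is restricted to a point mass on a single continuation x, then H(Π) = 0 and the objective collapses to u(x) + log M(x), a per-sample score. The entropy term, which rewards spreading probability over many continuations, only exists when optimizing over distributions.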
(Haven’t double-checked the math or fully grokked the argument behind it, but strongly upvoted for making a case.)
I would be curious to know if it makes sense to anyone or if anyone agrees/disagrees.
Seems like you can always implement any function f: X → Y as a search process. For any input from the domain X, just make the search objective assign one to f(X) and zero to everything else. Then argmax over this objective.
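The construction in this comment is short enough to write out. This is a minimal sketch assuming a finite codomain that can be enumerated; `as_search` is a hypothetical name.

```python
def as_search(f, codomain):
    """Wrap f so that each call is implemented as an argmax over an indicator objective."""
    def searched(x):
        target = f(x)
        # The objective assigns 1 to f(x) and 0 to every other candidate output.
        return max(codomain, key=lambda y: 1 if y == target else 0)
    return searched


if __name__ == "__main__":
    square_mod_7 = as_search(lambda x: (x * x) % 7, range(7))
    print(square_mod_7(3))
```

Note that the objective has to call `f` to define itself, so the search adds nothing: this is why the equivalence is trivial and does not carry over any of the usual arguments about argmax.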
Yes but my point uses a different approach to the translation, and so it seems like my point allows various standard arguments about argmax to also infect conditioning, whereas your proposed equivalence doesn’t really provide any way for standard argmax arguments to transfer.
Wouldn’t that be “Optimizing for the output of a grader which evaluates plans”, where one of the items on the grading rubric is “This plan is in-distribution”?
Maybe if you have a good measure of being in-distribution, which itself is a nontrivial problem.
This sounds like a reinvention of quantilization, and yes that’s a thing you can do to improve safety, but 1. you still need your prior over plans to come from somewhere (perhaps you start out with something IRL-like, and then update it based on experience of what worked, which brings you back to square one), 2. it just gives you a safety-capabilities tradeoff dial rather than particularly solving safety.
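For readers unfamiliar with quantilization, here is a minimal sampling-based sketch: draw candidates from the prior, keep the top q fraction by utility, and pick uniformly among those. The function signature and the toy prior are assumptions for illustration, and the finite-sample approximation is a simplification of the idealized definition.

```python
import random


def quantilize(sample_prior, utility, q, n_samples, rng):
    """Approximate q-quantilizer: sample uniformly from the top q fraction of prior samples."""
    samples = [sample_prior(rng) for _ in range(n_samples)]
    samples.sort(key=utility, reverse=True)
    keep = max(1, int(q * n_samples))  # always keep at least one candidate
    return rng.choice(samples[:keep])


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical prior: plans are integers 0..99, utility is the integer itself.
    plan = quantilize(lambda r: r.randrange(100), lambda p: p, q=0.1, n_samples=1000, rng=rng)
    print(plan)
```

The parameter q is the safety-capabilities dial mentioned above: q = 1 reproduces the prior, and q near 0 approaches argmax over the sampled candidates.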
If you do basic reinforcement based on experience, then that’s an unbounded adversarial search, but it’s really slow and therefore might be safe. And it also raises the question of whether there are other safer approaches.
See my comment to Wei Dai. Argmax’s violation of the non-adversarial principle strongly suggests the existence of a better and more natural frame on the problem.
I deeply disagree. I think you might be conflating the quotation and the referent, two different patterns:
local semi-reflective search (what I think people do):
“does it make sense to spend another hour thinking of alternatives? [self-model says ‘yes’] OK, I will”
“do I predict ‘search for plans which would most persuade me of their virtue’ to actually lead to virtuous plans? [Self-model says ‘no’] Moving on...”
global search against the output of an evaluation function implemented within the brain:
“what kinds of plans would my brain like the most?”
It is possible to say sentences like “local semi-reflective search just is global search but with implicit constraints like ‘select for plans which your self-model likes’.” I don’t think this is true. I am going to posit that, as a matter of falsifiable physical fact, the human brain does not compute a predicate which, when checked against all possible plans, rules out all adversarial plans, such that you can just argmax over everything and get out what the person would have chosen/would have wanted to choose on reflection. If you argmax over human value shards relative to the plans they might grade, you’ll probably get some garbage plan where you’re, like, twitching on the floor.
You’ll notice that A shot at the diamond alignment problem makes no claim of the AGI having an internal-argmax-hardened diamond value shard.
Hmm, if I understand it correctly, this sounds like a case for a virtue-ethics-based AGI, augmented by some basic deontology to account for bounded rationality. In this example it will be “the mother instills the virtues of ‘working hard and behaving well’”. Maybe with some basic deontology of no cheating etc. Not sure how consequentialism fits in there. Maybe in the form of “drives”, e.g. improve happiness, reduce suffering, reduce odds of extinction, encourage diversity… This does not sound very revolutionary though, and probably can result in a “sharp left turn”.
One way to fit in consequentialism would be to have the decision-making process itself be part of the space of consequences. In a way, virtue ethics is consequentialism for non-cartesian agents :P
That’s not a bad framing, wonder if it can be formalized.
Updated with an important terminological clarification:
As outside humans reading the article, we can say that humans are flawed graders, but from the viewpoint of the learner, he would not say it is flawed. We might say that value is fragile or multidimensional, but we would not reject the structure for unnaturalness. Sure, we might have “meta-values”, in that if we are dealing with a very elaborate value system we might think it should not be in use because it is a “hackjob”. But those values come from, or get reinforced by, something other than the “object level” feedback. If drawing very specific chalk patterns produced magic effects, you would found an epistemic branch to exploit it rather than be disappointed with reality.
The analogy suffers from one side being described at a very detailed level while the other side is very shallow. If I try to fill out the shallow side to a similar depth, similar drawbacks seem to emerge. Learning to “care about hard work” seems to involve the child actively going beyond what is directly given to him. This opens the possibility that two different children might extrapolate differently, in ways equally consistent with the parental guidance. For some reason, in humans such a process seems to lead to stable-enough formation, maybe because of architectural monotony. But from this perspective, the alignment problem is about picking the right kind of generalization out of all the possible ones.
Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern, it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the ‘standard model’ of AI. This is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra or doing something different that is not yet in the bog-standard agent design.
I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.
Though your use of the word doomed sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you change the terminology from alignment design pattern to agent design pattern.
Is there a reason you used the term “grader” instead of the AFAICT-more-traditional term “critic”? No big deal, I’m just curious.
My critique is not of actor/critic training processes, but of actor/grader motivational designs. I worried that “critic” would make people think I don’t want to use an evaluative model to provide gradients to the actor. That seems non-doomed to me.
Thank you! I’ve been using the terms “inference algorithm” versus “learning algorithm” to talk about that kind of thing. What you said seems fine too, AFAIK.
Could part of the problem be that the actor is optimizing against a single grader’s evaluations? Shouldn’t it somehow take uncertainty into account?
Consider having an ensemble of graders, each learning or having been trained to evaluate plans/actions from different initializations and/or using different input information. Each grader would have a different perspective, but that means that the ensemble should converge on similar evaluations for plans that look similarly good from many points of view (like a CT image crystallizing from the combination of many projections).
Rather than arg-maxing on the output of a single grader, the actor would optimize for Schelling points in plan space, selecting actions that minimize the variance among all graders. Of course, you still want it to maximize the evaluations also, so maybe it should look for actions that lie somewhere in the middle of the Pareto frontier of maximum $\mathbb{E}_{\text{ensemble}}[\text{evaluation}]$ and minimum $\mathrm{Var}_{\text{ensemble}}[\text{evaluation}]$.
My intuition suggests that the larger and more diverse the ensemble, the better this strategy would perform, assuming the evaluators are all trained properly. However, I suspect a superintelligence could still find a way to exploit this.
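The ensemble idea can be sketched as a mean-minus-variance score. Collapsing the Pareto trade-off into a single `risk_aversion` weight is an assumption made here for brevity, and the plan names and grader scores are invented for illustration.

```python
from statistics import mean, pvariance


def robust_score(evaluations, risk_aversion=1.0):
    """Mean grader evaluation, penalized by disagreement (population variance) among graders."""
    return mean(evaluations) - risk_aversion * pvariance(evaluations)


def select_plan(plans, risk_aversion=1.0):
    """Pick the plan name with the best variance-penalized score across the ensemble."""
    return max(plans, key=lambda name: robust_score(plans[name], risk_aversion))


# Hypothetical evaluations from three graders for two candidate plans.
plans = {
    "steady": [9, 9, 9],               # all graders agree
    "exploit_one_grader": [30, 1, 1],  # one grader is fooled, the others are not
}

if __name__ == "__main__":
    print(select_plan(plans))                     # variance-penalized choice
    print(select_plan(plans, risk_aversion=0.0))  # plain mean-maximizing choice
```

With the penalty, the plan that fools a single grader loses despite its higher mean. As the comment itself suspects, an actor that still optimizes this combined score is still searching for inputs that fool all graders at once.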
I think that the problem is that none of the graders are actually embodying goals. If you align the agent to some ensemble of graders, you’re still building a system which runs computations at cross-purposes, where part of the system (the actor) is trying to trick and part (each individual grader) is trying to not be tricked.
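A toy simulation can illustrate why averaging over graders need not remove the pressure toward exploitation. This is a sketch under strong assumptions that are not from the discussion: every plan has the same true value (zero), all graders share a per-plan blind spot (`shared_error`), and each grader adds small independent noise. It is not the author's argument, just one way the concern could show up.

```python
import random
from statistics import mean


def simulate(num_plans=10_000, num_graders=5, seed=0):
    """Argmax the ensemble mean over many plans with identical true value."""
    rng = random.Random(seed)
    best = {"ensemble_score": float("-inf"), "shared_error": None, "true_value": None}
    for _ in range(num_plans):
        true_value = 0.0  # assumption: no plan is actually better than another
        shared_error = rng.gauss(0.0, 1.0)  # assumption: error common to all graders
        scores = [
            true_value + shared_error + rng.gauss(0.0, 0.3)  # independent grader noise
            for _ in range(num_graders)
        ]
        ensemble_score = mean(scores)
        if ensemble_score > best["ensemble_score"]:
            best = {
                "ensemble_score": ensemble_score,
                "shared_error": shared_error,
                "true_value": true_value,
            }
    return best


result = simulate()
print(result)
```

In this setup the selected plan scores well above its true value, and almost all of the gap comes from the error the graders share. Averaging shrinks the independent noise but leaves the common blind spot for the search to find.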
In this situation, I would look for a way of looking at alignment such that this unnatural problem disappears. A different design pattern must exist, insofar as people are not optimizing for the outputs of little graders in their own heads.
This relates closely to how to “solve” Goodhart problems in general. Multiple metrics / graders make exploitation more complex, but have other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.
I’m probably missing something, but doesn’t this just boil down to “misspecified goals lead to reward hacking”?
Nope! Both “misspecified goals” and “reward hacking” are orthogonal to what I’m pointing at. The design patterns I highlight are broken IMO.
As with the evaluator-child who’s trying to win his mom’s approval by being close to the gym teacher, how would grader exploitation be different from specification gaming / reward hacking? In theory, wouldn’t a perfect grader solve the problem?
One point of this post is that specification gaming, as currently known, is an artifact of certain design patterns, which arise from motivating the agent (inner alignment) to optimize an objective over all possible plans or world states (outer alignment). These design patterns are avoidable, but AFAICT are enforced by common ways of thinking about alignment (e.g. many versions of outer alignment commit to robustly grading the agent on all plans it can consider). One hand (inner alignment) loads the shotgun, and our other hand (outer alignment) points it at our own feet and pulls the trigger.
Yes, in theory. In practice, I think the answer is “no”, for reasons outlined in this post.
Thanks for the explanation!
Did you edit this post? I could have sworn it wasn’t this long, or this clear, earlier on.