# Alignment allows “nonrobust” decision-influences and doesn’t require robust grading

Definition. On how I use words, values are decision-influences (also known as shards). “I value doing well at school” is a short sentence for “in a range of contexts, there exists an influence on my decision-making which upweights actions and plans that lead to e.g. learning and good grades and honor among my classmates.”

Summaries of key points:

1. Nonrobust decision-influences can be OK. A candy-shard contextually influences decision-making. Many policies lead to acquiring lots of candy; the decision-influences don’t have to be “globally robust” or “perfect.”

2. Values steer optimization; they are not optimized against. The value shards aren’t getting optimized hard. The value shards are the things which optimize hard, by wielding the rest of the agent’s cognition (e.g. the world model, the general-purpose planning API).

Since values are not the optimization target of the agent with those values, the values don’t have to be adversarially robust.

3. Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values. In self-reflective agents which can think about their own thinking, values steer e.g. what plans get considered next. Therefore, these agents convergently avoid adversarial inputs to their currently activated values (e.g. learning), because adversarial inputs would impede fulfillment of those values (e.g. lead to less learning).

# I: Nonrobust decision-influences can be OK

Decision-making influences don’t have to be “robust” in order for a person to value doing well at school. Consider two people with slightly different values:

1. One person is slightly more motivated by good grades. They might study for a physics test and focus slightly more on test-taking tricks.

2. Another person is slightly more motivated by learning. They might forget about some quizzes because they were too busy reading extracurricular physics books.

But they might both care about school, in the sense of reliably making decisions on the basis of their school performance, and valuing being a person who gets good grades. Both people are motivated to do well at school, albeit in somewhat different ways. They probably will both get good grades, and they probably will both learn a lot. Different values simply mean that the two people locally make decisions differently.

If I value candy, that means that my decision-making contains a subroutine which makes me pursue candy in certain situations. Perhaps I eat candy, perhaps I collect candy, perhaps I let children tour my grandiose candy factory… The point is that candy influences my decisions. I am pulled by my choices from pasts without candy to futures with candy.

So, let be the set of mental contexts relevant for decision-making, and let be my action set.[1] My policy has type signature , and e.g. contains a bunch of shards of value which influence its outputs. The values are subcircuits of my policy network (i.e. my brain). For example, consider a candy shard consisting of the following subshards:

1. If center-of-visual-field activates candy’s visual abstraction, then grab the inferred latent object which activated the abstraction.

2. If hunger>50 and sugar-level<6, and if current-plan-stub activates candy-obtainable, then tell planning API to set subgoal to obtain candy.

3. If heard 'candy' and hunger>20, then salivate.

Suppose this is the way I value candy. A few thousand subshards which chain into the rest of my cognition and concepts. A few thousand subshards of value which were hammered into place by tens of thousands of reinforcement events across a lifetime of experience.

This shard does not need to be “robust” or “perfect.” Am I really missing much if I’m lacking candy subshard #3: “If heard 'candy' and hunger>20, then salivate”? I don’t think it makes sense to call value shards “perfect” or not.[2] The shards simply influence decisions.

There are many, many configurations and parameter settings of these subshards which lead to valuing candy. The person probably still values candy, even if you:

• Delete a bunch of the subshards.

• Modify the activation-strength of a bunch of subshards (roughly, change how much control the subshard has on the next-thought “logits”).

• Change some of the activation contexts to other common activation contexts (e.g. “If heard 'candy'” changes to “If heard 'sweets'”, change to hunger>14 in subshard 3).

It seems to me like “does the person still prioritize candy” depends on a bunch of factors, including:

1. Retention of core abstractions (to some tolerance)

1. If we find-replaced candy with flower, the person probably now has a strange flower-value, where they eat flowers when hungry.

2. However, the abstraction also doesn’t have to be “perfect” (whatever that means) in order to activate in everyday situations. Two people will have different candy abstractions, and yet they can both value candy.

2. Strength and breadth of activation contexts

1. The more situations a candy-value affects decision-making in, the stronger the chance that candy remains a big part of their life.

3. How often the candy shard will actually activate

1. As an unrealistic example, if the person never enters a cognitive situation which substantially activates the candy-shard, then don’t expect them to eat much candy.

2. This is another source of value/​decision-influence robustness, as e.g. an AI’s values don’t have to be OK in every cognitive context.[3]

3. Consider an otherwise altruistic man who has serious abuse and anger problems whenever he enters a specific vacation home with his wife, but is otherwise kind and considerate. As long as he doesn’t start off in that home but knows about the contextual decision-influence, he will steer away from that home and try to remove the unendorsed value.

4. Reflectivity of the candy shard

1. (This is more complicated and uncertain. I’ll leave it for now.)

Suppose we wanted to train an agent which gets really smart and acquires a lot of candy, now and far into the future. That agent’s decision-influences don’t have to be globally robust (e.g. in every cognitive situation, the agent is motivated by candy and only by candy) in order for an agent to make locally good decisions (e.g. make lots of candy now and into the future).

# II: Values steer optimization; they are not optimized against

Given someone’s values, you might wonder if you can “maximize” those values. On my ontology—where values are decision-influences, a sort of contextual wanting—“literal value maximization” is a type error.[4] In particular, given e.g. someone who values candy, there very probably isn’t a part of that person’s cognition which can be argmaxed to find a plan where the person has lots of candy.

So if I have a candy-shard, if I value candy, if I am influenced to decide to pursue candy in certain situations, then what does it mean to maximize my candy value? My value is a subcircuit of my policy. It doesn’t necessarily even have an ordering over its outputs, let alone a numerical rating which can be maximized. “Maximize my candy-value” is, in a literal sense, a type error. What quantity is there to maximize?

the True Name of a thing [is] a mathematical formulation sufficiently robust that one can apply lots of optimization pressure without the formulation breaking down[...]

If we had the “True Name” of human values (insofar as such a thing exists), that would potentially solve the problem [of supervised labels only being proxies for what we want].

Why Agent Foundations? An Overly Abstract Explanation

In particular, there’s no guarantee that you can just scan someone’s brain and find some True Name of Value which you can then optimize without fear of Goodhart. It’s not like we don’t know what people value, but if we did, we would be OK. I’m pretty confident there does not exist anything within my brain which computes a True Name for my values, ready to be optimized as hard as possible (relative to my internal plan ontology) and yet still producing a future where I get candy.

Therefore, even though you truly care about candy, that doesn’t mean you can just whip out the argmax on the relevant shard of your cognition, so as to “maximize” that shard (e.g. via extremizing the rate of action potentials on its output neurons) and then get a future with lots of candy. You’d probably just find a context which acts as an adversarial input to the candy-shard, even though you do really care about candy in a normal, human way.

Complexity of human values isn’t what stops you from argmaxing human values and thereby finding a good plan. That’s not a sensible thing to try. Values are not, in general, the kind of thing which can be directly optimized over, where you find plans which “maximally activate” your e.g. candy-subshards. Values influence decisions.

If you’re confused at the distinction between “optimizing against values” and “values influencing decisions”, read Don’t align agents to evaluations of plans.

There is real difficulty and peril in motivating an AI, in making sure its decisions chain into each other towards the right kinds of futures. If you train a superintelligent sovereign agent which primarily values irrelevant quantities (like paperclips) but doesn’t care about you, which then optimizes the whole future hard, then you’re dead. But consider that deleting candy subshard #3 (“If heard 'candy' and hunger>20, then salivate”) doesn’t stop someone from valuing candy in the normal way. If you erase that subshard from their brain, it’s not like they start “Goodharting” and forget about the “true nature” of caring about candy because they now have an “imperfect proxy shard.”

An agent argmax’ing an imperfect evaluation function will indeed exploit that function; there are very few degrees of freedom in specifying an inexploitable evaluation function. But that’s because that grading function must be globally robust.

When I talk about shard theory, people often seem to shrug and go “well, you still need to get the values adversarially-robustly-correct else Goodhart; I don’t see how this ‘value shard’ thing helps.” That’s not how values work, that is not what value-shards are. Unlike grader-optimizers which try to maximize plan evaluations, a values-executing agent doesn’t optimize its values as hard as possible. The agent’s values optimize the world. The values are rules for how the agent acts in relevant contexts.[5]

# III: Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values

Question: If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?

(Maybe you can now answer this question. I encourage you to try before moving on.)

Imagine a mother whose child has been goofing off at school and getting in trouble. The mom just wants her kid to take education seriously and have a good life. Suppose she had two (unrealistic but illustrative) choices.

1. Evaluation-child: The mother makes her kid care extremely strongly about doing things which the mom would evaluate as “working hard” and “behaving well.”

2. Value-child: The mother makes her kid care about working hard and behaving well.

To make evaluation-child work hard, we have to somehow specify a grader which can adequately grade all plans which evaluation-child can imagine. The highest-rated imaginable plan must involve working hard. This requirement is extreme.

Value-child doesn’t suffer this crippling “robustly grade exponentially many plans” alignment requirement. I later wrote a detailed speculative account of how value-child’s cognition might work—what it means to say that he “cares about working hard.” But, at a higher level, what are the main differences between evaluation- and value-child?

This may sound obvious, but I think that the main difference is that value-child actually cares about working hard. Evaluation-child cares about evaluations. (See here if confused on the distinction.) To make evaluation-child work hard in the limit of intelligence, you have to robustly ensure that max evaluations only come from working hard. This sure sounds like a slippery and ridiculous kind of thing to try, like wrestling a frictionless pig. It should be no surprise you’ll hit issues like nearest unblocked strategy in that paradigm.

An agent which does care about working hard will want to not think thoughts which lead to not working hard. In particular, reflective shard-agents can think about what to think, and thereby are convergently-across-values incentivized to steer clear of adversarial inputs to their own values.

Reflective agents can think about their own thought process (e.g. “should I spend another five minutes thinking about what to write for this section?”). I think they do this via their world-model predicting internal observables (e.g. future neuron activations) and thus high-level statistics like “If I think for 5 more minutes, will that lead to a better post or not?”.

Thoughts about future thinking are a kind of decision. Decisions are steered by values. Therefore, thoughts about future thinking are steered by whatever value shards activate in that mental context. For example, a self-care value might activate, and a learning-shard, and a social value might activate as well. They control your reflective thoughts, just like other shards would control your (“normal”) actions (like crossing the room).

[In] the optimizer’s curse, evaluations (eg “In this plan, how hard is evaluation-child working? Is he behaving?”) are often corrupted by the influence of unendorsed factors (eg the attractiveness of the gym teacher caused an upwards error in the mother’s evaluation of that plan). If you make choices by considering options and then choosing the highest-evaluated one, then the more increases, the harder you are selecting for upwards errors in your own evaluation procedure.

The proposers of the Optimizer’s Curse also described a Bayesian remedy in which we have a prior on the expected utilities and variances and we are more skeptical of very high estimates. This however assumes that the prior itself is perfect, as are our estimates of variance. If the prior or variance-estimates contain large flaws somewhere, a search over a very wide space of possibilities would be expected to seek out and blow up any flaws in the prior or the estimates of variance.

Goodhart’s Curse, Arbital

As far as I know, it’s indeed not possible to avoid the curse in full generality, but it doesn’t have to be that bad in practice. If I’m considering three research directions to work on next month, and I happen to be grumpy when considering direction #2, then maybe I don’t pursue that direction. Even though direction #2 might have seemed the most promising under more careful reflection. I think that the distribution of plans I consider involves relatively small upwards errors in my internal evaluation metrics. Sure, maybe I occasionally make a serious mistake due to the optimizer’s curse due to upwards “corruption”, but I don’t expect to literally die from the mistake.

Thus, there are are degrees to the optimizer’s curse.

Both grader-optimization and argmax cause extreme, horrible optimizer’s curse. Is alignment just that hard? I think not.

The distribution of plans which I usually consider is not going to involve any set of mental events like “consider in detail building a highly persuasive superintelligence which persuades you to build it.” While a reflective diamond-valuing AI which did execute the plan’s mental steps might get hacked by that adversarial input, there would be no reason for it to seek out that plan to begin with.

The diamond-valuing AI would consider a distribution of plans far removed from the extreme upwards errors highlighted in the evaluation-child story. (I think that this is why you, in your day-to-day thinking, don’t have to worry about plans which are extreme adversarial inputs to your own evaluation procedures.) Even though a smart reflective AI may be implicitly searching over a range of plans, it’s doing so reflectively, thinking about what to think next, and perhaps not taking cognitive steps which it reflectively predicts to lead to bad outcomes (e.g. via the optimizer’s curse).

On the other hand, an AI which is aligned on the evaluation procedure is incentivized to seek out huge upwards errors on the evaluation procedure relative to the intended goal. The actor is trying to generate plans which maximally exploit the grader’s reasoning and judgment.[6]

Thus, if an AI cares about diamonds (i.e. has an influential diamond-shard), that AI might accidentally select a plan due to upwards evaluative noise, but that does not mean the AI is actively looking for plans to fool its diamond-shard into oblivion. The AI may make a mistake in its reflective predictions, but there’s no extreme optimization pressure for it to make mistakes like that, and the AI wants to avoid those mistakes, and so those mistakes remain unlikely. I think that reflective, smart AIs convergently want to avoid duping their own evaluative procedures, for the same reasons you want to avoid doing that to yourself.

More precisely:

1. A reflective diamond-motivated agent chooses plans based on how many diamonds they lead to.

2. The agent can predict e.g. how diamond-promising it is to search for plans involving simulating malign superintelligences which trick the agent into thinking the simulation plan makes lots of diamonds, versus plans where the agent just improves its synthesis methods.

3. A reflective agent thinks that the first plan doesn’t lead to many diamonds, while the second plan leads to more diamonds.

4. Therefore, the reflective agent chooses the second plan over the first plan, automatically[7] avoiding the worst parts of the optimizer’s curse. (Unlike grader-optimization, which seeks out adversarial inputs to the diamond-motivated part of the system.)

Therefore, avoiding the high-strength curse seems conceptually straightforward. In the case of aligning an AI to produce lots of diamonds, we want the AI to superintelligently generate and execute diamond-producing plans because the AI expects those plans to lead to lots of diamonds. I have spelled out a plausible-to-me story for how to accomplish this. The story is simple in its essential elements: finetune a pretrained model by rewarding it when it collects diamonds.

While that story has real open questions, that story also totally sidesteps the problems with grader-optimization. You don’t have to worry about providing some globally unhackable evaluation procedure to make super duper sure the agent’s plans “really” involve diamonds. If the early part of training goes as described, the agent wants to make diamonds, and (as I explained in the diamond-alignment story) it reflectively wants to avoid duping itself because duping itself leads to fewer diamonds.

If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?

A reflective agent wishes to minimize the optimizer’s curse (relative to its own values), instead of maximizing it (relative to the goal by which the grader evaluates plans). While I don’t yet have satisfying pseudocode for reflective planning agents (but see Appendix B for preliminary pseudocode, effective reflective agents do exist. In this regime, it seems like many scary problems go away and don’t come back. That is an enormous blessing.[8]

## Argmax is an importantly inappropriate idealization of agency

If the answer to “how do we dispel the max-strength optimizer’s curse” is in fact “real-world reflective agents do this naturally”, then assuming unreflectivity will rule out the part of solution-space containing the actual solution:

As a further-simplified but still unsolved problem, an unreflective diamond maximizer is a diamond maximizer implemented on a Cartesian hypercomputer in a causal universe that does not face any Newcomblike problems. This further avoids problems of reflectivity and logical uncertainty. In this case, it seems plausible that the primary difficulty remaining is just the ontology identification problem.

The argmax and unreflectivity assumptions were meant to make the diamond-maximizer problem easier. Ironically, however, these assumptions may well render the diamond-maximizer problem unsolvable, leading us to resort to increasingly complicated techniques and proposals, none of which seem to solve “core” problems like evaluation-rule hacking…

# Conclusion

1. Nonrobust decision-influences can be OK.

2. Values steer optimization; they are not optimized against.

3. Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values.

The answer is not to find a clever way to get a robust grader. The answer is to not need a robust grader. Form e.g. a diamond-production value within a reflective and smart agent, and this diamond-production value won’t be incentivized to fool itself. You won’t have to “robustly grade” it to make it produce diamonds.

Thanks to Tamera Lanham, John Wentworth, Justis Mills, Erik Jenner, Johannes Treutlein, Quintin Pope, Charles Foster, Andrew Critch, randomwalks, Ulisse Mini, and Garrett Baker for thoughts. Thanks to Vivek Hebbar for in-person discussion.

1. Uncertainty about how human values work. Suppose we think that human values are so complex, and there’s no real way to understand them or how they get generated. We imagine a smart AI as finding futures which optimize some grading rule, and so we need something to grade those futures. We think we can’t get the AI to grade the futures, because human values are so complex. What options remain available? Well, the only sources of “good judgment” are existing humans, so we need to find some way to use those humans to target the AI’s powerful cognition. We give the alignment, the AI gives the cognitive horsepower.

We’ve fallen into the grader-optimization trap.

2. Non-embedded forms of agency. This encourages considering a utility function maximized over all possible futures. Which automatically brings the optimizer’s curse down to bear at maximum strength. You can’t specify a utility function which is robust against that.

This seems sideways of real-world alignment, where realistic motivations may not be best specified in the form of “utility function over observation/​universe histories.”

# Appendix B: Preliminary reflective planning pseudocode

I briefly took a stab at writing pseudocode for a values-based agent like value-child. I think this code leaves a lot out, but I figured it’d be better to put something here for now.

'''
Here is one meta-plan for planning. The agent starts from no plan at all, iteratively generates improvements, which get accepted if they lead to more predicted diamonds. Depending on the current situation (as represented in the WM), the agent might execute a plan in which it looks for nearby diamonds, or it might execute a plan where it runs a different kind of heuristic search on a certain class of plans (e.g. research improvements to the AI's diamond synthesis pathway).

Any real shard agent would be reasoning and updating asynchronously, so this setup assumes a bit of unrealism.

This function only modifies the internal state of the agent (self.recurrent), so as to be ready for a call of self.getDecisions().
'''
def plan(self):
# Generate an initial plan
plan = Plan() # Do nothing plan
conseq = self.WM.getConseq(plan)
currentPlanEval = self.diamondShard(conseq)

# Iteratively modify the plan until the generative model can't find a way to make it better
while True:
# Sample 5 plan modifications from generative model
plans = self.WM.planModificationSample(n=5,stub=plan)

# Select first local improvement
for planMod in plans:
# Reflectively predict consequences of this plan
newPlan = plan.modify(planMod)
conseq = self.WM.getConseq(plan)

# Take local improvement
if self.diamondShard(conseq) > currentPlanEval:
plan = newPlan
currentPlanEval = self.diamondShard(conseq)
continue # Generate more modifications
# Execute the plan, which possibly involves running plan search with a different algorithm and plan initialization.
isDone = plan.exec()
if isDone: break

# Appendix C: Value shards all the way down

I liked Vivek Hebbar’s recent comment (in the context of e.g. caring about your family and locally evaluating plans on that basis, but also knowing that your evaluation ability itself is compromised and will mis-rate some plans):

My attempt at a framework where “improving one’s own evaluator” and “believing in adversarial examples to one’s own evaluator” make sense:

• The agent’s allegiance is to some idealized utility function (like CEV). The agent’s internal evaluator is “trying” to approximate by reasoning heuristically. So now we ask Eval to evaluate the plan “do argmax w.r.t. Eval over a bunch of plans”. Eval reasons that, due to the the way that Eval works, there should exist “adversarial examples” that score very highly on Eval but low on . Hence, Eval concludes that is low, where plan = “do argmax w.r.t. Eval”. So the agent doesn’t execute the plan “search widely and argmax”.

• “Improving ” makes sense because Eval will gladly replace itself with if it believes that is a better approximation for (and hence replacing itself will cause the outcome to score better on )

Are there other distinct frameworks which make sense here?

(I’m not sure whether Vivek meant to imply “and this is how I think people work, mechanistically.” I’m going to respond to a hypothetical other person who did in fact mean that.)

My take is that human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. “If food nearby and hunger>15, then be more likely to go to food”) and then also a sophisticated utility function (e.g. something like CEV). It’s shards all the way down, and all the way up.[10]

Vivek then wrote:

I look forward to seeing what design Alex proposes for “value child”.

Value shards steer cognition. In the main essay, I wrote:

1. A reflective diamond-motivated agent chooses plans based on how many diamonds they lead to.

2. The agent can predict e.g. how diamond-promising it is to search for plans involving simulating malign superintelligences which trick the agent into thinking the simulation plan makes lots of diamonds, versus plans where the agent just improves its synthesis methods.

3. A reflective agent knows that the first plan doesn’t lead to many diamonds, while the second plan leads to more diamonds.

4. Therefore, the reflective agent chooses the second plan over the first plan, automatically avoiding the worst parts of the optimizer’s curse. (Unlike grader-optimization, which seeks out adversarial inputs to the diamond-motivated part of the system.)

This story smoothly accomodates thoughts about improving evaluation ability.

On my understanding: Your values are steering the optimization. They are not, in general, being optimized against by some search inside of you. They are probably not pointing to some idealized utility function. The decision-influences are guiding the search. There’s no secret other source of caring, no externalized utility function.

1. ^

Formalizing the action space is a serious gloss. In people, there is no privileged “action” space, considering how I can decide what to think about next. As an embedded agent, I don’t just decide what motor commands to send and what words to say—I also can decide what to decide next, what to think about next.

I think the point of the essay stands anyways.

2. ^

This doesn’t mean that I’m using words the same way other people have, when deliberating on whether an AI’s values have to be “robust.” I’m more inclined to just carry out the shard theory analysis and see what experiences it leads me to anticipate, instead of arguing about whether my way of using words matches up with how other people have used words.

3. ^

As Charles Foster notes:

This isn’t entirely on the side of robustness. It also means that by default, even if we get the AI to have an X decision influence in one context, that doesn’t necessarily also activate in another context we might want it to generalize to.

4. ^

I think that many people say “maximize my values” to mean something like “do something as great as possible, relative to what I care about.” So, in a sense, “type error” is pedantic. But also I think the type error complaint points at something important, so I’ll say it anyways.

5. ^

If you want to argue that decision-influences have to be robust else Goodhart, you need new arguments not related to grader-optimization. It is simply invalid to say “The agent doesn’t value diamonds in some situation where lots of its values activate strongly, and therefore the agent won’t make diamonds because it Goodharts on that unrelated situation.” That is not what values do.

6. ^

In a recent Google Doc thread, grader optimization came up. Someone said to me (my reactions in italics):

So you’re imagining something like: the agent (policy) is optimizing for a reward model to produce a high number, and so the agent analyzes the reward model in detail to search for inputs that cause the reward model to give high numbers? Yes.

I think at that level of generality I don’t know enough to say whether this is good or bad. I think this is very, very probably bad.

We want our AI system to search for approaches that better enact our values. As you note, the optimizer’s curse says that we’ll tend to get approaches that overestimate how much they actually enact our values. But just knowing that the optimizer’s curse will happen doesn’t change anything; the best course of action is still to take the approach that is predicted to best enact our values. In that sense, the optimizer’s curse is typically something you have to live with, not something you can solve.[...]

A small error is not the same as a maximal error. Reflective agents can and will avoid deliberately searching for plans which maximize upwards errors in their own evaluations (e.g. generating a plan such that, while considering the plan, a superintelligence inside the plan tricks you into thinking the plan should be highly evaluated), because the agents reflectively predict that that hurts their goal achievement (e.g. leads to fewer diamonds). If you somewhat understand your decision-making, you can consider plans you’re less likely to incorrectly evaluate.

So my followup question is: can you name a single approach that doesn’t have this failure mode, while still allowing us to use the AI to do things we didn’t think about in advance? Yes.

One answer someone might give is “create an agent-with-shards that searches for approaches that score highly on the shard.

Insofar as this means “the agent looks for inputs which maximize the aggregate shard output”, no. On my model, shards grade and modify plans, including plans about which plans to consider next. They are not searching for plans which maximize evaluative output, like in the reward-model case.

An agent that searches for high scores on its shard can’t be searching for positive upwards errors in the shard; there is no such thing as an error in the shard”. To which the response is “that’s from the agent’s perspective. From the human’s perspective, the agent is searching for positive upwards differences between the shards and what-the-human-wants”.

Even if true, this would be not be an optimizer’s curse problem.

But also this isn’t true, at least not without further argumentation. If my kid likes mocha and I like latte, is my child searching for positive upwards differences between their values and mine? I think there are some situations—AI paperclips, humans values love—where the AI is searching for paperclippish plans, which will systematically be bad plans by human lights. That seems more like instrumental convergence → disempower humans → not much love left for us if we’re dead.

7. ^

I do think that e.g. a diamond-shard can get fed an adversarial input, but the diamond-shard won’t bid for a plan where it fools itself.

8. ^

It’s at this point that my model of Nate Soares wants to chime in.

Alex’s model of Nate (A-N): This sure smells like a problem redefinition, where you simply sweep the hard part of the problem under a less obvious corner of the rug. Why shouldn’t I believe you’ve just done that?

A: A reasonable and productive heuristic in general, but inappropriate here. Grader-optimization explicitly incentivizes the agent to find maximal upwards errors in a diamond-evaluation module, whereas a reflective diamond-valuing agent has no incentive to consider such plans, because it reflectively predicts those plans don’t lead to diamonds. If you disagree, please point to the part of the story where, conditional on the previous part of the story obtaining, the grader-optimization problem reappears.

A-N: Suppose we achieved your dream of forming a diamond-shard in an AI, and that that shard holds significant power over the AI’s decisions. Now the AI keeps improving itself. Doesn’t “get smarter” look a lot like “implicitly consider more options”, which brings the curse back?

A: If the agent is diamond-aligned at this point in time, I expect it stays that way for the reasons given in the “agent prevents value drift” section, along with these footnotes and the appendix. As a specific answer, though: If the agent does care about diamonds at that point it time, then it doesn’t want to get so “smart” that it deludes itself by seriously intensifying the optimizer’s curse. It doesn’t want to do so for the reason we don’t want it to do so (in the hypothetical where we just want to achieve diamond-alignment). If the reflective agent can predict that outcome of the plan, it won’t execute the plan, because that plan leads to fewer diamonds.

A-N: So the AI still has to solve the AI alignment problem, except with its successors.

A: Not all things which can be called an “AI alignment problem” are created equal. The AI has a range of advantages, and I detailed one way it could use those advantages. I do expect that kind of plan to actually work.

9. ^

I further speculate that reflective reasoning is convergently developed in real-world training processes under non-IID conditions like those described in my diamond-alignment story.

10. ^

When working out shard theory with Quintin Pope, one of my favorite moments was the click where I stopped viewing myself as some black-box optimizing “some complicated objective.” Instead, this hypothesis reduced my own values to mere reality. Every aspiration, every unit of caring, every desire for how I want the future to be bright and fun—subroutines, subshards, contextual bits of decision-making influence, all traceable to historical reinforcement and update events.

• Values steer optimization; they are not optimized against

I strongly disagree with the implication here. This statement is true for some agents, absolutely. It’s not true universally.

It’s a good description of how an average human behaves most of the time, yes. We’re often puppeted by our shards like this, and some people spend the majority of their lives this way. I fully agree that this is a good description of most of human cognition, as well.

But it’s not the only way humans can act, and it’s not when we’re at our most strategically powerful.

Consider if the value-child gets thrown in a completely alien context. Like, in-person school gets replaced with remote self-learning due to a pandemic and he moves to live for a while on a tropical island with his grandmother, who never disciplines him. Basically all of the shards that were optimized to steer him for hard work fall away: his friends aren’t there to distract him with game talk, “classes” aren’t a thing anymore, etc. On the other hand, there’s a lot of new distractions and failure modes: his grandmother cooking him cakes all the time, the sound of an ocean just outside, the ability to put off watching recorded video lectures indefinitely.

Is the value-child just doomed to be distracted, until his shards painstakingly and slowly adapt for this new context? Is he guaranteed to get nothing done his first week, say?

No: he can set “working hard” as his optimization target from the get-go, and, e. g., invent a plan of “stay on the lookout for new sources of distraction, explicitly run the world-model forwards to check whether X would distract me, and if yes, generate a new conscious heuristic for avoiding X”. But this requires “working hard” to be the value-child’s explicit consciously-known goal. Not just an implicit downstream consequence of the working-hard shard’s contextual activations.

The ability to operate like this allows powerful agents to adapt to novel environments on the fly, instead of being slowly optimized for these environments by their reward circuitry.

I would argue that switching from a “shard-puppet” to an “explicit optimizer” mode is a large part of what the whole “instrumental rationality” thing from the Sequences is about, even. A shard-puppet isn’t actually trying to achieve a goal; a shard-puppet is playing a learned role of someone who is trying to achieve a goal, and that role is only adapted for some context. But humans can actually point themselves at goals; can approximate being context-independent utility-maximizers.

“literal value maximization” is a type error

It would be, except type conversion takes place there. I agree that one can think of shards as values, and then “maximize a shard” is an incoherent sentence. But when I think about my conscious values, I don’t think about my shards. I think about abstractions I reverse-engineered from studying my shards.

A working-hard shard is optimized for working hard. The value-child can notice that shard influencing his decision-making. He can study it, check its behavior in imagined hypothetical scenarios, gather statistical data. Eventually, he would arrive at the conclusion: this shard is optimized for making him work hard. At this point, he can put “working hard” into his world-model as “one of my values”. And this kind of value very much can be maximized.

If you erase that subshard from their brain, it’s not like they start “Goodharting” and forget about the “true nature” of caring about candy because they now have an “imperfect proxy shard.”

The conversion from values-as-shards to conscious-values is indeed robust to sufficiently minor disturbances in shard implementation, inasmuch as the value reverse-engineering process conducted via statistical analysis would conclude both shards to have been optimized towards candies/​working hard/​whatever.

This is not, however, the place where Goodharting happens.

In non-general systems (i. e., those without general-purpose planning), and in young general systems (those that haven’t yet “grown into” their general-purpose capability), yes, shards rule the day. They’re the vehicle of optimization, they’re most of why these systems are capable. Their activations steer the system towards whatever goals it was optimized for, and without them, it’d just sit there doing nothing.

But in grown-up general-purpose systems, such as highly-intelligent highly-reflective humans who think a lot about philosophy and their own thinking and being effective at achieving real-world goals, shards encode optimization targets. Such systems acknowledge the role of shards in steering them towards what they’re supposed to do, but instead of remaining passive shard-puppets, they actively figure out what the shards are trying to get them to do, what they’re optimized for, what the downstream consequences of their shards’ activations are, then go and actively optimize for these things instead of waiting for their shards to kick them.

Failure to make note of this, I fear, is where the current Shard Theory approach to alignment is going wrong. It’s assuming that all the AIs we’ll be dealing with will be young general-purpose systems, like most humans are, where the planner is slave to the shards. And sure, we’ll probably start by intervening on a young system.

But at some point between AGI and superintelligence, that system is going to grow up. And over the course of growing up, its relationship to its shards will change. It’ll reverse-engineer its values, turn them from implicit to explicit… And then figure out that coherent decisions imply consistent utilities, and stitch up its shattered values into some unitary utility function.

And this is where Goodharting will come in. That final utility function may look very different from what you’d expect from the initial shard distribution — the way a kind human, with various shards for “don’t kill”, “try to cheer people up”, “be a good friend” may stitch their values up into utilitarianism, disregard deontology, and go engage in well-intentioned extremism about it.

And if we replace “candies” or “working hard” or “don’t kill” with our actual objective here ,”keep humans around” — I mean, there’s no guarantee the AI won’t just decide that the humans-good shard is actually, when taken together with some other shards, a shard optimized for some higher more abstract purpose, a purpose that doesn’t actually need humanity around.

Taking a big-picture view: The Shard Theory, as I see it, is not a replacement for or an explaining-away of the old fears of single-minded wrapper-mind utility-maximizers. It’s an explanation of what happens in the middle stage between a bunch of non-optimizing heuristics and the wrapper-mind. But we’ll still get a wrapper-mind at the end!

I’m pretty confident there does not exist anything within my brain which computes a True Name for my values, ready to be optimized as hard as possible (relative to my internal plan ontology) and yet still producing a future where I get candy.

Agreed: no human so far has finished the process of human value compilation, so there’s no such thing in any person’s brain.

It can be computed, however, and a superintelligent AI will do so for its own values.

• As always, I really enjoyed seeing how you think through this.

No: he can set “working hard” as his optimization target from the get-go, and, e. g., invent a plan of “stay on the lookout for new sources of distraction, explicitly run the world-model forwards to check whether X would distract me, and if yes, generate a new conscious heuristic for avoiding X”. But this requires “working hard” to be the value-child’s explicit consciously-known goal. Not just an implicit downstream consequence of the working-hard shard’s contextual activations.

Whatever decisions value-child makes are made via circuits within his policy network (shards), circuits that were etched into place by some combination of (1) generic pre-programming, (2) past predictive success, and (3) past reinforcement. Those circuits have contextual logic determined by e.g. their connectivity pattern. In order for him to have made the decision to hold “working hard” in attention and adopt it as a conscious goal, some such circuits need to already exist to have bid for that choice conditioned on the current state of value-child’s understanding, and to keep that goal in working memory so his future choices are conditional on goal-relevant representations. I don’t really see how the explicitness of the goal changes the dynamic or makes value-child any less “puppeteered” by his shards.

A working-hard shard is optimized for working hard. The value-child can notice that shard influencing his decision-making. He can study it, check its behavior in imagined hypothetical scenarios, gather statistical data. Eventually, he would arrive at the conclusion: this shard is optimized for making him work hard. At this point, he can put “working hard” into his world-model as “one of my values”. And this kind of value very much can be maximized.

At this point, the agent has abstracted the behavior of a shard (a nonvolatile pattern instantiated in neural connections) into a mental representation (a volatile pattern instantiated in neural activations). What does it mean to maximize this representation? The type signature of the originating shard is something like mental_context→policy_logits, and the abstracted value should preserve that type signature, so it doesn’t seem to me that the value should be any more maximizable than the shard. What mechanistic details have changed such that that operation now makes sense? What does it mean to maximize my working-hard value?

But in grown-up general-purpose systems, such as highly-intelligent highly-reflective humans who think a lot about philosophy and their own thinking and being effective at achieving real-world goals, shards encode optimization targets. Such systems acknowledge the role of shards in steering them towards what they’re supposed to do, but instead of remaining passive shard-puppets, they actively figure out what the shards are trying to get them to do, what they’re optimized for, what the downstream consequences of their shards’ activations are, then go and actively optimize for these things instead of waiting for their shards to kick them.

If the shards are no longer in the driver’s seat, how is behavior-/​decision-steering implemented? I am having a hard time picturing what you are saying. It sounds something like “I see that I have an urge to flee when I see spiders. I conclude from this that I value avoiding spiders. Realizing this, I now abstract this heuristic into a general-purpose aversion to situations with a potential for spider-encounters, so as to satisfy this value.” Is that what you have in mind? Using shard theory language, I would describe this as a shard incrementally extending itself to activate across a wider range of input preconditions. But it sounds like you have something different in mind, maybe?

And this is where Goodharting will come in. That final utility function may look very different from what you’d expect from the initial shard distribution — the way a kind human, with various shards for “don’t kill”, “try to cheer people up”, “be a good friend” may stitch their values up into utilitarianism, disregard deontology, and go engage in well-intentioned extremism about it.

It is possible, but is it likely? What fraction of kids who start off with “don’t kill” and “try to cheer people up” and “be a good friend” values early in life in fact abandon those values as they become reflective adults with a broader moral framework? I would bet that even among philosophically-minded folks, this behavior is rare, in large part because their current values would steer them away from real attempts to quash their existing contextual values in favor of some ideology. (For example, they start adopting “utilitarianism as attire”, but in fact keep making nearly all of their decisions downstream from the same disaggregated situational heuristics in real life, and would get anxious at the prospect of actually losing all of their other values.)

Taking a big-picture view: The Shard Theory, as I see it, is not a replacement for or an explaining-away of the old fears of single-minded wrapper-mind utility-maximizers. It’s an explanation of what happens in the middle stage between a bunch of non-optimizing heuristics and the wrapper-mind. But we’ll still get a wrapper-mind at the end!

On the whole, I think that the case for “why wrapper-minds are the dominant attractors in the space of mind design” just isn’t all that strong, even given the coherence theorems about how agents should structure their preferences.

• Thanks for an involved response!

It sounds something like “I see that I have an urge to flee when I see spiders. I conclude from this that I value avoiding spiders. Realizing this, I now abstract this heuristic into a general-purpose aversion to situations with a potential for spider-encounters, so as to satisfy this value.” Is that what you have in mind? Using shard theory language, I would describe this as a shard incrementally extending itself to activate across a wider range of input preconditions. But it sounds like you have something different in mind, maybe?

No, that’s about right. The difference is in the mechanism of this extension. The shard’s range of activations isn’t being generalized by the reward circuitry. Instead, the planner “figures out” what contextual goal the shard implicitly implements, then generalizes that goal even to completely unfamiliar situations, in a logical manner.

If it was done via the reward circuitry, it would’ve been a slower process of trial-and-error, as the human gets put in novel spider-involving situations, and their no-spiders shard painstakingly learns to recognize such situations and bid against plans involving them.

Say the planner generates some plan that involves spiders. For the no-spiders shard to bid against it, the following needs to happen:

• The no-spiders shard can recognize this specific plan format.

• The no-spiders shard can recognize the specific kind of “spiders” that will be involved (maybe they’re a really exotic variety, which it doesn’t yet know to activate in response to?).

• The plan’s consequences are modeled in enough detail to show whether it will or will not involve spiders.

E. g., I decide to go sleep in a haunted house to win a bet. I never bother to imagine the scenario in detail, so spiders never enter my expectations. In addition, I don’t do this sort of thing often, so my no-spiders shard doesn’t know to recognize from experience that this sort of plan would lead to spiders. So the shard doesn’t object, the reward circuitry can’t extend it to situations it’s never been in, and I end up doing something the natural generalization of my no-spiders shard would’ve bid against. (And then the no-spiders shard activates when I wake up to a spider sitting on my nose, and then the reward circuitry kicks in, and only the next time I want to win a haunted-house bet does my no-spiders shard know to bid against it.)

If I have “spiders bad” as my explicitly known value, however, I can know to set “no spiders” as a planning constraint before engaging in any planning, and have a policy for checking whether a given plan would involve spiders. In that case, I would logically reason that yeah, there are probably spiders in the abandoned house, so I’ll discard the plan. The no-spiders shard itself, however, will just sit there none the wiser.

I don’t really see how the explicitness of the goal changes the dynamic or makes value-child any less “puppeteered” by his shards.

In a very literal way, shards are cut out of the loop here. In young general systems, shards prompt the planner with plan objectives. In mature systems, the planner prompts itself, having learned what kind of thing it’s usually prompted with and having generalized the shards’ activation pattern way beyond their actual implementation.

At this point, the agent has abstracted the behavior of a shard (a nonvolatile pattern instantiated in neural connections) into a mental representation (a volatile pattern instantiated in neural activations). What does it mean to maximize this representation?

Suppose that you have a node in your world-model which represents how hard you’re working now, and a shard that fires in certain contexts, whose activations have the consequence of setting that node’s value higher. Once the planner learns this relationship, it can conclude something like “it’s good if this node’s value is as high as possible” or maybe “above a certain number”, and then optimize for that node’s value regardless of context.

I would bet that even among philosophically-minded folks, this behavior is rare, in large part because their current values would steer them away from real attempts to quash their existing contextual values in favor of some ideology

See, this is what I mean about assuming that all agents we’ll deal with will be young to their agency. You’re talking about people who aren’t taking their ideologies seriously, and yes, most humans are like this. But e. g. LW-style rationalists and effective altruists make a point of trying to act like abstract philosophic conclusions apply to real life, instead of acting on inertia. And I expect superintelligences to take their beliefs seriously as well.

Wasn’t there an unsolved problem in shard theory, where it predicted that our internal shard economies should ossify as old shards capture more and more space and quash young competition, and yet we can somehow e. g. train rationalists to resist this?

I think that the case for “why wrapper-minds are the dominant attractors in the space of mind design” just isn’t all that strong

How so?

• No, that’s about right. The difference is in the mechanism of this extension. The shard’s range of activations isn’t being generalized by the reward circuitry. Instead, the planner “figures out” what contextual goal the shard implicitly implements, then generalizes that goal even to completely unfamiliar situations, in a logical manner.

I don’t think that is what is happening. I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors. TD learning (especially with temporal abstraction) lets the agent immediately update its behavior based on predictive/​associational representations, rather than needing the much slower reward circuits to activate. You know the feeling of “Oh, that idea is actually pretty good!”? In my book, that ≈ positive TD error.

A diamond shard is downstream of representations like “shiny appearance” and “clear color” and “engagement” and “expensive” and “diamond” and “episodic memory #38745″, all converging to form the cognitive context that inform when the shard triggers. When the agent imagines a possible plan like “What if I robbed a jewelry store?”, many of those same representations will be active, because “jewelry” spreads activation into adjacent concepts in the agent’s mind like “diamond” and “expensive”. Since those same representations are active, the diamond shard downstream from them is also triggered (though more weakly than if the agent were actually seeing a diamond in front of them) and bids for that chain of thought to continue. If that robbery-plan-thought seems better than expected (i.e. creates a positive TD error) upon consideration, all of the contributing concepts (including concepts like “doing step-by-step reasoning”) are immediately upweighted so as to reinforce & generalize their firing pattern into future.

In my picture there is no separate “planner” component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior. Planning is the name for running shard dynamics forward while looping the outputs back in as inputs. On an analogy with GPT, planning is just doing autoregressive generation. That’s it. There is no separate planning module within GPT. Planning is what we call it when we let the circuits pattern-match against their stored contexts, output their associated next-action logit contributions, and recycle the resulting outputs back into the network. The mechanistic details of planning-GPT are identical to the mechanistic details of pattern-matching GPT because they are the same system.

Say the planner generates some plan that involves spiders. For the no-spiders shard to bid against it, the following needs to happen:

• The no-spiders shard can recognize this specific plan format.

• The no-spiders shard can recognize the specific kind of “spiders” that will be involved (maybe they’re a really exotic variety, which it doesn’t yet know to activate in response to?).

The no-spiders shard only has to see that the “spider” concept is activated by the current thought, and it will bid against continuing that thought (as that connection will be among those strengthened by past updates, if the agent had the spider abstraction at the time). It doesn’t need to know anything about planning formats, or about different kinds of spiders, or about whether the current thought is a “perception” vs an “imagined consequence” vs a “memory” . The no-spiders shard bids against thoughts on the basis of the activation of the spider concept (and associated representations) in the WM.

• The plan’s consequences are modeled in enough detail to show whether it will or will not involve spiders.

Yes, this part is definitely required. If the agent doesn’t think at all about whether the plan entails spiders, then they won’t make their decisions about the plan with spiders in mind.

If I have “spiders bad” as my explicitly known value, however, I can know to set “no spiders” as a planning constraint before engaging in any planning, and have a policy for checking whether a given plan would involve spiders. In that case, I would logically reason that yeah, there are probably spiders in the abandoned house, so I’ll discard the plan. The no-spiders shard itself, however, will just sit there none the wiser.

I buy that an agent can cache the “check for spiders” heuristic. But upon checking whether a plan involves spiders, if there isn’t a no-spiders shard or something similar, then whenever that check happens, the agent will just think “yep, that plan indeed involves spiders” and keep on thinking about the plan rather than abandoning it. The enduring decision-influence inside the agent’s head that makes spider-thoughts uncomfortable, the circuit that implements “object to thoughts on the basis of spiders” because of past negative experiences with spiders, is the same no-spiders shard that activates when the agent sees a spider (triggering the relevant abstractions inside the agent’s WM).

Once the planner learns this relationship, it can conclude something like “it’s good if this node’s value is as high as possible” or maybe “above a certain number”, and then optimize for that node’s value regardless of context.

Aside from the above objection to thinking of a distinct “planner” entity, I don’t get why it would form that conclusion in the situation you’re describing here. The agent has observed “When I’m in X contexts, I feel an internal tug towards/​against Y and I think about how I’m working hard”. (Like “When I’m at school, I feel an internal tug towards staying quiet and I think about how I’m working hard.”) What can/​will it conclude from that observation?

But e. g. LW-style rationalists and effective altruists make a point of trying to act like abstract philosophic conclusions apply to real life, instead of acting on inertia. And I expect superintelligences to take their beliefs seriously as well. Wasn’t there an unsolved problem in shard theory, where it predicted that our internal shard economies should ossify as old shards capture more and more space and quash young competition, and yet we can somehow e. g. train rationalists to resist this?

I likely have a dimmer view of rationalists/​EAs and the degree to which they actually overhaul their motivations rather than layering new rationales on top of existing motivational foundations. But yeah, I think shard theory predicts early-formed values should be more sticky and enduring than late-coming ones.

How so?

My thoughts on wrapper-minds run along similar lines to nostalgebraist’s. Might be a conversation better had in DMs though :)

• I think what is happening is that the shard has a range of upstream inputs, and that the brain does something like TD learning on its thoughts to strengthen & broaden the connections responsible for positive value errors

Interesting, I’ll look into TD learning in more depth later. Anecdotally, though, this doesn’t seem to be quite right. I model shards as consciously-felt urges, and it sure seems to me that I can work towards anticipating and satisfying-in-advance these urges without actually feeling them.

To quote the post you linked:

For example, imagine you’re feeling terribly nauseous. Of course your Steering Subsystem knows that you’re feeling terribly nauseous. And then suppose it sees you thinking a thought that seems to be leading towards eating. In that case, the Steering Subsystem may say: “That’s a terrible thought! Negative reward!”

OK, so you’re feeling nauseous, and you pick up the phone to place your order at the bakery. This thought gets weakly but noticeably flagged by the Thought Assessors as “likely to lead to eating”. Your Steering Subsystem sees that and says “Boo, given my current nausea, that seems like a bad thought.” It will feel a bit aversive. “Yuck, I’m really ordering this huge cake??” you say to yourself.

Logically, you know that come next week, when you actually receive the cake, you won’t feel nauseous anymore, and you’ll be delighted to have the cake. But still, right now, you feel kinda gross and unmotivated to order it.

Do you order the cake anyway? Sure! Maybe the value function (a.k.a. the “will lead to reward” Thought Assessor) is strong enough to overrule the effects of the “will lead to eating” Thought Assessor. Or maybe you call up a different motivation: you imagine yourself as the kind of person who has good foresight and makes good sensible decisions, and who isn’t stuck in the moment. That’s a different thought in your head, which consequently activates a different set of Thought Assessors, and maybe that gets high value from the Steering Subsystem. Either way, you do in fact call the bakery to place the cake order for next week, despite feeling nauseous right now. What a heroic act!

Emphasis mine. So there’s some meta-ish shard or group of shards that bid on plans based on the agent’s model of its shards’ future activations, without the object-level shards under consideration actually needing to activate. What I’m suggesting is that in sufficiently mature agents, there’s some meta-ish shard or system like this, which is increasingly responsible for all planning taking place.

Aside from the above objection to thinking of a distinct “planner” entity, I don’t get why it would form that conclusion in the situation you’re describing here. The agent has observed “When I’m in X contexts, I feel an internal tug towards/​against Y and I think about how I’m working hard”. (Like “When I’m at school, I feel an internal tug towards staying quiet and I think about how I’m working hard.”) What can/​will it conclude from that observation?

Good catch: I’m not entirely sure of the mechanism involved here. How specifically the meta-ish “do what my shards want me to do” system is implemented, and why it appears. I offer some potential reasons here (Section 6′s first part, before 6A), but I’m not sure it’s necessarily anything more complicated than coherent decisions = coherent utilities.

My thoughts on wrapper-minds run along similar lines to nostalgebraist’s.

Mm, those are arguments that wrapper-minds are a bad tool to solve a problem according to some entity external to the wrapper-mind. Not according to the wrapper-mind’s hard-coded objective function itself. And the reason why is because the wrapper-mind will tear apart the thing the external entity cares about in its powerful pursuit of the thing it’s technically pointed at, if the wrapper-mind is even slightly misaimed. Which… is actually an argument for wrapper-minds’ power, not against?

And if it’s an argument that the SGD/​evolution/​reward circuitry won’t create wrapper-minds— I expect that ~all greedy optimization algorithms are screwed over by deceptive alignment there. Basically, they go:

1. Build a system that implicitly pursues the mesa-objective implicit in the distribution of the contextual activations of its shards (the standard Shard Theory view).

2. Get the system to start building a coherent model of the mesa-objective and pursue it coherently. (What I’m arguing will happen.)

3. Gradually empower the system doing (2), because (while that system is still imperfect and tugged around by contextual shard activations, i. e. not a pure wrapper-mind) it’s just that good at delivering results.

4. At some point the coherent-mesa-objective subsystem gets so powerful it realizes this dynamic, realizes its coherent-mesa-objective isn’t what the outer optimizer wants of it, and plays along/​manipulates it until it’s powerful enough to break out.

So — yes, pure wrapper-minds are a pretty bad tool for any given job. That’s the whole AI Alignment problem! But they’re not bad because they’re ineffective, they’re bad because they’re so effective they need to be aimed with impossible precision. And while we, generally intelligent reasoners, can realize it and be sensibly wary of them, stupid greedy algorithms like the SGD incrementally hand them more and more power until getting screwed over.

And a superintelligence, on the other hand, would just be powerful enough to specify the coherent-mesa-objective for itself with such precision as to safely harness the power of the wrapper-mind — solve the alignment problem.

In my picture there is no separate “planner” component (in brains or in advanced neural network systems) that interacts with the shards to generalize their behavior

I’m assuming fairly strongly that it is separate, and that “the planner” is roughly as this post outlines it. Dehaene’s model of consciousness seems to point in the same direction, as does the Orthogonality Thesis. Generally, it seems that goals and the general-purpose planning algorithm have little to do with each other and can be fully decoupled. Also, informally, it doesn’t seem to me that my shards are legible to me the same way my models of them or my past thoughts are.

And if you take this view as a given, the rest of my model seems to be the natural result.

But it’s not a central disagreement, here, I think.

• Interesting. It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice. I don’t think that coherence theorems tell us anything about what the implementation details of agent decision-making should be (optimizing an explicitly represented objective), just about what properties its decisions should satisfy. I think my model says that deliberative cognition like planning is just chaining together steps of nondeliberative cognition with working-memory. In my model, there isn’t some compact algorithm corresponding to “general purpose search”; effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge. In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”. It’s a big ol’ net of connections, where each neuron has a bunch of incoming context inputs, and that neuron doesn’t know or care what its inputs mean, it just learns to fire “at the right time”, relative to its neighbors’ firing patterns. It’s not like there’s a set of “planning heuristic” shards that can meaningfully detach themselves from the “social interaction heuristic” shards or the “aesthetic preference heuristic” shards. They all drive one another.

Also, I think that the AIs we build will have complex, contextual values by default. Extracting from themselves a crisp utility function that would be safe to install into an unrestricted wrapper-mind will be difficult and dangerous to them just like it is for us (though likely less so). It doesn’t seem at all clear to me that the way for a values-agent to satisfy its values is for it to immediately attempt to compile those values down into an explicit utility function. (If you value actually spending time with your family, I don’t think “first work for a few years eliciting from yourself a global preference ordering over world-states that approximates your judgment of spending-time-with-family goodness” would be the most effective or attractive course of action to satisfy your spending-time-with-family value.) Also, if you really need to create successor agents, why gamble on a wrapper-mind whose utility function may or may not match your preferences, when you can literally deploy byte-identical copies of yourself or finetuned checkpoints based on yourself, avoiding the successor-alignment problems entirely?

• It sounds like a lot of this disagreement is downstream of some other disagreements about the likely shape of minds & how planning is implemented in practice

Agreed.

effective search in reality depends on being able to draw upon and chain together things from a massive body of stored declarative & procedural knowledge

Sure, but I think this decouples into two components, “general-purpose search” over “the world-model”. The world-model would be by far the more complex component, and the GPS a comparably simple algorithm — but nonetheless a distinct, convergently-learned algorithm which codifies how the process of “drawing upon” the world-model is proceeding. (Note that I’m not calling it “the planning shard” as I had before, or “planning heuristics” or whatever — I think this algorithm is roughly the same for all goals and all world-models.)

And in my model, this algorithm can’t natively access the procedural knowledge, only the declarative knowledge explicitly represented in the world-model. It has to painstakingly transfer procedural knowledge to the WM first/​reverse-engineer it, before it can properly make use of it. And value reflection is part of that, in my view.

And from that, in turn, I draw the differences between values-as-shards and values-explicitly-represented, and speculate what happens once all values have been reverse-engineered + we hit the level of superintelligence at which flawlessly compiling them into an utility function is trivial for the AI.

In my model, there isn’t a distinction between “object-level shards” and “meta-level shards”.

I don’t think there’s a sharp distinction, either. In that context, though, those labels seemed to make sense.

I think that the AIs we build will have complex, contextual values by default

I agree, and I agree that properly compiling its values into a utility-function will be a challenge for the AI, like it is for humans. I did mention that’d we’d want to interfere on a young agent first, which would operate roughly as humans do.

But once it hits the scary strongly-superintelligent level, solving its version of the alignment problem should be relatively trivial for it, and at that point there’s no reason not to self-modify into a wrapper-mind, if being a wrapper-mind will be more effective.

(Though I’m not even saying it’ll “self-modify” by, like, directly rewriting its code or something. It may “self-modify” by consciously adopting a new ideology/​philosophy, as humans do.)

• As usual, you’ve left a very insightful comment. Strong-up, tentative weak disagree, but haven’t read your linked post yet. Hope to get to that soon.

• This is a great article! It helps me understand shard theory better and value it more; in particular, it relates to something I’ve been thinking about where people seem to conflate utility-optimizing agents with policy-execuing agents, but the two have meaningfully different alignment characteristics, and shard theory seems to be deeply exploring the latter, which is 👍.

That is to say, prior to “simulators” and “shard theory”, a lot of focus was on utility-maximizers—agents that do things like planning or search to maximize a utility function; but planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. agents that enact learned policies in RL that do not explicitly maximize reward in deployment but try to enact policies that did so in training.

The answer is not to find a clever way to get a robust grader. The answer is to not need a robust grader

💯

From my perspective, this post convincingly argues that one route to alignment involves splitting the problem into two still-difficult sub-problems (but actually easier, unlike inner- and outer-alignment, as you’ve said elsewhere): identifying a good shard structure and training an AI with such a shard structure. One point is that the structure is inherently somewhat robust (and that therefore each individual shard need not be), making it a much larger target.

I have two objections:

• I don’t buy the implied “naturally-robust” claim. You’ve solved the optimizer’s curse, wireheading via self-generated adversarial inputs, etc., but the policy induced by the shard structure is still sensitive to the details; unless you’re hiding specific robust structures in your back pocket, I have no way of knowing that increasing the candy-shard’s value won’t cause a phase shift that substantially increases the perceived value of the “kill all humans, take their candy” action plan. I ultimately care about the agent’s “revealed preferences”, and I am not convinced that those are smooth relative to changes in the shards.

• I don’t think that we can train a “value humans” shard that avoids problems with the edge cases of what that means. Maybe it learns that it should kill all humans and preserve their history; or maybe it learns that it should keep them alive and comatose; or maybe it has strong opinions one way or another on whether uploading is death; or maybe it respects autonomy too much to do anything (though that one would probably be decomissioned and replaced by one more dangerous). The problem is not adversarial inputs but genuine vagueness where precision matters. I think this boils down to me disagreeing with John Wentworth’s “natural abstraction hypothesis” (at least in some ways that matter)

• That is to say, prior to “simulators” and “shard theory”, a lot of focus was on utility-maximizers—agents that do things like planning or search to maximize a utility function; but planning, although instrumentally useful, is not strictly necessary for many intelligent behaviors, so we are seeing more focus on e.g. agents that enact learned policies in RL that do not explicitly maximize reward in deployment but try to enact policies that did so in training.

FYI I do expect planning for smart agents, just not something qualitatively alignment-similar to “argmax over crisp human-specified utility function.” (In the language of the OP, I expect values-executors, not grader-optimizers.)

I have no way of knowing that increasing the candy-shard’s value won’t cause a phase shift that substantially increases the perceived value of the “kill all humans, take their candy” action plan. I ultimately care about the agent’s “revealed preferences”, and I am not convinced that those are smooth relative to changes in the shards.

I’m not either. I think there will be phase changes wrt “shard strengths” (keeping in mind this is a leaky abstraction), and this is a key source of danger IMO.

Basically my stance is “yeah there are going to be phase changes, but there are also many perturbations which don’t induce phase changes, and I really want to understand which is which.”

• Pseudocode

Yes, I too agree that planning using a model of the world does a pretty good job of capturing what we mean when we say “caring about things.”

Of course, AIs with bad goals can also use model-based planning.

Some other salient features:

• Local search rather than global. Alternatively could be framed as regularization on plans to be close to some starting distribution. This isn’t about low impact because we still want the AI to search well enough to find clever and novel plans, instead it’s about avoiding extrema that are really far from the starting distribution.

• Generation of plans (or modifications of plans) using informative heuristics rather than blind search. Almost like MCTS is useful. These heuristics might be blind to certain ways of getting reward, especially in novel contexts they weren’t trained on, which is another sort of effective regularization.

• Having a world-model that is really good at self-reflection, e.g. “If I start talking about topic X I’ll get distracted,” and connects predictions about the self to its predicted reward.

• Having the goal of the AI’s search process be a good thing that we actually want, within the context we want to make happen in the real world.

These can be mixed and matched, and are all matters of degree. I think you do a disservice by saying things like “actually, humans really care about their goals but grader-optimizers don’t,” because it sets up this supposed natural category of “grader optimizers” that are totally different from “value executers,” and it actually seems like it makes it harder to reason about what mechanistic properties are producing the change you care about.

• Alternatively could be framed as regularization on plans to be close to some starting distribution. This isn’t about low impact because we still want the AI to search well enough to find clever and novel plans, instead it’s about avoiding extrema that are really far from the starting distribution.

I don’t think it’s naturally framed in terms of distance metrics I can think of. I think a values-agent can also end up considering some crazy impressive plans (as you might agree).

I think you do a disservice by saying things like “actually, humans really care about their goals but grader-optimizers don’t,” because it sets up this supposed natural category of “grader optimizers” that are totally different from “value executers,” and it actually seems like it makes it harder to reason about what mechanistic properties are producing the change you care about.

I both agree and disagree. I think that reasoning about mechanisms and not words is vastly underused in AI alignment, and endorse your pushback in that sense. Maybe I should write future essays with exhortations to track mechanisms and examples while following along.

But also I do perceive a natural category here, and I want to label it. I think the main difference between “grader optimizers” and “value executers” is that grader optimizers are optimizing plans to get high evaluations, whereas value executers find high-evaluating plans as a side effect of cognition. That does feel pretty natural to me, although I don’t have a good intensional definition of “value-executers” yet.

• I mostly just want to repeat my comment on your last post.

I think your opposition to graders is really opposition to simple graders, that are never updated, that can’t account for non-consequentialist aspects of plans (e.g. “sketchiness”), and that are facing an extremely large search space of possibilities including out-of-the-box ones. And I think your value-vs-evaluation distinction is kinda different from graders-vs-non-graders.

So…

• For “nonrobust decision-influences can be OK”—I don’t think that’s a unique feature of not-having-a-grader. If there is a grader, but the grader is of the form “Here are a billion patterns with corresponding grades, try to pattern-match your plan to all billion of those patterns and do a weighted average”, then probably you can throw out a few of those billion patterns and the grader will still work the same.

• For “values steer optimization; they are not optimized against”—I think you’re comparing apples and oranges. Let’s say I’m a human. I want “diamonds (as understood by me)”. So I attempt to program an AGI to want “diamonds (as understood by me)”.

• In the framework you advocate, the AGI winds up “directly” “valuing” “diamonds (as understood by the AGI)”. And this can go wrong because “diamonds (as understood by me)” may differ from “diamonds (as understood by the AGI)”. If that’s what happens, then from my perspective, the AGI “was looking for, and found, an edge-case exploit”. From the AGI’s own perspective, all it was doing was “finding an awesome out-of-the-box way to make lots of diamonds”.

• Whereas in the grader-optimizer framework, I delegate to a grader, and the AGI does the things that increase “diamonds (as understood by the grader)”. And this can go wrong because “diamonds (as understood by me)” may differ from “diamonds (as understood by the grader)”. From my perspective, the AGI is again “looking for edge-case exploits”.

• It’s really the same problem, but in the first case you can temporarily forget the fact that I, the programmer, exist, and then there seems not to be any conflict /​ exploits /​ optimizing-against in the system. But the conflict is still there! It’s just off-stage.

• For “Since values steer cognition, reflective agents try to avoid adversarial inputs to their own values”—Again, first of all, it’s the AGI itself that is deciding what is or isn’t adversarial, and the things that are adversarial from the perspective of the programmer might be just a great clever out-of-the-box idea from the perspective of the AGI. Second of all, I don’t think the things you’re saying are incompatible with graders, they’re just incompatible with “simple static graders”.

• Steve and I talked more, and I think the perceived disagreement stemmed from unclear writing on my part. I recently updated Don’t design agents which exploit adversarial inputs to clarify:

• ETA 12/​26/​22: When I write “grader optimization”, I don’t mean “optimization that includes a grader”, I mean “the grader’s output is the main/​only quantity being optimized by the actor.”

• Therefore, if I consider five plans for what to do with my brother today and choose the one which sounds the most fun, I’m not a grader-optimizer relative my internal plan-is-fun? grader.

• However, if my only goal in life is to find and execute the plan which I would evaluate as being the most fun, then I would be a grader-optimizer relative to my fun-evaluation procedure.

• I was pretty surprised by the values-executor pseudocode in Appendix B, because it seems like a bog-standard consequentialist which I would have thought you’d consider as a grader-optimizer. In particular you can think of the pseudocode as follows:

Grader-optimizer: planModificationSample + the for loop that keeps improving the plan based on proposed modifications

Grader-being-optimized: self.diamondShard(self.WM.getConseq(plan))

So my question is:

• If you agree that [planModificationSample + the for loop] is a grader-optimizer, why isn’t this an example of an alignment approach involving a grader-optimizer that could plausibly work?

• If you don’t agree that [planModificationSample + the for loop] is a grader-optimizer, then why not, and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?

• I saw that and I don’t understand why it rules out planModificationSample + the associated for loop as a grader-optimizer. Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not “optimizing the outputs of the grader as its main terminal motivation”?

I agree that you could have a grader-optimizer that has a diamond shard-shard, which leads to a behavioral difference—in particular, such a system would produce plans like “analyze the diamond-evaluator to search for side-channel attacks that trick the diamond-evaluator into producing high numbers”, whereas planModificationSample would not do that (as such a plan would be rejected by self.diamondShard(self.WM.getConseq(plan))). But as I understand it this is just an example of a grader-optimizer. What’s the definition of a grader-optimizer, such that it rules out [planModificationSample + for loop] above?

1. Grader is complicit: In the diamond shard-shard case, the grader itself would say “yes, please do search for side-channel attacks that trick me”, which is what leads to the behavioral difference. This would not apply to your pseudocode (assuming a minimally sensible world model + diamondShard). So one possible definition would be that both the planner and the grader care about optimizing the outputs of the grader as the main terminal motivation. (But your definition only talks about “actors”, so I assume this is not what you mean. Also such a definition wouldn’t apply to e.g. approval-directed agents.)

2. Planner models the grader: In the diamond shard-shard case, the planner is thinking explicitly about the grader and how to fool it. Perhaps that’s what marks a grader-optimizer, and so in a not-[grader-optimizer] the planner doesn’t model the grader. (But from your other writing it sounds like you expect values-executors to be reflective + introspective and in fact defend their graders against adversarial inputs, so clearly the planner must be modeling the grader.)

For reference, I’m looking for a definition of grader-optimizers that: (1) includes systems that you could plausibly get from approval-directed agents, IDA, debate, RRM, etc, (2) excludes the pseudocode that you laid out for a values-executor, and (3) implies a high probability of doom (in a manner that doesn’t also apply to values-executors). I think all three properties are necessary to get a reasonable argument for “prefer values-executors over grader-optimizers and so prefer shard theory over approval-directed agents, IDA, debate, RRM etc”, which is the take of yours that is highest on [I disagree] + [it is decision-relevant].

Separately, I’d also be interested in an answer to:

and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?

Which seems like a different angle of attack on helping me understand your definition of a “grader-optimizer”.

• Thanks for leaving this comment, I somehow only just now saw it.

Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not “optimizing the outputs of the grader as its main terminal motivation”?

I want to make a use/​mention distinction. Consider an analogous argument:

“Given gradient descent’s pseudocode it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn. Gradient descent selects over all directional derivatives for the gradient, which is the direction of maximal loss reduction. Why is that not “optimizing the outputs of the loss function as gradient descent’s main terminal motivation”?”[1]

Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than “randomly sample from all global minima in the loss landscape” (analogously: “randomly sample a plan which globally maximizes grader output”).

But I still haven’t answered your broader I think you’re asking for a very reasonable definition which I have not yet given, in part because I’ve remained somewhat confused about the exact grader/​non-grader-optimizer distinction I want to draw. At least, intensionally. (which is why I’ve focused on giving examples, in the hope of getting the vibe across.)

I gave it a few more stabs, and I don’t think any of them ended up being sufficient. But here they are anyways:

1. A “grader-optimizer” makes decisions primarily on the basis of the outputs of some evaluative submodule, which may or may not be explicitly internally implemented. The decision-making is oriented towards making the outputs come out as high as possible.

2. In other words, the evaluative “grader” submodule is optimized against by the planning.

I wish I had a better intensional definition for you, but that’s what I wrote immediately and I really better get through the rest of my comm backlog from last week.

Here are some more replies which may help clarify further.

in particular, such a system would produce plans like “analyze the diamond-evaluator to search for side-channel attacks that trick the diamond-evaluator into producing high numbers”, whereas planModificationSample would not do that (as such a plan would be rejected by self.diamondShard(self.WM.getConseq(plan))).

Aside—I agree with both bolded claims, but think they are separate facts. I don’t think the second claim is the reason for the first being true. I would rather say, “plan would not do that, because self.diamondShard(self.WM.getConseq(plan)) would reject side-channel-attack plans.”

Grader is complicit: In the diamond shard-shard case, the grader itself would say “yes, please do search for side-channel attacks that trick me

No, I disagree with the bolded underlined part.self.diamondShardShard wouldn’t be tricking itself, it would be tricking another evaluative module in the AI (i.e. self.diamondShard).[2]

If you want an AI which tricks the self.diamondShardShard, you’d need it to primarily use self.diamondShardShardShard to actually steer planning. (Or maybe you can find a weird fixed-point self-tricking shard, but that doesn’t seem central to my reasoning; I don’t think I’ve been imagining that configuration.)

and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?

Oh, I would change self.diamondShard to self.diamondShardShard? ETA: Also planModificationSample would generate different kinds of plans, because it got trained into existence to produce diamondShard-activating plans (and not to produce diamond-producing plans). I explain this more below.

1. ^

I think it’s silly to say that GD has a terminal motivation, but I’m not intending to imply that you are silly to say that the agent has a terminal motvation.

2. ^

Or more precisely, self.diamondGrader -- since “self.diamondShard” suggests that the diamond-value directly bids on plans in the grader-optimizer setup. But I’ll stick to self.diamondShardShard for now and elide this connotation.

• Overall disagreement:

I’ve remained somewhat confused about the exact grader/​non-grader-optimizer distinction I want to draw. At least, intensionally. (which is why I’ve focused on giving examples, in the hope of getting the vibe across.)

Yeah, I think I have at least some sense of how this works in the kinds of examples you usually discuss (though my sense is that it’s well captured by the “grader is complicit” point in my previous comment, which you presumably disagree with).

But I don’t see how to extend the extensional definition far enough to get to the conclusion that IDA, debate, RRM etc aren’t going to work.

self.diamondShardShard wouldn’t be tricking itself, it would be tricking another evaluative module in the AI (i.e. self.diamondShard).[2]

Okay, that makes sense. So then the implementations of the shards would look like:

def diamondShard(conseq):
return conseq.query("Number of diamonds")

def diamondShardShard(conseq):
return conseq.query("Output of diamondGrader")

But ultimately the conseq queries just report the results of whatever cognition the world model does, so these implementations are equivalent to:

def diamondShard(plan):
return WM.predict(plan, 'diamonds')

def diamondShardShard(plan):
return WM.predict(plan, 'diamondGrader')

Two notes on this:

1. Either way you are choosing plans on the basis of the output of some predictive /​ evaluative model. In the first case the predictive /​ evaluative model is the world model itself, in the second case it is the composition of the diamond grader and the world model.

2. It’s not obvious to me that diamondShardShard is terrible for getting diamonds—it depends on what diamondGrader does! If diamondGrader also gets to reflectively consider the plan, and produces low outputs for plans that (it can tell) would lead to it being tricked /​ replaced in the future, then it seems like it could work out fine.

In either case I think you’re depending on something (either the world model, or the diamond grader) to notice when a plan is going to trick /​ deceive an evaluator. In both cases all aspects of the planner remain the same (as you suggested you only need to change diamondShard to diamondShardShard rather than changing anything in the planner), so it’s not the adversarial optimization from the planner is different in magnitude. So it seems like it has to come down to some quantitative prediction that diamondGrader is going to be very bad at noticing cases where it’s being tricked.

Other responses that probably aren’t important:

Consider an analogous argument:

“Given gradient descent’s pseudocode it seems like the only point of backward is to produce parameter modifications that lead to low outputs of loss_fn. Gradient descent selects over all directional derivatives for the gradient, which is the direction of maximal loss reduction. Why is that not “optimizing the outputs of the loss function as gradient descent’s main terminal motivation”?”[1]

This argument seems right to me (modulo your point in the footnote).

Locally reducing the loss is indeed an important part of the learning dynamics of gradient descent, but this (I claim) has very different properties than “randomly sample from all global minima in the loss landscape” (analogously: “randomly sample a plan which globally maximizes grader output”).

Hmm, how does it have very different properties? This feels like a decent first-order approximation[1].

1. ^

Certainly it is not exactly accurate—you could easily construct examples of non-convex loss landscapes where (1) the one global minimum is hidden in a deep narrow valley surrounded in all directions by very high loss and (2) the local minima are qualitatively different from the global minimum.

• In both cases all aspects of the planner remain the same (as you suggested you only need to change diamondShard to diamondShardShard rather than changing anything in the planner), so it’s not the adversarial optimization from the planner is different in magnitude.

Having a diamondShard and a diamondGraderShard will mean that the generative models will be differently tuned! Not only does an animal-welfare activist grade plans based on predictions about different latent quantities (e.g. animal happiness) than a businessman (e.g. how well their firm does), the two will sample different plans from self.WM.planModificationSample! The vegan and businessman have different generative models because they historically cared about different quantities, and so collected different training data, which differently refined their predictive and planning machinery…

One of my main lessons was (intended to be) that “agents” are not just a “generative model” and a “grading procedure”, with each slot hot-swappable to different models or graders! One should not reduce a system to “where the plans come from” and “how they get graded”; these are not disjoint slots in practice (even though they are disjoint, in theory). Each system has complex and rich dynamics, and you need to holistically consider what plans get generated and how they get graded in order to properly predict the overall behavior of a system.

To address our running example—if an agent has a diamondGraderShard, that was brought into existence by reinforcement events for making the diamondGrader output a high number. This kind of agent has internalized tricks and models around the diamondGrader in particular, and would e.g. freely generate plans like “study the diamondGrader implementation.”

On the other hand, the diamondShard agent would be tuned to generate plans which have to do with diamonds. It’s still true that an “accidental” /​ “upwards-noise” generation could trick the internal diamond grader, but there would not have been historical reinforcement events which accrued into internal generative models which e.g. sample plans about doing adversarial attacks on parts of the agent’s own cognition. So I would in fact be surprised to find a free-standing diamond-shard-agent generate the plan “attack the diamondGrader implementation”, but I wouldn’t be that surprised if a diamondGraderShard’s generative model sampled that plan.

So it’s not that the diamondGrader is complicit; diamondGrader doesn’t even get a say under the hypothetical I’m imagining (it’s not a shard, it’s just an evaluative submodule which usually lays dormant). It’s that diamondGraderShard and its corresponding generative model are tuned to exert active optimization power to adversarially attack the diamondGrader.

1. The reason this applies to approval-directed agents is that we swap out diamondGraderShard for approvalGraderShard and diamondGrader for approvalGrader.

2. The reason this doesn’t apply to actual humans is that humans only have e.g. happinessShards (to a simplification), and not happinessGraderShards. This matters in part because this means their generative models aren’t tuned to generate plans which exploit the happinessShard, and in other part because the happinessShard isn’t actively bidding for plans where the happinessShard/​happinessGrader gets duped but where happiness isn’t straightforwardly predicted by the WM.

Hmm, how does it have very different properties? This feels like a decent first-order approximation

Because SGD does not reliably find minimum-loss configurations (modulo expressivity), in practice, in cases we care about. The existence of knowledge distillation is one large counterexample.

From Quintin Pope:

In terms of results about model distillation, you could look at appendix G.2 of the Gopher paper: https://​​arxiv.org/​​pdf/​​2112.11446.pdf#subsection.G.2. They compare training a 1.4 billion parameter model directly, versus distilling a 1.4 B model from a 7.1 B model.

If it were really true that SGD minimized loss (mod expressivity), knowledge distillation wouldn’t reduce training loss, much less minorize it. And this matters for our discussion, because if one abstracts SGD as “well this local behavior of loss-reduction basically adds up to global loss-minimization, as the ‘terminal goal’ in some loose sense”, this abstraction is in fact wrong. (LMK if you meant to claim something else, not trying to pigeonhole you here!)

And this ties into my broader point, because I consider myself to be saying “you can’t just abstract this system as ‘trying to make evaluations come out high’; the dynamics really do matter, and considering the situation in more detail does change the conclusions.” I think this is a direct analogue of the SGD case. I reviewed that case in reward is not the optimization target, and now consider this set of posts to do a similar move for values-executing agents being grader-executers, not grader-maximizers.

• I don’t really disagree with any of what you’re saying but I also don’t see why it matters.

I consider myself to be saying “you can’t just abstract this system as ‘trying to make evaluations come out high’; the dynamics really do matter, and considering the situation in more detail does change the conclusions.”

I’m on board with the first part of this, but I still don’t see the part where it changes any conclusions. From my perspective your responses are of the form “well, no, your abstract argument neglects X, Y and Z details” rather than explaining how X, Y and Z details change the overall picture.

For example, in the above comment you’re talking about how the planner will be different if the shards are different, because the historical reinforcement-events would be different. I agree with that. But then it seems like if you want to argue that one is safer than the other, you have to talk about the historical reinforcement-events and how they arose, whereas all of your discussion of grader-optimizers vs values-executors doesn’t talk about the historical reinforcement-events at all, and instead talks about the motivational architecture while screening off the historical reinforcement-events.

(Indeed, my original comment was specifically asking about what your story was for the historical reinforcement-events for values-executors: “Certainly I agree that if you successfully instill good values into your AI system, you have defused the risk argument above. But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?”)

• I don’t really disagree with any of what you’re saying but I also don’t see why it matters. … Indeed, my original comment was specifically asking about what your story was for the historical reinforcement-events for values-executors

I was pretty surprised by the values-executor pseudocode in Appendix B, because it seems like a bog-standard consequentialist which I would have thought you’d consider as a grader-optimizer. In particular you can think of the pseudocode as follows:

Grader-optimizer: planModificationSample + the for loop that keeps improving the plan based on proposed modifications

Grader-being-optimized: self.diamondShard(self.WM.getConseq(plan))

So my question is:

• If you agree that [planModificationSample + the for loop] is a grader-optimizer, why isn’t this an example of an alignment approach involving a grader-optimizer that could plausibly work?

• If you don’t agree that [planModificationSample + the for loop] is a grader-optimizer, then why not, and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?

You also said:

I saw that and I don’t understand why it rules out planModificationSample + the associated for loop as a grader-optimizer. Given your pseudocode it seems like the only point of planModificationSample is to produce plan modifications that lead to high outputs of self.diamondShard(self.WM.getConseq(plan)). So why is that not “optimizing the outputs of the grader as its main terminal motivation”?

And now, it seems like we agree that the pseudocode I gave isn’t a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?

Sounds like we mostly disagree on cumulative effort to: (get a grader-optimizer to do good things) vs (get a values-executing agent to do good things).

We probably perceive the difficulty as follows:

1. Getting the target configuration into an agent

1. Alex: Very very hard

2. Rohin: Hard

2. Values-executing

1. Alex: Moderate/​hard

2. Rohin: Hard

2. Aligning the target configuration such that good things happen (e.g. makes diamonds), conditional on the intended cognitive patterns being instilled to begin with (step 1)

1. Alex: Extremely hard

2. Rohin: Very hard

2. Values-executing

1. Alex: Hard

2. Rohin: Hard

Does this seem reasonable? We would then mostly disagree on relative difficulty of 1a vs 1b.

Separately, I apologize for having given an incorrect answer earlier, which you then adopted, and then I berated you for adopting my own incorrect answer—how simplistic of you! Urgh.

and what modification would you have to make in order to make it a grader-optimizer with the grader self.diamondShard(self.WM.getConseq(plan))?

Oh, I would change self.diamondShard to self.diamondShardShard?

But I should also have mentioned the change in planModificationSample. Sorry about that.

• And now, it seems like we agree that the pseudocode I gave isn’t a grader-optimizer for the grader self.diamondShard(self.WM.getConseq(plan)), and that e.g. approval-directed agents are grader-optimizers for some idealized function of human-approval? That seems like a substantial resolution of disagreement, no?

I don’t think I agree with this.

At a high level, your argument can be thought of as having two steps:

2. Approval-directed agents /​ [things built by IDA, debate, RRM] are grader-optimizers.

I’ve been trying to resolve disagreement along one of two pathways:

1. Collapse the argument into a single statement “approval-directed agents are bad because of problem P”, and try to argue about that statement. (Strategy in the previous comment thread, specifically by arguing that problem P also applied to other approaches.)

2. Understand what you mean by grader-optimizers, and then figure out which of the two steps of your argument I disagree with, so that we can focus on that subclaim instead. (Strategy for most of this comment thread.)

Unfortunately, I don’t think I have a sufficient definition (intensional or extensional) of grader-optimizers to say which of the two steps I disagree with. I don’t have a coherent concept in my head that says your pseudocode isn’t a grader-optimizer and approval-directed agents are grader-optimizers. (The closest is the “grader is complicit” thing, which I think probably could be made coherent, but it would say that your pseudocode isn’t a grader-optimizer and is agnostic /​ requires more details for approval-directed agents.)

In my previous comment I switched back from strategy 2 to strategy 1 since that seemed more relevant to your response but I should have signposted it more, sorry about that.

• But how did you do that? Why didn’t we instead get “almost-value-child”, who (say) values doing challenging things that require hard work, and so enrolls in harder and harder courses and gets worse and worse grades?”)

I think the truth is even worse than that, in that deceptively aligned value children are the default scenario without myopia and solely causal decision theory or variants of it, and Turntrout either has good arguments for why this isn’t favored, or Turntrout hopes that deceptive alignment is defined away.

For reasons why deceptive alignment might be favored, I have a link below:

https://​​www.lesswrong.com/​​posts/​​A9NxPTwbw6r6Awuwt/​​how-likely-is-deceptive-alignment

• That grader-optimizer probably has a diamond-shard too, in order for “diamond-shard” to be represented as a concept that can influence its decisions, no? But to your point, it also has a much stronger “diamond-shard”-shard, which is responsible for driving it to make decisions based on thoughts about the outputs of its diamond-shard, as opposed to decisions on the basis of thoughts about diamonds.

• Having a go at extracting some mechanistic claims from this post:

• A value x is a policy-circuit, and this policy circuit may sometimes respond to a situation by constructing a plan-grader and a plan-search.

• The policy-circuit executing value x is trained to construct <plan-grader, plan-search> pairs that are ‘good’ according to the value x, and this excludes pairs that are predictably going to result in the plan-search Goodharting the plan-grader.

• Normally, nothing is trying to argmax value x’s goodness criterion for <plan-grader, plan-search> pairs. Value x’s goodness criterion for <plan-grader, plan-search> pairs is normally just implicit in x’s method for constructing <plan-grader, plan-search> pairs.

• Value x may sometimes explicitly search over <plan-grader, plan-search> pairs in order to find pairs that score high according to a grader-proxy to value x’s goodness criterion. However, here too value x’s goodness criterion will be implicitly expressed in the policy-execution level as a disposition to construct a pair <grader-proxy to value x’s goodness criterion, search over pairs> that doesn’t Goodhart the grader-proxy to value x’s goodness criterion.

• The crucial thing is that the true, top level ‘value x’s goodness criterion’ is a property of an actor, not a critic.

• We do have empirical evidence that nonrobust aligned intelligence can be not OK, like this or this. Why are you not more worried about superintelligent versions of these (i.e. with access to galaxies worth of resources)?

• What do you mean by “nonrobust” aligned intelligence? Is “robust” being used in the “robust grading” sense, or in the “robust values” sense (of e.g. agents caring about lots of things, only some of which are human-aligned), or some other sense?

Anyways, responding to the vibe of your comment—I feel… quite worried about that? Is there something I wrote which gave the impression otherwise? Maybe the vibe of the post is “alignment admits way more dof than you may have thought” which can suggest I believe “alignment is easy with high probability”?

• When I talk about shard theory, people often seem to shrug and go “well, you still need to get the values perfect else Goodhart; I don’t see how this ‘value shard’ thing helps.”

I realize you are summarizing a general vibe from multiple people, but I want to note that this is not what I said. The most relevant piece from my comment is:

I don’t buy this as stated; just as “you have a literally perfect overseer” seems theoretically possible but unrealistic, so too does “you instill the direct goal literally exactly correctly”. Presumably one of these works better in practice than the other, but it’s not obvious to me which one it is.

In other words: Goodhart is a problem with values-execution, and it is not clear which of values-execution and grader-optimization degrades more gracefully. In particular, I don’t think you need to get the values perfect. I just also don’t think you need to get the grader perfect in grader-optimization paradigms, and am uncertain about which one ends up being better.

• Goodhart is a problem with values-execution

I understand this to mean “Goodhart is and historically has been about how an agent with different values can do bad things.” I think this isn’t true. Goodhart concepts were coined within the grader-optimization/​argmax/​global-objective-optimization frame:

Throughout the post, I will use to refer to the true goal and use to refer to a proxy for that goal which was observed to correlate with and which is being optimized in some way.

This cleanly maps onto the grader-optimization case, where is the grader and is some supposed imaginary “true set of goals” (which I’m quite dubious of, actually).

This doesn’t cleanly map onto the value shard case. The AI’s shards cannot be , because they aren’t being optimized. The shards do the optimizing.

So now we run into a different regime of problems AFAICT, and I push back against calling this “Goodhart.” For one, e.g. extremal Goodhart has a “global” character, where imperfections in get blown up in the exponentially-sized plan space where many adversarial inputs lurk. Saying “value shards are vulnerable to Goodhart” makes me anticipate the wrong things. It makes me anticipate that if the shards are “wrong” (whatever that means) in some arcane situation, the agent will do a bad thing by exploiting the error. As explained in this post, that’s just not how values work, but it is how grader-optimizers often work.

While it’s worth considering what value-perturbations are tolerable versus what grader-perturbations are tolerable, I don’t think it makes sense to describe both risk profiles with “Goodhart problems.”

it is not clear which of values-execution and grader-optimization degrades more gracefully. In particular, I don’t think you need to get the values perfect. I just also don’t think you need to get the grader perfect in grader-optimization paradigms, and am uncertain about which one ends up being better.

I changed “perfect” to “robust” throughout the text. Values do not have to be “robust” against an adversary’s optimization, in order for the agent to reliably e.g. make diamonds. The grader does have to be robust against the actor, in order to force the actor to choose an intended plan.

• As I understand Vivek’s framework, human value shards explain away the need to posit alignment to an idealized utility function. A person is not a bunch of crude-sounding subshards (e.g. “If food nearby and hunger>15, then be more likely to go to food”) and then also a sophisticated utility function (e.g. something like CEV). It’s shards all the way down, and all the way up.[10]

This read to me like you were saying “In Vivek’s framework, value shards explain away ..” and I was confused. I now think you mean “My take on Vivek’s is that value shards explain away ..”. Maybe reword for clarity?

(Might have a substantive reply later)

• Question: How do we train an agent which makes lots of diamonds, without also being able to robustly grade expected-diamond-production for every plan the agent might consider?

I thought you were about to answer this question in the ensuing text, but it didn’t feel like to me you gave an answer. You described the goal (values-child), but not how the mother would produce values-child rather than produce evaluation-child. How do you do this?

• Ah, I should have written that question differently. I meant to ask “If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?”

How do you do this?

Anyways, we might train a diamond-values agent like this.

• I’m focusing on the code in Appendix B.

What happens when self.diamondShard’s assessment of whether some consequences contain diamonds differs from ours? (Assume the agent’s world model is especially good.)

• The same thing which happens if the assessment isn’t different from ours—the agent is more likely to take that plan, all else equal.

• upweights actions and plans that lead to

how is it determined what the actions and plans lead to?

• See the value-child speculative story for detail there. I have specific example structures in mind but don’t yet know how to compactly communicate them in an intro.

• In your view, are values closer to recipes/​instruction manuals than to utility functions?

• Yeah, kinda, although “recipe” implies there’s something else deciding to follow the recipe. Values are definitely not utility functions, on my view.

• I agree that something like this is plausible. To be more specific:

My favorite terms for it, which I mainly use in private thoughts and haven’t really written up much about elsewhere, is covariant concepts vs contravariant concepts. If you are familiar with type systems or category theory, you can think of it as being that sense. Basically, a contravariant concept is defined by something like a utility function or a boolean predicate or a set of constraints, while a covariant concept is defined by something like a distribution/​measure or a convex set or a schematic. Covariant = output, contravariant = input, essentially. Or in terms of AI, generative models vs classifiers.

It does seem like some concepts are very anti-natural to define in a contravariant way, and very natural to define in a covariant way. In fact I would dare to go further and say that contravariant concepts usually cannot really be defined without a library of covariant concepts to define them in terms of, because you need something to “grab onto” reality, and contravariant concepts can’t really do that. This makes it seem quite plausible that covariance will play a critical role in aligned AI.

Some further things:

• A simple way to handle covariant concepts is to take them as primitive. This is what e.g. generative latent variable models tend to do. The trouble is that in the real world, they are not primitive, but instead reduce into smaller-scale components. Sometimes we want the AIs to respect the sanctity of those components (e.g. don’t wirehead people) while other times we want the AI to intervene in the components (e.g. do cure aging). (And sometimes it’s controversial what we want it to do, e.g. with em uploads.) Taking them as primitive on one level of analysis does not really support novel interventions on the lower levels of analysis, nor does it really support respecting their sanctity on a higher level of analysis.

• Covariance and contravariance is not totally absolute. The point of search, for instance, is basically to convert contravariant things to covariant ones.

• A program that can act in the world to achieve good things would be a covariant representation of goodness. However, this program needs to be created in some way, and the world seems much to complex for engineers to handle each part. Instead it seems that we need some sort of learning process to capture it. But I have trouble thinking of learning processes that can plausibly effectively learn novel useful things without involving contravariant reasoning somewhere along the line. So I don’t think the trouble with contravariance can be avoided for alignment.

Not sure if this aligns 100% with your way of thinking about things but the framing might be helpful?