Failure modes in a shard theory alignment plan

Thanks to David Udell, Alex Turner, and others for conversations that led to this post.

Recently in a conversation with David Udell, Nate Soares said:

“[David’s summary of shard theory] doesn’t yet convince me that you know something i don’t about the hopefulness of such a plan. the sort of summary that might have that effect is a summary of what needs to be true about the world (in this case, probably about RL models, optimization, and/or human values) for this idea to have hope. in particular, the point where i start to be interested in engaging is the point where it seems to me like you perceive the difficulties and have a plan that you expect to overcome those difficulties.”

I’m not as pessimistic as Nate, but I also think shard theory needs more concreteness and exploration of failure modes, so as an initial step I spent an hour brainstorming with David Udell, with each of us trying to think of difficulties, and then another couple of hours writing this up. Our methodology was to write down a plan for the sake of concreteness (even if we think it’s unlikely to work), then try to identify as many potential difficulties as possible.

Definitions for this document

shard theory is a new alignment research program based on observations about the human reward system; the canonical reference is here.
shard means a contextually-activated circuit responsible for some behaviors. Shards can perform powerful optimization that steers the world in certain directions; rather than suggestively calling these directions values, I just call them behavioral tendencies.
corrigibility means the property of a system that will allow itself to be modified by humans, and perhaps be actively helpful in various ways.
reward means something that applies selection to a model, and in particular reinforces and attenuates shards.
values means behavior patterns that are stable under reflection, e.g. a utility function. The utility function need not be explicitly represented or precise, nor be over world-states.
value formation is the process by which humans or AIs construct values from their existing behavioral tendencies.^[1]

A possible shard theory alignment plan

Play around with modern RL models, and extract quantitative relationships between reward schedule and learned behaviors.
Instill corrigibility shards inside powerful RL models to the greatest extent possible.
Scale those aligned models up to superintelligent agents, and allow value formation to happen.

Note that this is just one version of the shard theory alignment plan; other versions might replace RL with large language models or other systems.

Some requirements for this plan to work

Play around with modern RL models, and extract quantitative relationships between reward schedule and learned behaviors.
- Thomas: Modern RL has a reward schedule → behavioral tendency map that is not hopelessly complicated.
- T: We have enough transparency tools + data to invert this map to robustly produce shards that steer the world in directions we know how to specify.
- David: Modern RL is powerful enough to have interesting, desirable shards.
- D: We are able to find quantitative relationships between, e.g. abstractions present in pretraining and strength of related learned behavioral tendencies.
- D: The quantitative relations we observe hold generally for many RL hyperparameter setups, not just narrowly for weak RL models.
Instill corrigibility shards inside powerful RL models to the greatest extent possible.
- T: Powerful RL architectures are not so alien as to have huge inductive biases against corrigibility (or whatever nice property).
- T: Similar mechanisms for getting good shards to form on modern RL work on more powerful models.
- T: Training competitiveness: we have enough training time / environments to provide sufficient selection pressure towards systems we want.
- D: The relations between training variables and learned behavioral tendencies are manipulable levers sufficient to specify corrigibility.
- D: Corrigibility isn’t inordinately hard to pinpoint and can be learned via the training manipulations we’ll have access to.
Scale those aligned models up to superintelligent agents, and allow value formation to happen.
- D: There’s no dramatic phase transition away from the “shard” abstraction when models become superintelligent.
- D: We successfully pinpoint a set of target shards—e.g. corrigibility shards—that we wish to install and thereby initiate a pivotal act/process with.
- D: We still have adequate interpretability tools to be confident we’re getting the shard development we’re after, and not just e.g. a deceptive model playing along.
- D: Shards with human values generalize OOD to the superintelligent domain and actually reshape the world as we want.
- T: Predictable value formation (e.g. intershard game theory) happens before unpredictable value formation (e.g. alien philosophical reflection) in powerful models.
  - D: Scaling an aligned RL model to superintelligence doesn’t result in a few of the shards killing off the others and gaining complete control.
- T: The predictable value formation process we identify can reliably produce a superintelligence that values some concept X from human-identifiable shards that define behavioral tendencies pointing towards X.
- T: The predictable value formation process we identify can scale “corrigibility” to a reflectively consistent superintelligence, despite corrigibility maybe not being shaped like a utility function.^[2]
- T: The process of selecting for corrigibility shards doesn’t ruin performance competitiveness. One way this could fail: we end up with a superintelligence that is ruled by a vote between “alien optimization for a targetable goal” shards and “don’t optimize against the humans” shards. If we succeed at all the other problems and get a sufficiently high proportion of “don’t optimize against the humans” shards, we might not get anything useful out of the AI because with more powerful optimization, a lower and lower proportion of plans will be accepted.
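The performance-competitiveness worry in the last bullet can be illustrated with a toy simulation. This is only a sketch under assumptions that are not in the discussion above: plans are random (goal score, impact) pairs, impact is positively correlated with goal score through a hypothetical `coupling` parameter, "optimization power" is best-of-N search on goal score, and the “don’t optimize against the humans” shards are modeled as a single veto with a fixed impact `threshold` rather than as a real vote. Under those assumptions, the fraction of selected plans that survive the veto falls as search power grows.

```python
import random


def acceptance_rate(power, trials=2000, threshold=1.0, coupling=0.8, seed=0):
    """Fraction of best-of-`power` plans that a veto shard accepts.

    All parameters are hypothetical knobs of this toy model:
    - power: number of candidate plans the goal-directed shard searches over
    - threshold: the veto shard rejects any plan whose impact exceeds this
    - coupling: assumed correlation between goal score and impact
    """
    rng = random.Random(seed)
    noise_scale = (1 - coupling ** 2) ** 0.5
    accepted = 0
    for _ in range(trials):
        best_goal, best_impact = None, None
        for _ in range(power):
            goal = rng.gauss(0, 1)
            # Assumption: plans that score higher on the goal tend to have more impact.
            impact = coupling * goal + noise_scale * rng.gauss(0, 1)
            if best_goal is None or goal > best_goal:
                best_goal, best_impact = goal, impact
        # The veto shard only looks at the plan the optimizer actually proposes.
        if best_impact < threshold:
            accepted += 1
    return accepted / trials


if __name__ == "__main__":
    for power in (1, 10, 100):
        print(power, acceptance_rate(power))
```

If the assumed coupling between goal score and impact is set to zero, the acceptance rate no longer depends on search power, so the effect in this toy model comes entirely from that assumption.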

Exercise for the reader: Which of these persist with simulator-based plans like GEM?

Opinions

Note: I’ve thought about this less than I would like and so am fairly unconfident, but I’m posting it anyway.

In a non-shard theory frame like Risks from Learned Optimization, we decompose the alignment problem into outer alignment (finding an outer objective aligned with human values) and inner alignment (finding a training process such that an AI’s behavioral objective matches the outer objective).

In the shard theory frame, the observation that reward is not the optimization target motivates a different decomposition: our reward signal no longer has to be an outer objective that we expect the AI to be aligned with. But my understanding is that we still have to:

1. find a reliable mapping from reward schedules to behavioral shards (replaces inner alignment)
2. describe behavioral tendencies we want to instill (sort of replaces outer alignment)^[3]
3. induce a predictable value formation process that scales to superintelligence (sort of replaces outer alignment)

These problems seem pretty hard, as evidenced by the above gaps in the example shard theory plan, and it’s not clear that they’re easier than inner and outer alignment. Some analogies from humans imply that various core alignment problems really are easy, but there are also reasons why they wouldn’t be.

The first subproblem is understanding the reward → behavioral shard map. In the Risks from Learned Optimization (RLO) frame, we have an outer loss function that updates the agent in almost the correct direction, but has some failure modes like deceptive alignment. In a world where models were perfect Bayesian samplers with no path-dependence, the ideal reward function would just be performance on an outer objective. Every departure from the perfect Bayesian sampler implies, in theory, some way to shape behavior better than just using an outer objective. Despite this advantage over the “inner alignment” framing, some of the problems with inner alignment remain.

If we view the inner alignment problem as distinguishing functions that are identical on the training distribution, it becomes clear that shard theory has not dissolved inner alignment. Selecting on behavior alone is not sufficient to guarantee inner alignment^[4], and for the same reason, we will need good transparency or process-based feedback (as part of our knowledge about the reward → shards map) to reliably induce behaviors we want.
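A minimal sketch of what “functions that are identical on the training distribution” means in practice. The two functions, the input ranges, and the squared-error loss are all hypothetical choices made for illustration; nothing here is specific to shard theory or to any real training setup.

```python
def intended(x: int) -> int:
    """The behavior we want everywhere (hypothetical target)."""
    return x


def proxy(x: int) -> int:
    """Matches `intended` on training inputs, diverges outside them."""
    return x if x < 10 else -x


def behavioral_loss(model, target, inputs) -> int:
    """Squared error between model and target, computed from behavior alone."""
    return sum((model(x) - target(x)) ** 2 for x in inputs)


train_inputs = list(range(10))       # assumed training distribution
ood_inputs = list(range(10, 15))     # assumed out-of-distribution inputs

if __name__ == "__main__":
    print("train loss:", behavioral_loss(proxy, intended, train_inputs))
    print("OOD loss:  ", behavioral_loss(proxy, intended, ood_inputs))
```

Any selection procedure that only sees `behavioral_loss` on `train_inputs` cannot prefer `intended` over `proxy`; telling them apart requires looking at how the output is computed, which is the role transparency or process-based feedback would have to play.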

Work can be traded off between specifying desired behavioral tendencies and value formation, and the part that happens at subhuman capability seems doable. I’ll assume that specifying behavioral tendencies happens at subhuman level and value formation is done as the system scales to superhuman level.

In my opinion, the main hope of shard theory is the analogy to humans: human values are somewhat reliably produced by the human reward system, despite the reward system not acting like an outer objective. But when we consider the third subproblem, value formation, the analogy breaks down. Human value formation seems really complex, and it’s not clear that human values can be fully described by game-theoretic negotiations between fully agentic, self-preserving shards associated with your different behavioral tendencies.^[5]

Corrigibility might be easier to learn, but it’s still the case that only some ways shards could exist in a mind cause its goals to scale correctly to superintelligence.^[6] For example, if shards are just self-preserving circuits that encode behaviors activated by certain observations, then when the agent goes OOD, the observations that used to activate a shard (e.g. one that makes the agent helpful to humans) may no longer be present, so the shard stops firing. Or, if shards have goals and a shard doesn’t prevent its goals from being modified, then its goals will be overwritten. Or, if training causes the corrigibility shards to be limited in intelligence whereas shards with other goals keep getting smarter, the corrigibility shards will eventually lose influence over the mind. If there are important forces in value formation other than internal competition and negotiation between self-preserving shards (which seems highly likely given how humans work), there are even more failure modes, which is why I think a predictable value formation method is key.
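The first failure mode in the paragraph above can be made concrete with a toy model of contextually-activated shards. This is a sketch only: the `Shard` class, the observation features, and the trigger functions are hypothetical, and real shards (if the abstraction holds at all) would not be hand-written predicates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Observation = Dict[str, bool]


@dataclass
class Shard:
    name: str
    trigger: Callable[[Observation], bool]  # when the shard activates
    behavior: str                           # what it does when active


def active_behaviors(shards: List[Shard], obs: Observation) -> List[str]:
    """Return the behaviors of every shard whose trigger fires on `obs`."""
    return [s.behavior for s in shards if s.trigger(obs)]


shards = [
    Shard("helpfulness", lambda o: o.get("human_request_visible", False), "help the human"),
    Shard("resources", lambda o: o.get("resources_available", False), "acquire resources"),
]

# In training, the feature the helpfulness shard keys on is always present.
in_distribution = {"human_request_visible": True, "resources_available": True}
# Out of distribution, that feature is missing, so only the other shard fires.
out_of_distribution = {"resources_available": True}

if __name__ == "__main__":
    print(active_behaviors(shards, in_distribution))
    print(active_behaviors(shards, out_of_distribution))
```

The helpfulness shard is never overwritten or outvoted here; it simply never activates once its triggering observation disappears.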

^[1] In reality, there is a continuum of coherence levels between behavioral tendencies and values.
^[2] I think corrigibility is natural iff robust pointers to it can easily get into the AI’s goals, and it’s not clear whether this is the case; this is a disagreement between Eliezer and Paul.
^[3] and maybe “be able to identify” as well; it depends on how reliable the mapping from (1) is
^[4] unless you can get a really strong human prior, which is where the simulators hope comes from
^[5] I think I care about animal suffering due to some combination of (a) it’s high status in my culture; (b) I did some abstract thinking that formed a similarity between animal suffering and human suffering, and I already decided I care about humans; (c) I wanted to “have moral clarity” (whatever that means), went vegan for a month, and decided that the version of me without associations between animals and food had better moral intuitions. It’s not as simple as an “animal suffering bad” shard in my brain outcompeting “animal suffering okay” shards.
^[6] I have reasons outside the scope of this post why the particular subagent models shard theory has been using seem unlikely.