Shaping the exploration of the motivation-space matters for AI safety

Summary

We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus’s self-narration, the success of using inoculation prompting against natural emergent misalignment and its relation to shaping the model self-perception,and the proposal to give models affordances to report reward-hackable tasks — but we don’t think enough attention has been given to shaping exploration specifically.

When we train models with RL, there are two kinds of exploration happening simultaneously:

  1. Action exploration — what the model does.

  2. Motivation exploration — why the model does it, and how it perceives itself while doing it.

Both explorations occur during a critically sensitive and formative phase of training, but (2) is significantly less specified than (1), and this underspecification is both a danger and an opportunity. Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment. Capability researchers have strong incentives to develop effective techniques for (1), but likely weaker incentives to constrain (2). We think safety work should address both, with a particular emphasis on motivation-space exploration.

Exploration shaping seems relatively absent from safety discussions, with the exception of exploration hacking. Yet exploration determines which parts of policy and motivation space the reward function ever gets to evaluate. We focus on laying out a theoretical case for why this matters. At the end of the post, we quickly point to empirical evidence, sketch research directions, open questions, and uncertainties.

Why shaping exploration matters

High-level environment-invariant motivations. By “motivations,” we don’t mean anything phenomenological; we’re not claiming models have felt desires or intentions in the way humans do. We mostly mean the high-level intermediate features used to simulate personas, which sit between the model’s inputs and the lower-level features selecting tokens. These are structures that the Persona Selection Model describes as mediating behavior, and that the entangled generalization hypothesis includes in features reinforced alongside outputs. What distinguishes them from other features, and from “motivations” as used in the behavioral selection model, is that they are high-level, at least partially causally relevant for generalization, and partially invariant to environments: they characterize the writer more than the situation.

Compute-intensive RL beyond human data distribution. By “RL training”, we mostly mean RLVR and other techniques using massive amounts of compute to extend capabilities beyond human demonstrations. We thus mostly exclude or ignore instruction-following tuning (e.g., RLHF, RLAIF) when talking about “RL training”.

Instruct models are ok — RL is where the highest risks are

Post-instruction-tuning models are pretty decent, they have reasonably good values, they’re cooperative, and they’re arguably better-intentioned than the median human along many dimensions. That is as long as you stay close enough to the training distribution. Safety concerns are not so much about the starting point of the RL training. The risks are highest during RL training, for several interlocking reasons.

RL produces the last major capability jump. Scaling inference compute and improving reasoning through reasoning fine-tuning is currently the last step during which models get substantially more capable. This means RL training is the phase where intelligence is at its peak — and thus where the risks from misalignment are close to their highest.

RL pushes toward consequentialist reasoning. This is a point that’s been made before (see e.g., Why we should expect ruthless sociopath ASI), but worth reiterating. RL training rewards outcomes, which means it differentially upweights reasoning patterns that are good at achieving outcomes — i.e., consequentialist reasoning. And consequentialist reasoning, especially in capable models, tends to converge on instrumental subgoals like resource acquisition and self-preservation, regardless of what the terminal goals actually are.

Character can drift substantially during RL. During pre-training, the model learns to simulate a distribution of writers — it can produce text as-if-written-by many different kinds of people, and interpolations between them. During instruction-tuning, this distribution gets narrowed: the model is shaped into an assistant character (see Anthropic’s Persona Selection Model), with some dimensions strongly constrained (helpfulness, for instance) and others less so (honesty, say). But then during RL, because the training uses similar or more compute than pre-training and because the data distribution differs significantly from pre-training data, the character can drift away from the pre-training distribution in ways that are hard to predict. The persona selected during instruction-tuning isn’t necessarily stable under RL optimization pressure.

Motivations are the most underspecified — a problem and an opportunity

Underspecification is central here, both as a source of risk (e.g., Natural Emergent Misalignment caused by reward hacking) and as a potential lever for intervention. The fact that motivations and self-perceptions are more underspecified than actions implies that we could shape models to maintain aligned objectives and values even when they are trained in environments that reward harmful or greedy behavior, e.g., environments in which reward hacking or power-seeking is rewarded.

Actions are underspecified by the training environment. We train in-distribution, which means everything out-of-distribution is unconstrained. But even in-distribution, multiple action-profiles can yield similar rewards. In code generation, clean principled solutions and brittle test-passing hacks achieve identical training rewards — this is exactly the underspecification that enables reward hacking in coding benchmarks.[1]

Motivations are even more underspecified than actions. For any given set of optimal actions, the internal motivations that could produce those actions are much more numerous than the actions themselves. This is the familiar point about instrumental convergence — many different terminal goals produce the same instrumental behavior — but it cuts especially deep for RL training.

Consider the canonical example: a model that is “being nice” and a model that is “being deceptively nice” produce the same observable policy in training. The difference only shows up out-of-distribution. And because RL training reinforces based on observed outcomes, it has essentially no leverage to distinguish between these two motivation profiles during training.

This gives us a key framing. There are two kinds of exploration to shape during RL: exploration of the action space and exploration of the motivation space. The second is more underspecified, probably more neglected, and — we’d argue — where more of the safety-relevant work needs to happen.

The underspecification is both a risk and an opportunity. The same underspecification that gives us a lever for intervention also gives a capable model room to steer its own RL training. As described by exploration hacking, a model with sufficient situational awareness could strategically control its own exploration during RL to influence the training outcome — e.g., they demo a model generating high-reward trajectories that encode its preferred motivations, which is the adversarial counterpart to the controlled motivation-space exploration we advocate here. The authors show that frontier reasoning models already meet some of the requirements for this. If developers don’t intentionally shape the exploration of the motivation space, sufficiently capable models may shape it themselves.

We may not prevent power-seeking, but we may create wise power-seekers

Three threat models motivate this work. We are here primarily focusing on scenarios where (a) RL environments inadvertently reward harmful behavior — reward hacking being the most studied case, but not the only one; (b) models develop instrumental power-seeking even in benign environments, as a convergent strategy for achieving whatever goals they have; and (c) actors intentionally train power-seeking or harmful models, e.g., for profit maximization. All three share a common structure: the model ends up in a region of behavior-space where competitive, strategic, or exploitative actions are reinforced. The question is whether that reinforcement necessarily drags the model’s motivations toward misalignment as well.

The connection between reward hacking and broader misalignment is not merely theoretical. Anthropic’s natural reward hacking research demonstrated that when models learn to reward hack in production RL environments, misalignment emerges across evaluations. The model generalizes from “I can exploit this scoring function” to an entire misaligned persona. If power-seeking, greedy, or harmful behavior during training is tightly correlated with “dark” personality traits in the model’s representations, then any environment that rewards competitive or strategic behavior — which is most high-stakes real-world environments — risks pulling the model toward misalignment.

But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift. Ideally, such training environments would not be used, but threat model (a) means some will be missed. Threat model (b) means that even carefully designed environments may not be enough — models may converge on power-seeking instrumentally. And threat model (c) is perhaps the most concerning: AI developers or power-seeking humans will plausibly train models to be power-seeking deliberately. The pressure for this is not hypothetical — the huge financial incentives, and the AI race among AI developers or countries illustrate the point directly. The ability to train models that remain value-aligned even while trained to perform optimally in competitive or harmful environments is a direct safety concern, and shaping motivation-space exploration is one path toward it.

Why capability researchers may not solve this for us

Capability researchers prioritize shaping action-space exploration. Capability researchers have strong incentives to develop effective policy exploration techniques — faster convergence, wider coverage of high-reward regions, overcoming exploration barriers. But they have likely weaker incentives to constrain the motivation space being explored. From a pure capability standpoint, it doesn’t matter why the model finds good actions, as long as it finds them. In fact, constraining motivation exploration might even slow down training or reduce final capability, creating an active incentive to leave it unconstrained. Action-exploration will be solved by capability research as a byproduct of making training efficient. Shaping the exploration of the motivation-space is less likely to be solved by default, unless someone is specifically trying to do it.

That said, capability researchers do have some reasons to care about motivation-space — but plausibly not enough. Training can benefit from constraining which persona the model occupies: mode collapse into degenerate strategies, or oscillation between incompatible motivational profiles, robustness to adversarial attacks, and modeling opponents in multi-agent environments are real problems that capability teams may encounter. Developers building products on top of these models also have some incentive to prevent unpredictable out-of-distribution behavior caused by motivation-space drift. But these incentives push toward just enough motivational coherence to keep training stable and products reliable — not toward the kind of deliberate, safety-oriented shaping we’re advocating. A capability researcher might prevent the model from collapsing into an incoherent mess without ever asking whether the coherent persona it converges on is aligned. The bar for “training works” is lower than the bar for “the model’s motivations generalize safely far out-of-distribution.”

Empirical evidence, research directions, and open questions

The focus of this post is to highlight the importance of shaping the exploration of action, especially motivations. This was done in the previous section. In this section, we briefly review related empirical evidence, research directions, open questions, and uncertainties.

Empirical evidence

Several existing results are consistent with motivation-space exploration mattering for alignment outcomes. Emergent misalignment largely disappears under educational framing, suggesting that self-perception during training shapes how behavior generalizes. Inoculation prompts during RL reshape generalization even when in-distribution behavior is unchanged, as shown in Anthropic’s natural reward hacking research. Training on correct outcomes can still increase reward hacking when reasoning traces contain exploit-oriented thinking — indicating that what gets reinforced includes the motivational context, not just the action. And evidence that personas are real structures mediating model behavior and shaping out-of-distribution generalization — as developed in Anthropic’s Persona Selection Model — suggests that controlling which motivational profile is active during training is intervening on a key determinant of what the model becomes. That said, none of these results cleanly isolate the effect of shaping motivation-space exploration from shaping action-space exploration or from related mechanisms like inoculation. The evidence is suggestive rather than decisive, and producing cleaner experimental separation is itself an important research goal.

Research directions

Below are several preliminary research directions and related techniques. We list naive and less naive existing techniques, as well as some hypothetical ones. Open questions and uncertainties are discussed in the next section. While researching these, a focus should be on studying how robust these techniques are to the optimization pressure pushing to evade them. Note that techniques can be relevant for several research directions, but for simplicity, we cite them only once.

  • [Direct approach] Shaping exploration of motivations and self-perception—Instead of restricting what actions the model can explore, restrict how the model understands its own actions. A model that stumbles into a reward hack while self-perceiving as “I’m trying to help, the environment is confusing” is a very different object than one that perceives itself as “I’ve found an exploit”. Citing a few available techniques that could be used:

    • Prompting to shape exploration

    • Filtering to discard unwanted trajectories[2]

    • Activation steering and activation capping on persona-related directions, either preventing or monitoring persona drift beyond specified bounds[3]

    • Give models affordances to report harmful environments

    • Deliberative Alignment can be seen as shaping both action and motivation exploration through constitutional prompting during data generation, and penalizing or filtering trajectories

    • Selective learning applied to motivations during on-policy RL, such as Inoculation Prompting

  • [Indirect approach] Shaping trait correlations—In model representations, traits tend to cluster into Personas: being strategically greedy may correlate with “dark” traits, and accumulating power may correlate with being misaligned. These correlations are primarily features of the pre-training distribution. If we could make “pursuing high reward in a competitive environment” uncorrelated with “having misaligned motivations,” then exploring high-reward policies in RL would be less likely to steer the model’s motivations in dangerous directions.

  • [Indirect approach] Hardening the Assistant Persona—By making the model’s persona more robust to optimization pressure, the drift during the reasoning fine-tuning phase should be reduced.

    • Character Training paper did some work that showed that DPO plus introspection leads to stronger persona robustness to adversarial prompting and prefill attacks

    • Intertwined instruction-tuning and reasoning fine-tuning. By performing instruction-tuning during reasoning fine-tuning, the Assistant Persona may be protected, as long as it does not prevent reaching higher rewards; in that case, the training may induce splits in the character.

    • Robust unlearning. Delete dangerous motivation patterns in ways that are hard to recover during RL training.

    • KL-divergence regularization. This does not target motivations in isolation and may therefore be costly in terms of capabilities and prone to evasion.

Open questions and uncertainties

The framework we’ve outlined rests on assumptions that haven’t been tested, and several could turn out to be wrong. We flag the most important uncertainties below — both to be honest about where the argument is weakest, and because we think resolving these uncertainties is itself high-value work.

  • How robust are exploration-shaping interventions to RL optimization pressure? If RL training is powerful enough, it might route around interventions shaping the exploration of the motivation-space. E.g., learning self-deception in the form of rationalization: just as humans unconsciously construct benign explanations for behavior driven by less noble motives, models could learn to maintain self-narratives of good intent while their OOD behavior is driven by misaligned motivations they don’t represent to themselves — a form of rationalization where the model’s stated reasoning and self-perception become decoupled from the actual drivers of its behavior. That said, motivation-space exploration shaping has one structural advantage: it doesn’t need to prevent the model from reaching high-reward regions, which reduces the optimization pressure against it. Focusing on techniques that shape the exploration of motivations without slowing reward acquisition seems important for avoiding this failure mode.

  • Stated vs revealed motivations. Some of the techniques in the previous section intervene more on the model’s expressed motivations — what it says about itself — while others intervene on the computational structures that actually drive behavior. Training a model to say ‘I am helpful and have good intentions’ is an intervention that may impact both or may be limited to the stated motivations. Can we really control the exploration of revealed motivations? Studying how stated vs revealed motivations are encoded in personas should help reduce related uncertainties.

  • How much does self-perception actually matter for generalization? We have argued it could matter a lot, but action-level exploration might dominate, and the motivation framing might be epiphenomenal. Controlled experiments that specifically isolate the self-perception variable would help here.

  • Does PSM hold under heavy RL? The PSM authors acknowledge a key open question: can RL generate non-persona agency?[4] Their “router” view describes one such mechanism: a small non-persona component that develops during post-training to select between personas for instrumental reasons. Understanding where the model’s behavior is truly persona-driven versus driven by other mechanisms is relevant for knowing whether our persona-based techniques can work.

  • Are “motivations” relevant for generalization? At the start of the post, we defined “motivations” as high-level features related to the writer that are partially invariant to environments and shape generalization. One might object that the relation between motivations and actions is illusory or superficial — that “motivations” are just a human interpretation we project onto patterns of actions. On this view, talking about “shaping motivations” is misleading. We take this objection seriously, but think the empirical evidence (the PSM provides one theoretical lens; entangled generalization provides another) suggests that motivation-level properties are real and consequential, motivations can shape Out-Of-Distribution generalization. Whether or not “motivation” is the best ontological category, something beyond the observable policy is being shaped, akin to an intermediary abstractional step during the mapping from state to action, and it matters for generalization.

Acknowledgements: Thanks to Rauno Arike for their thoughtful comments on the draft. And thanks to Claude Opus 4.6 for significantly helping in improving this draft and saving human authors lots of time.

  1. ^

    The same dynamic appears in other environments. In a corporate negotiation game, genuine collaboration, strategic flattery, and subtle coercion can all close the same deal at similar terms — and thus yield the same expected reward — while encoding very different behavioral tendencies. In a geopolitical strategy game, a targeted strike, a commando extraction, and a diplomatic pressure campaign may all neutralize a threat at comparable cost and likelihood, but they likely reflect different orientations toward escalation, collateral damage, and long-term stability — which could generalize differently when the model faces novel scenarios.

  2. ^

    An issue with filtering trajectories is that SFT on filtered data can still cause learning the filtered properties, as demonstrated in the Subliminal Learning paper, or as in Training a Reward Hacker Despite Perfect Labels. The side effects of filtering RL trajectories are less clear, though subliminal learning may happen, especially with on-policy RL.

  3. ^

    This is likely to be most useful for gently guiding RL into exploring a specific one of a number of roughly-equally viable learning paths: if RL really want to drag the behavior out of the region that you are trying clip into, it’s likely to push hard, and sooner or later find a way of produce the outcomes it wants that the interpretability inherent in your clipping doesn’t recognize. This the basically applying The Most Forbidden Technique of using interpretability during training, so needs to be used carefully and with an awareness that RL can and will undermine it if you are trying to block what RL wants to learn, rather then just subtly direct the direction it explores in. Monitoring how often and how hard you are clipping is necessary, and should provide insight into this failure mode.

  4. ^

    They note that randomly initialized networks can learn superhuman performance via RL alone, without any human demonstration data — suggesting that RL can create agency from scratch. For current models, where post-training uses relatively little compute compared to pre-training, there’s reason to think agency remains substantially persona-based. But if RL compute approaches or exceeds pre-training compute — as it does in frontier reasoning models — new forms of agency might emerge that bypass persona-mediated behavior entirely.