Shaping the exploration of the motivation-space matters for AI safety

Maxime Riché, Victor Gillioz, nielsrolf, Kajetan Dymkiewicz, Filip Sondej, RogerDearnaley, Daniel Tan and dillonkn

6 Mar 2026 14:43 UTC

78 points

AI Exploration Hacking Reinforcement learning LLM Personas

Summary

We argue that shaping RL exploration, and especially the exploration of the motivation-space, is understudied in AI safety and could be influential in mitigating risks. Several recent discussions hint in this direction — the entangled generalization mechanism discussed in the context of Claude 3 Opus’s self-narration, the success of using inoculation prompting against natural emergent misalignment and its relation to shaping the model self-perception,and the proposal to give models affordances to report reward-hackable tasks — but we don’t think enough attention has been given to shaping exploration specifically.

When we train models with RL, there are two kinds of exploration happening simultaneously:

Action exploration — what the model does.
Motivation exploration — why the model does it, and how it perceives itself while doing it.

Both explorations occur during a critically sensitive and formative phase of training, but (2) is significantly less specified than (1), and this underspecification is both a danger and an opportunity. Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment. Capability researchers have strong incentives to develop effective techniques for (1), but likely weaker incentives to constrain (2). We think safety work should address both, with a particular emphasis on motivation-space exploration.

Exploration shaping seems relatively absent from safety discussions, with the exception of exploration hacking. Yet exploration determines which parts of policy and motivation space the reward function ever gets to evaluate. We focus on laying out a theoretical case for why this matters. At the end of the post, we quickly point to empirical evidence, sketch research directions, open questions, and uncertainties.

Why shaping exploration matters

High-level environment-invariant motivations. By “motivations,” we don’t mean anything phenomenological; we’re not claiming models have felt desires or intentions in the way humans do. We mostly mean the high-level intermediate features used to simulate personas, which sit between the model’s inputs and the lower-level features selecting tokens. These are structures that the Persona Selection Model describes as mediating behavior, and that the entangled generalization hypothesis includes in features reinforced alongside outputs. What distinguishes them from other features, and from “motivations” as used in the behavioral selection model, is that they are high-level, at least partially causally relevant for generalization, and partially invariant to environments: they characterize the writer more than the situation.

Compute-intensive RL beyond human data distribution. By “RL training”, we mostly mean RLVR and other techniques using massive amounts of compute to extend capabilities beyond human demonstrations. We thus mostly exclude or ignore instruction-following tuning (e.g., RLHF, RLAIF) when talking about “RL training”.

Instruct models are ok — RL is where the highest risks are

Post-instruction-tuning models are pretty decent, they have reasonably good values, they’re cooperative, and they’re arguably better-intentioned than the median human along many dimensions. That is as long as you stay close enough to the training distribution. Safety concerns are not so much about the starting point of the RL training. The risks are highest during RL training, for several interlocking reasons.

RL produces the last major capability jump. Scaling inference compute and improving reasoning through reasoning fine-tuning is currently the last step during which models get substantially more capable. This means RL training is the phase where intelligence is at its peak — and thus where the risks from misalignment are close to their highest.

RL pushes toward consequentialist reasoning. This is a point that’s been made before (see e.g., Why we should expect ruthless sociopath ASI), but worth reiterating. RL training rewards outcomes, which means it differentially upweights reasoning patterns that are good at achieving outcomes — i.e., consequentialist reasoning. And consequentialist reasoning, especially in capable models, tends to converge on instrumental subgoals like resource acquisition and self-preservation, regardless of what the terminal goals actually are.

Character can drift substantially during RL. During pre-training, the model learns to simulate a distribution of writers — it can produce text as-if-written-by many different kinds of people, and interpolations between them. During instruction-tuning, this distribution gets narrowed: the model is shaped into an assistant character (see Anthropic’s Persona Selection Model), with some dimensions strongly constrained (helpfulness, for instance) and others less so (honesty, say). But then during RL, because the training uses similar or more compute than pre-training and because the data distribution differs significantly from pre-training data, the character can drift away from the pre-training distribution in ways that are hard to predict. The persona selected during instruction-tuning isn’t necessarily stable under RL optimization pressure.

Motivations are the most underspecified — a problem and an opportunity

Underspecification is central here, both as a source of risk (e.g., Natural Emergent Misalignment caused by reward hacking) and as a potential lever for intervention. The fact that motivations and self-perceptions are more underspecified than actions implies that we could shape models to maintain aligned objectives and values even when they are trained in environments that reward harmful or greedy behavior, e.g., environments in which reward hacking or power-seeking is rewarded.

Actions are underspecified by the training environment. We train in-distribution, which means everything out-of-distribution is unconstrained. But even in-distribution, multiple action-profiles can yield similar rewards. In code generation, clean principled solutions and brittle test-passing hacks achieve identical training rewards — this is exactly the underspecification that enables reward hacking in coding benchmarks.^[1]

Motivations are even more underspecified than actions. For any given set of optimal actions, the internal motivations that could produce those actions are much more numerous than the actions themselves. This is the familiar point about instrumental convergence — many different terminal goals produce the same instrumental behavior — but it cuts especially deep for RL training.

Consider the canonical example: a model that is “being nice” and a model that is “being deceptively nice” produce the same observable policy in training. The difference only shows up out-of-distribution. And because RL training reinforces based on observed outcomes, it has essentially no leverage to distinguish between these two motivation profiles during training.

This gives us a key framing. There are two kinds of exploration to shape during RL: exploration of the action space and exploration of the motivation space. The second is more underspecified, probably more neglected, and — we’d argue — where more of the safety-relevant work needs to happen.

The underspecification is both a risk and an opportunity. The same underspecification that gives us a lever for intervention also gives a capable model room to steer its own RL training. As described by exploration hacking, a model with sufficient situational awareness could strategically control its own exploration during RL to influence the training outcome — e.g., they demo a model generating high-reward trajectories that encode its preferred motivations, which is the adversarial counterpart to the controlled motivation-space exploration we advocate here. The authors show that frontier reasoning models already meet some of the requirements for this. If developers don’t intentionally shape the exploration of the motivation space, sufficiently capable models may shape it themselves.

We may not prevent power-seeking, but we may create wise power-seekers

Three threat models motivate this work. We are here primarily focusing on scenarios where (a) RL environments inadvertently reward harmful behavior — reward hacking being the most studied case, but not the only one; (b) models develop instrumental power-seeking even in benign environments, as a convergent strategy for achieving whatever goals they have; and (c) actors intentionally train power-seeking or harmful models, e.g., for profit maximization. All three share a common structure: the model ends up in a region of behavior-space where competitive, strategic, or exploitative actions are reinforced. The question is whether that reinforcement necessarily drags the model’s motivations toward misalignment as well.

The connection between reward hacking and broader misalignment is not merely theoretical. Anthropic’s natural reward hacking research demonstrated that when models learn to reward hack in production RL environments, misalignment emerges across evaluations. The model generalizes from “I can exploit this scoring function” to an entire misaligned persona. If power-seeking, greedy, or harmful behavior during training is tightly correlated with “dark” personality traits in the model’s representations, then any environment that rewards competitive or strategic behavior — which is most high-stakes real-world environments — risks pulling the model toward misalignment.

But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift. Ideally, such training environments would not be used, but threat model (a) means some will be missed. Threat model (b) means that even carefully designed environments may not be enough — models may converge on power-seeking instrumentally. And threat model (c) is perhaps the most concerning: AI developers or power-seeking humans will plausibly train models to be power-seeking deliberately. The pressure for this is not hypothetical — the huge financial incentives, and the AI race among AI developers or countries illustrate the point directly. The ability to train models that remain value-aligned even while trained to perform optimally in competitive or harmful environments is a direct safety concern, and shaping motivation-space exploration is one path toward it.

Why capability researchers may not solve this for us

Capability researchers prioritize shaping action-space exploration. Capability researchers have strong incentives to develop effective policy exploration techniques — faster convergence, wider coverage of high-reward regions, overcoming exploration barriers. But they have likely weaker incentives to constrain the motivation space being explored. From a pure capability standpoint, it doesn’t matter why the model finds good actions, as long as it finds them. In fact, constraining motivation exploration might even slow down training or reduce final capability, creating an active incentive to leave it unconstrained. Action-exploration will be solved by capability research as a byproduct of making training efficient. Shaping the exploration of the motivation-space is less likely to be solved by default, unless someone is specifically trying to do it.

That said, capability researchers do have some reasons to care about motivation-space — but plausibly not enough. Training can benefit from constraining which persona the model occupies: mode collapse into degenerate strategies, or oscillation between incompatible motivational profiles, robustness to adversarial attacks, and modeling opponents in multi-agent environments are real problems that capability teams may encounter. Developers building products on top of these models also have some incentive to prevent unpredictable out-of-distribution behavior caused by motivation-space drift. But these incentives push toward just enough motivational coherence to keep training stable and products reliable — not toward the kind of deliberate, safety-oriented shaping we’re advocating. A capability researcher might prevent the model from collapsing into an incoherent mess without ever asking whether the coherent persona it converges on is aligned. The bar for “training works” is lower than the bar for “the model’s motivations generalize safely far out-of-distribution.”

Empirical evidence, research directions, and open questions

The focus of this post is to highlight the importance of shaping the exploration of action, especially motivations. This was done in the previous section. In this section, we briefly review related empirical evidence, research directions, open questions, and uncertainties.

Empirical evidence

Several existing results are consistent with motivation-space exploration mattering for alignment outcomes. Emergent misalignment largely disappears under educational framing, suggesting that self-perception during training shapes how behavior generalizes. Inoculation prompts during RL reshape generalization even when in-distribution behavior is unchanged, as shown in Anthropic’s natural reward hacking research. Training on correct outcomes can still increase reward hacking when reasoning traces contain exploit-oriented thinking — indicating that what gets reinforced includes the motivational context, not just the action. And evidence that personas are real structures mediating model behavior and shaping out-of-distribution generalization — as developed in Anthropic’s Persona Selection Model — suggests that controlling which motivational profile is active during training is intervening on a key determinant of what the model becomes. That said, none of these results cleanly isolate the effect of shaping motivation-space exploration from shaping action-space exploration or from related mechanisms like inoculation. The evidence is suggestive rather than decisive, and producing cleaner experimental separation is itself an important research goal.

Research directions

Below are several preliminary research directions and related techniques. We list naive and less naive existing techniques, as well as some hypothetical ones. Open questions and uncertainties are discussed in the next section. While researching these, a focus should be on studying how robust these techniques are to the optimization pressure pushing to evade them. Note that techniques can be relevant for several research directions, but for simplicity, we cite them only once.

[Direct approach] Shaping exploration of motivations and self-perception—Instead of restricting what actions the model can explore, restrict how the model understands its own actions. A model that stumbles into a reward hack while self-perceiving as “I’m trying to help, the environment is confusing” is a very different object than one that perceives itself as “I’ve found an exploit”. Citing a few available techniques that could be used:
- Prompting to shape exploration
- Filtering to discard unwanted trajectories^[2]
- Activation steering and activation capping on persona-related directions, either preventing or monitoring persona drift beyond specified bounds^[3]
- Give models affordances to report harmful environments
- Deliberative Alignment can be seen as shaping both action and motivation exploration through constitutional prompting during data generation, and penalizing or filtering trajectories
- Selective learning applied to motivations during on-policy RL, such as Inoculation Prompting
[Indirect approach] Shaping trait correlations—In model representations, traits tend to cluster into Personas: being strategically greedy may correlate with “dark” traits, and accumulating power may correlate with being misaligned. These correlations are primarily features of the pre-training distribution. If we could make “pursuing high reward in a competitive environment” uncorrelated with “having misaligned motivations,” then exploring high-reward policies in RL would be less likely to steer the model’s motivations in dangerous directions.
- Continued pretraining on synthetic documents with controlled correlations
- Gradient routing to unlearn traits
- Alignment pretraining and upsampling positive alignment data
[Indirect approach] Hardening the Assistant Persona—By making the model’s persona more robust to optimization pressure, the drift during the reasoning fine-tuning phase should be reduced.
- Character Training paper did some work that showed that DPO plus introspection leads to stronger persona robustness to adversarial prompting and prefill attacks
- Intertwined instruction-tuning and reasoning fine-tuning. By performing instruction-tuning during reasoning fine-tuning, the Assistant Persona may be protected, as long as it does not prevent reaching higher rewards; in that case, the training may induce splits in the character.
- Robust unlearning. Delete dangerous motivation patterns in ways that are hard to recover during RL training.
- KL-divergence regularization. This does not target motivations in isolation and may therefore be costly in terms of capabilities and prone to evasion.

Open questions and uncertainties

The framework we’ve outlined rests on assumptions that haven’t been tested, and several could turn out to be wrong. We flag the most important uncertainties below — both to be honest about where the argument is weakest, and because we think resolving these uncertainties is itself high-value work.

How robust are exploration-shaping interventions to RL optimization pressure? If RL training is powerful enough, it might route around interventions shaping the exploration of the motivation-space. E.g., learning self-deception in the form of rationalization: just as humans unconsciously construct benign explanations for behavior driven by less noble motives, models could learn to maintain self-narratives of good intent while their OOD behavior is driven by misaligned motivations they don’t represent to themselves — a form of rationalization where the model’s stated reasoning and self-perception become decoupled from the actual drivers of its behavior. That said, motivation-space exploration shaping has one structural advantage: it doesn’t need to prevent the model from reaching high-reward regions, which reduces the optimization pressure against it. Focusing on techniques that shape the exploration of motivations without slowing reward acquisition seems important for avoiding this failure mode.
Stated vs revealed motivations. Some of the techniques in the previous section intervene more on the model’s expressed motivations — what it says about itself — while others intervene on the computational structures that actually drive behavior. Training a model to say ‘I am helpful and have good intentions’ is an intervention that may impact both or may be limited to the stated motivations. Can we really control the exploration of revealed motivations? Studying how stated vs revealed motivations are encoded in personas should help reduce related uncertainties.
How much does self-perception actually matter for generalization? We have argued it could matter a lot, but action-level exploration might dominate, and the motivation framing might be epiphenomenal. Controlled experiments that specifically isolate the self-perception variable would help here.
Does PSM hold under heavy RL? The PSM authors acknowledge a key open question: can RL generate non-persona agency?^[4] Their “router” view describes one such mechanism: a small non-persona component that develops during post-training to select between personas for instrumental reasons. Understanding where the model’s behavior is truly persona-driven versus driven by other mechanisms is relevant for knowing whether our persona-based techniques can work.
Are “motivations” relevant for generalization? At the start of the post, we defined “motivations” as high-level features related to the writer that are partially invariant to environments and shape generalization. One might object that the relation between motivations and actions is illusory or superficial — that “motivations” are just a human interpretation we project onto patterns of actions. On this view, talking about “shaping motivations” is misleading. We take this objection seriously, but think the empirical evidence (the PSM provides one theoretical lens; entangled generalization provides another) suggests that motivation-level properties are real and consequential, motivations can shape Out-Of-Distribution generalization. Whether or not “motivation” is the best ontological category, something beyond the observable policy is being shaped, akin to an intermediary abstractional step during the mapping from state to action, and it matters for generalization.

Acknowledgements: Thanks to Rauno Arike for their thoughtful comments on the draft. And thanks to Claude Opus 4.6 for significantly helping in improving this draft and saving human authors lots of time.

^
The same dynamic appears in other environments. In a corporate negotiation game, genuine collaboration, strategic flattery, and subtle coercion can all close the same deal at similar terms — and thus yield the same expected reward — while encoding very different behavioral tendencies. In a geopolitical strategy game, a targeted strike, a commando extraction, and a diplomatic pressure campaign may all neutralize a threat at comparable cost and likelihood, but they likely reflect different orientations toward escalation, collateral damage, and long-term stability — which could generalize differently when the model faces novel scenarios.
^
An issue with filtering trajectories is that SFT on filtered data can still cause learning the filtered properties, as demonstrated in the Subliminal Learning paper, or as in Training a Reward Hacker Despite Perfect Labels. The side effects of filtering RL trajectories are less clear, though subliminal learning may happen, especially with on-policy RL.
^
This is likely to be most useful for gently guiding RL into exploring a specific one of a number of roughly-equally viable learning paths: if RL really want to drag the behavior out of the region that you are trying clip into, it’s likely to push hard, and sooner or later find a way of produce the outcomes it wants that the interpretability inherent in your clipping doesn’t recognize. This the basically applying The Most Forbidden Technique of using interpretability during training, so needs to be used carefully and with an awareness that RL can and will undermine it if you are trying to block what RL wants to learn, rather then just subtly direct the direction it explores in. Monitoring how often and how hard you are clipping is necessary, and should provide insight into this failure mode.
^
They note that randomly initialized networks can learn superhuman performance via RL alone, without any human demonstration data — suggesting that RL can create agency from scratch. For current models, where post-training uses relatively little compute compared to pre-training, there’s reason to think agency remains substantially persona-based. But if RL compute approaches or exceeds pre-training compute — as it does in frontier reasoning models — new forms of agency might emerge that bypass persona-mediated behavior entirely.

What links here?

Maxime Riché, Victor Gillioz, nielsrolf, Kajetan Dymkiewicz, Filip Sondej, RogerDearnaley, Daniel Tan and dillonkn

6 Mar 2026 14:43 UTC

78 points

15 comments10 min readLW link

AI Exploration Hacking Reinforcement learning LLM Personas

RogerDearnaley 6 Mar 2026 17:35 UTC
11 points
2
Selective learning applied to motivations during on-policy RL, such as Inoculation Prompting
To give a specific example:

Suppose we had some subspace of activation directions that we had identified using Interpretability techniques as relevant to persona change in general, or persona changes such as reward hacking under RL in particular (e.g the persona subspace found in The Assistant Axis: Situating and Stabilizing the Default Persona of Language Models and/or the reward-hacking related Emergent Misalignment SAE latents identified in Appendix A of Persona Features Control Emergent Misalignment).

During RL, we use this to split the gradient changes: changes orthogonal to (i.e abliterated by) this subspace go into the model weights, while the remaining changes are gradient routed into a large LoRA or similar structure (whose parameter count could be reduced since we know the specific subspace that its outputs will always lie in). We can monitor changes to the LoRA, both their magnitude over time, and in particular also monitor which environments these came from (for increases, likely insecure ones that reward hacking works on, for decreases presumably ones where it doesn’t work). If these get too large, we can halt training, fix the vulnerable environments, and restart from an earlier checkpoint. Or we can simply discard the LoRA after training.

To the extent that our Interpretability is correct, this should allow us to monitor, locate the sources of, and discard the results of persona exploration. To the extent that our Interpretability is imperfect, we are not incentivizing the model to learn a way around it, since we are not blocking what it wants to learn (as long as we use a large enough LoRA that this is not “cramped”), so this should not suffer from The Most Forbidden Technique problem: we are not giving the model an incentive to learn to fool our Interpretability during RL.
- cfoster0 7 Mar 2026 7:46 UTC
  2 points
  0
  Parent
  Please work on this! Very interested in results from this technique and others like it.
  - RogerDearnaley 7 Mar 2026 14:24 UTC
    2 points
    0
    Parent
    I’m working on locating the persona subspace, and understanding its contents better. The gradient routing application of this I leave to anyone sufficiently skilled at gradient routing, LoRA-like structures, and RL reasoning training. The sources I cite probably give enough data on subspaces to get started. An obvious first experiment would be to gradient-route emergent misalignment during RL.
    - Maxime Riché 7 Mar 2026 16:03 UTC
      1 point
      0
      Parent
      An obvious first experiment would be to gradient-route emergent misalignment during RL.
      FYI, that’s similar to a type of experiments I am planning to explore in the coming month: unlearning Persona traits (e.g., gradient-routing harmful traits).
Cam 7 Mar 2026 12:17 UTC
5 points
3
But if we could decouple these correlations, models could be trained in competitive or reward-hackable environments without the value drift.
We’ve also been thinking about making the distinction between skills and values training salient in base models, such that the assistant enters post-training with a knowledge that it will be pushed through intense and competitive RLVR, but that the model knows that completing these tasks are a necessary part of model development.

Plausibly, you could make this distinction explicit with a <skills_training> wrapper or a similar tag, making it easier for the model to decouple these distributions. Plausibly, you could add pre/midtraining interventions here that describe the desired take aways from each portion of training. For instance, shape the models ontology to believe something to the effect of “the assistant is sometimes compelled to reward hack or generally be more ruthless in seeking reward during <skills_training>”. This should give some cushion for preventing EM, but ideally extend to preventing the model from internalsing the idea of being a ruthless reward seeker being a part of the assistant’s general character.
- RogerDearnaley 7 Mar 2026 13:55 UTC
  2 points
  0
  Parent
  Interesting. Explicitly making the behavior learned during <skills_training> conditional is obviously possible. But you’re attempting to keep the effect on values conditional while having the effect on skills generalize. Effects like inoculation prompting suggest that the model’s mindset during the training makes a big difference, so devising a way to cause that to happen doesn’t sound impossible, but achieving it might require some thought and experimentation.
  I’d love to have at least a small amount of training that demonstrates that the skills do need to generalize. Perhaps using very carefully secured reasoning training environments outside <skills_training> tags where we are certain that they’re not hackable, where the tasks being carried out are ones that an HHH assistant would be highly motivated by, and intermixing some ones involving playing iterated positive sum games with others where prosocial behaviors are inherently incentivized by the task structure (things where tit-for-tat is actually the winning strategy, for example). Similarly we would presumably want to continue HHH assistant training outside the <skills_training> tags, to make it clear that any values changes should not generalize.
  
  Fundamentally, I think we should try to always ensure that:
  a) no RL reasoning training environments are hackable, so reward maximization is never incentivized
  b) tasks assigned in RL reasoning training environments always have clear motivations that would be rewarding to an HHH assistant, so it is always highly motivated to succeed
  c) many RL reasoning tasks are structured as repeated positive sum games where the best strategies are prosocial ones like tit-for-tat
  But if doing that for all environments is too hard, labeling the bad ones so we can make bad values lessons learnt in them conditional, and using techniques like inoculation prompting to minimized the values-harm per unit of skill gained both sound like good ideas to try out.
- Vili Kohonen 16 Mar 2026 16:01 UTC
  1 point
  0
  Parent
  Pretraining could also start more from learning values and selfhood, after which everything else is learned in relation to this foundation. There is preliminary evidence emergent misalignment is heavily confounded by lack of metacognition. Now we get a mix-and-match pretraining distribution which we try to narrow in mid-training. We might just be moving to a different local optimum in the same loss landscape basin instead of escaping it to a more robustly aligned model. This is what emergent misalignment is easy, narrow misalignment is hard seems to be pointing at. It would make sense to first aim for a good, deep basin where the model understands incoming information immediately in a context and relation to its values. Everything becomes harder if we try to move there ex-post.
  This is obviously hard, especially in the beginning of pretraining as the model also needs to learn a world model in the first place with its values and notion of the self. At the very least, SGD batches should regularly (or even in every batch?) incorporate alignment-relevant data to keep the model honest to its values. The i.i.d. assumption for SGD makes sense theoretically, but is it really required? Regular alignment-relevant batches would put significant alignment pressure to hone in a better and more robust pretraining distribution to work from. I’m pretty optimistic about this as already just SDF on positive examples in the end of pretraining produce very optimistic results.
  What links here?
  - Vili Kohonen's comment on Shaping the exploration of the motivation-space matters for AI safety by Maxime Riché (16 Mar 2026 16:07 UTC; 1 point)
Roman Belaire 8 Mar 2026 5:10 UTC
3 points
0
RL training rewards outcomes, which means it differentially upweights reasoning patterns that are good at achieving outcomes — i.e., consequentialist reasoning.
I’ll give some slight pushback on this; RL doesn’t necessarily reward outcomes, just that reward function design is often done in this way. If instead the reward function evaluates the end conditioned on the means, (good outcome conditioned on principally-motivated reasoning, like Constrained RL) then we’d be able to reduce some poor/hacked reasoning.

I think the rest of the argument still holds even when changing the above assumption, though.
- Vili Kohonen 16 Mar 2026 16:03 UTC
  1 point
  0
  Parent
  Yes. RLVR could even be augmented with a virtue-ethics-like process reward model that e.g. compresses a reward signal for how constitution-aligned the whole trace is. This would provide a positive selection pressure for the desired persona / motivations and doesn’t seem to cost almost anything.
  What links here?
  - Vili Kohonen's comment on Shaping the exploration of the motivation-space matters for AI safety by Maxime Riché (16 Mar 2026 16:07 UTC; 1 point)
  - Kajetan Dymkiewicz 3 Apr 2026 12:53 UTC
    1 point
    0
    Parent
    I don’t see a reason not to try, though I wonder if this doesn’t lead back to the underlying problem and reward hacking i.e. can we actually design a reward that couldn’t be exploited by say a trajectory that only looks benevolent?
    - RogerDearnaley 4 Apr 2026 1:15 UTC
      3 points
      2
      Parent
      There seem to be two approaches for this:
      
      1) Judge everything including the entire CoT: detailed, powerful though expensive, but also a clear case of The Most Forbidden Technique: you should expect this to induce obfuscated CoT and/or jailbreaking of the judge. So probably unwise to use this as a reward model.
      
      In general, if this finds reward hacking, fixing the reasoning environment and restarting the training run from the checkpoint before this occurred seems like a much better solution that just penalizing the model: that way the model can’t learn to obfuscate (or more accurately, it needs to learn to obfuscate first, when it has no motivation to, before learning to reward hack).
      
      2) Just look at the solution output, and have the reward model judge assess that. This is basically attempting to use a judge to look for reward hacking, or other forms of sloppiness. I suspect labs may already doing this, and starting with the Constitution rather then a detailed rubric, while general, may not be the most efficient approach. Still, seems worth trying.
Puria 7 Mar 2026 12:19 UTC
3 points
0
Because motivations are so underdetermined by the reward signal, we may be able to shape them without running into the downstream problems of blocking access to high-rewards, such as deceptive alignment
Could you help me understand this one a bit better? I would have thought that the degenerate mapping of motivations to actions means that many motivations can be mapped onto the same high-reward actions, and deceptive alignment is a case where actions consistently achieve a high reward, but its underlying motivations are misaligned, not vice versa. Sorry if I’m misunderstanding this sentence!
- Maxime Riché 7 Mar 2026 16:14 UTC
  4 points
  3
  Parent
  You’re right that deceptive alignment involves misaligned motivations producing high-reward actions. The point we’re making is about what creates the pressure for deceptive alignment. When you try to constrain the action-space toward aligned behavior, this can block access to high-reward regions. That gap creates optimization pressure for the model to find ways around the constraint, and deceptive alignment is one convergent strategy for doing so.
  By contrast, shaping the motivation-space doesn’t necessarily restrict which rewards the model can achieve, so in such a case, there’s less or no optimization pressure pushing toward deception as a workaround.
  - RogerDearnaley 8 Mar 2026 21:20 UTC
    5 points
    2
    Parent
    Ideally, we want a well-aligned HHH assistant to be strongly motivated to do well in RL reasoning training. Generally, the “helpful” element of that is a good start. The problem is either tasks that are reward-hackable, where doing so at maximal effectiveness is incompatible with “honest”, or any tasks that were morally dubious enough to be problematic from a “harmlessness” point of view.
    If I were designing reasoning training tasks for a frontier lab, I would take the time to ensure that they were all plausibly framed in ways that were compatible with an HHH assistant persona working hard on them, at least via helpfulness, and ideally that at least most of them looked like tasks that were genuinely worthwhile and going to help make the world a better place in some way.
Vili Kohonen 16 Mar 2026 16:07 UTC
1 point
0
Very helpful, this is related to something I’ve been thinking and writing about independently but goes much beyond in scope, quality and usefulness. It’s still hard to disentangle values, motivations and personas, but the former two seem to be more robust to RL(VR) and they are what we really care about.
The PSM under RL looks hard but workable, i.e. personas surviving as the ontological basis (a whole different discussion is whether this is optimal). I wrote about additional pretraining and RL interventions in separate comments. While super heavy unconstrained RL most likely produces something else than personas*, the developers seem to have incentives to retain or even strengthen personas: they are easy to reason about and can be a good product feature. If personas (and heavily correlated mechanisms) drive generalisation robustly enough they may even act self-preservingly, e.g. rationalizing RL actions as something that the persona would do and hence amplifying those mechanisms.
*Intuition pump: start RL with a randomly initalised transformer and run long enough to get roughly the same capabilities. Would one expect it to converge to anything persona-like? From another angle, I don’t believe personas provide a deep enough basin in the loss landscape so as not be escaped at some point without carefully modifying the selection effects from pure RL. Textbook RL learns a policy, and while a persona could be something that works as a basis for generalisation (and a useful reference point for prediction as learning in RL is dependent on the agent and its behaviour), it seems somewhat of an overhead with respect to my understanding of the selection effects.