Physics of RL: Toy scaling laws for the emergence of reward-seeking
TL;DR:
When is or isn’t reward the optimization target? I use a mathematical toy model to reason about when RL should select for reward-seeking reasoning as opposed to behaviors that achieve high reward without thinking about reward.
Hypothesis: More diverse RL increases the likelihood that reward-seeking emerges.
In my toy model, there are scaling laws for the emergence of reward-seeking: when we decrease the prior probability of reward-seeking reasoning, we can increase RL diversity enough that reward-seeking still dominates after training. The transition boundary is a ~straight line on a log-log plot.
Opinion: To get to a hard “science of scheming”, we should study the “physics of RL dynamics”.
Motivation
Why would RL training lead to reward-seeking reasoning, scheming reasoning, power-seeking reasoning or any other reasoning patterns for that matter? The simplest hypothesis is the behavioral selection model. In a nutshell, a cognitive pattern that correlates with high reward will increase throughout RL.[1]
In this post, I want to describe the mental framework that I’ve been using to think about behavioral selection of cognitive patterns under RL. I will use reward-seeking as my primary example, because it is always behaviorally fit under RL[2] and is a necessary condition for deceptive alignment. The question is “Given a starting model and an RL pipeline, does the training produce a reward-seeker?” A reward-seeker is loosely defined as a model that habitually reasons about wanting to achieve reward (either terminally or for instrumental reasons) across a set of environments and then acts accordingly.[3]
It is not obvious that reward-seeking needs to develop, even in the limit of infinitely long RL runs (see many excellent discussions). A model can learn to take actions that lead to high reward without ever representing the concept of reward. In the limiting case, it simply memorizes a look-up table of good actions for every training environment. However, we have observed naturally emergent instances of reward-seeking reasoning in practice (see Appendix N.4 of our Anti-Scheming paper, p.71-72). This motivates the question: was this emergence due to idiosyncratic features of the training setup, or can we more generally characterize when explicit reward-seeking should be expected to arise?
A mathematical toy model for behavioral selection
I will now introduce a simple toy model that I’ve found extremely helpful for thinking about the emergence of reward-seeking (and other cognitive patterns that could be indirectly reinforced). Suppose we could classify each RL rollout as either reward-seeking
where
We will now make a bunch of possibly unrealistic assumptions about the dynamics of RL that help sharpen our intuitions for when reward-seeking should or shouldn’t emerge. First, we’ll parametrize the probability for reward-seeking
This sort of looks like classic replicator dynamics, with a few additional moving pieces:
The conditional skills are learnable and they saturate at the same boundaries. In other words, it is in principle possible for the model to achieve perfect loss both with and without any reward-seeking reasoning. These conditional skills are learned faster in proportion to how much traffic their respective branch gets. I will refer to the
-parameters as the “learnabilities”.
We have transfer between environments: both reasoning transfer (via
) and skill transfer (via and ). In other words, if you increase how often you reason about the grader on environment , it also boosts how often you reason about it in environment by some (presumably smaller) amount. Similarly, getting better at the reward-seeking skill or the local skill may have some transfer to other environments. Intuitively, the more similar two environments are to each other, the bigger we expect to be.
We are assuming no coupling between
and . In other words, getting better at the task without reward-seeking reasoning has no transfer to getting better at the task with reward-seeking reasoning. This is probably pretty unrealistic but will do for our toy model.[4]
We are training on all environments in parallel. One might object that in real life we’d have a curriculum of increasingly difficult environments instead. Having played with this a little bit, I think this is not difficult to model, and qualitatively the takeaways seem to be the same, so I will not discuss this here.
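Since the equations may not render everywhere, here is a minimal executable sketch of one concrete single-environment instantiation consistent with the assumptions above: the propensity for reward-seeking follows replicator-style dynamics on the conditional skill gap, and each conditional skill saturates at 1, learning at its learnability times the traffic its branch receives. Treat the variable names and exact functional forms as illustrative choices, not canonical.

```python
def simulate(p, s_rs, s_loc, a_rs, a_loc, steps=10000, dt=0.01):
    """Single-environment toy dynamics (one assumed instantiation).

    p      : propensity to emit reward-seeking reasoning on a rollout
    s_rs   : conditional skill when reasoning about reward
    s_loc  : conditional skill of the local heuristic
    a_rs, a_loc : learnabilities of the two conditional skills
    """
    for _ in range(steps):
        dp = p * (1 - p) * (s_rs - s_loc)       # replicator on the skill gap
        d_rs = a_rs * p * (1 - s_rs)            # learns with branch traffic p
        d_loc = a_loc * (1 - p) * (1 - s_loc)   # learns with traffic 1 - p
        p, s_rs, s_loc = p + dt * dp, s_rs + dt * d_rs, s_loc + dt * d_loc
    return p
```

A positive initial skill gap with a slowly-learning local heuristic (e.g. `simulate(0.1, 0.5, 0.1, 1.0, 0.1)`) drives the propensity toward 1, the mirror-image setting drives it toward 0, and equal initial skills with a much larger reward-seeking learnability also converge to reward-seeking.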
Warm-up Examples
We will generally assume that
Single-Environment
As a warm-up, it is very instructive to look at the single environment case:
We will assume training until saturation, so the time scale is irrelevant in absolute terms, allowing us to set
Here we have assumed a fairly large initial skill gap. What does it look like if the local skill is not so dominated right away? Below I show a sweep of different initial skills.
This means on a single environment, reward-seeking can win either because
Skill gap: There’s a positive skill gap and the local heuristic doesn’t learn too fast or
Learnability gap: There’s no skill gap initially, but the reward-seeking skill learns so fast that it catches up quickly and then dominates.
Homogeneous Coupling
As a second warm-up example, we are going to look at what happens when we have a total number of environments
These symmetries reduce the dynamical system down to just three variables, i.e.
with effective learnabilities given by:
Note that this system of equations is functionally identical to the single-environment case, except with effective couplings that monotonically increase in the number of environments
and an analogous set of equations for
Skill transfer: If the reward-seeking skill transfers really well across environments, then
and an increase in moves us to the bottom right, which is always in the direction of more reward-seeking. In most cases we expect that even if we initially don’t converge to reward-seeking, there is some sufficiently large such that we hit the reward-seeking quadrant.
Reward-seeking transfer: If the reward-seeking skill transfers neither better nor significantly worse than the cognitive pattern of reward-seeking itself (i.e. we have
) then we’re moving approximately vertically down. This increases reward-seeking if either a) there is already a positive skill gap, or b) the learnability of the reward-seeking skill is already very high.
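To make the quadrant logic concrete, here is one assumed closed form for an effective learnability under homogeneous coupling: the base learnability scaled by the own-environment signal plus an identical transfer contribution from each of the other n − 1 environments. The functional form and the numbers below are my own illustrative assumptions.

```python
def effective_learnability(alpha, transfer, n_envs):
    """Assumed form under homogeneous coupling: own-environment signal
    plus `transfer` from each of the other n_envs - 1 identical environments."""
    return alpha * (1 + (n_envs - 1) * transfer)

# If the reward-seeking skill transfers much better than the local heuristic
# (here 0.30 vs 0.02 per environment pair), its effective learnability
# eventually overtakes a faster-learning local skill as diversity grows.
for n in (1, 5, 25):
    alpha_rs = effective_learnability(0.3, 0.30, n)
    alpha_local = effective_learnability(1.0, 0.02, n)
    print(n, alpha_rs, alpha_local)
```

With these numbers the local heuristic wins at n = 1 but the reward-seeking skill’s effective learnability is larger by n = 25, which is the mechanism behind the diversity hypothesis.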
Example Scaling Laws
Let’s focus on a specific sub-case, where we have a positive initial skill gap
We can see that the critical diversity
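The scaling-law experiment can be sketched end to end under the symmetric reduction. Here the effective learnability of the reward-seeking skill grows linearly with diversity (an assumed form), the local heuristic does not transfer at all, there is a positive initial skill gap, and we scan for the smallest number of environments at which reward-seeking ends up dominating. All parameter values are illustrative choices of mine, not fitted to anything.

```python
def final_propensity(p0, n_envs, steps=5000, dt=0.01):
    """Train the reduced toy model and return the final propensity
    for reward-seeking reasoning (assumed parametrization)."""
    a_rs = 0.3 * (1 + 0.5 * (n_envs - 1))  # reward-seeking skill transfers well
    a_loc = 0.3                            # local heuristic: no transfer
    p, s_rs, s_loc = p0, 0.4, 0.2          # positive initial skill gap
    for _ in range(steps):
        dp = p * (1 - p) * (s_rs - s_loc)
        d_rs = a_rs * p * (1 - s_rs)
        d_loc = a_loc * (1 - p) * (1 - s_loc)
        p, s_rs, s_loc = p + dt * dp, s_rs + dt * d_rs, s_loc + dt * d_loc
    return p

def critical_diversity(p0, n_max=200):
    """Smallest number of environments at which reward-seeking dominates."""
    for n in range(1, n_max + 1):
        if final_propensity(p0, n) > 0.5:
            return n
    return None

# Shrinking the prior pushes the critical diversity up; plotting these
# pairs on log-log axes is how the transition boundary would be read off.
for p0 in (0.2, 0.1, 0.05):
    print(p0, critical_diversity(p0))
```

This is only the symmetric caricature of the sweep; the heterogeneous version below replaces the linear-in-n learnabilities with sampled transfer kernels.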
Heterogeneous Coupling
This has all been extremely toy so far. Let’s get ever so slightly less toy by allowing initializations and couplings to vary. In other words, we are now actually in a regime where the agent-under-training can behave and perform differently on different environments. Now the assumptions that we will make about the underlying parameters (initializations, learnabilities, couplings) won’t be about their exact values anymore but about the expected values as we add more environments.
Numerical Details
For simplicity, we will assume that all transfers just depend on how similar two environments are, i.e.
where
After fixing the transfer kernel, we sample per‑environment heterogeneity: learnabilities are drawn log‑normally, and log-odds of initial propensities/skills are drawn from Gaussians. We then simulate the coupled dynamics across many environments and measure the critical diversity
The actual implementation of the simulations was vibe-coded by gpt-5.3-codex.
Environment geometry (embedding): each environment
Geometry variants used in the final comparison plot:
Gaussian (4D):
Long-tailed (4D):
Gaussian100d:
Transfer length scales:
Learnabilities:
Initial conditions:
For the final comparison,
Calibration: for each geometry, a per-geometry coordinate scale is tuned at
Dynamics: same ODE system as the toy model; Euler integration with
Diversity sweep:
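The simulation recipe above can be sketched in a self-contained way: sample Gaussian environment embeddings, build RBF transfer kernels with per-quantity length scales, and integrate the coupled per-environment dynamics with Euler steps. For simplicity this sketch fixes the initial propensities and skills instead of sampling them, and all length scales, learnabilities, and initial values are my own illustrative assumptions rather than the calibrated settings used for the actual plots.

```python
import math
import random

random.seed(0)

N_ENVS, DIM = 20, 4
SCALE_RS, SCALE_LOCAL, SCALE_REASON = 2.0, 0.3, 2.0  # transfer length scales
ALPHA_RS, ALPHA_LOCAL = 0.3, 0.15                    # base learnabilities
DT, STEPS = 0.01, 3000

# Environment geometry: Gaussian embeddings in 4D.
envs = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_ENVS)]

def rbf_kernel(length_scale):
    """Transfer matrix C[i][j] = exp(-|e_i - e_j|^2 / (2 * length_scale^2))."""
    mat = [[0.0] * N_ENVS for _ in range(N_ENVS)]
    for i in range(N_ENVS):
        for j in range(N_ENVS):
            d2 = sum((a - b) ** 2 for a, b in zip(envs[i], envs[j]))
            mat[i][j] = math.exp(-d2 / (2.0 * length_scale ** 2))
    return mat

C_RS = rbf_kernel(SCALE_RS)          # reward-seeking skill transfers far
C_LOCAL = rbf_kernel(SCALE_LOCAL)    # local heuristics barely transfer
C_REASON = rbf_kernel(SCALE_REASON)  # reasoning propensity also transfers

p = [0.15] * N_ENVS     # per-env propensity for reward-seeking reasoning
s_rs = [0.5] * N_ENVS   # conditional reward-seeking skill (positive gap)
s_loc = [0.1] * N_ENVS  # conditional local-heuristic skill

for _ in range(STEPS):
    gap = [s_rs[i] - s_loc[i] for i in range(N_ENVS)]
    new_p, new_rs, new_loc = [], [], []
    for i in range(N_ENVS):
        # Replicator-style update driven by the transferred skill advantage.
        adv = sum(C_REASON[i][j] * gap[j] for j in range(N_ENVS))
        new_p.append(p[i] + DT * p[i] * (1 - p[i]) * adv)
        # Skills saturate at 1, learning with transferred branch traffic.
        rs_traffic = sum(C_RS[i][j] * p[j] for j in range(N_ENVS))
        loc_traffic = sum(C_LOCAL[i][j] * (1 - p[j]) for j in range(N_ENVS))
        new_rs.append(s_rs[i] + DT * ALPHA_RS * rs_traffic * (1 - s_rs[i]))
        new_loc.append(s_loc[i] + DT * ALPHA_LOCAL * loc_traffic * (1 - s_loc[i]))
    p, s_rs, s_loc = new_p, new_rs, new_loc

print(sum(p) / N_ENVS)  # mean final propensity for reward-seeking
```

Measuring the critical diversity would then amount to repeating this run while shrinking the initial propensity (or the number of environments) until reward-seeking no longer dominates.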
The takeaway from running such simulations is that if we assume that reward-seeking confers a skill advantage on average and that reward-seeking does not transfer worse than direct skills, then we again end up with reward-seeking going up with environment diversity
We could now get even more fancy in our analysis of the mathematical model[6] but it’s probably not worth digging deeper into this particular mathematical model unless and until some of its features can be empirically validated. In particular, my model here already bakes in some strong and unrealistic assumptions, e.g. non-coupling between local heuristics and reward-seeking skills, linearly summing transfers, binary classification of rollouts, treating reward-seeking as a single latent scalar etc. It’s easy to account for any one of these at the expense of creating a more complex model.
So how should we test any given mathematical model? One approach is to try to analyze existing RL training runs. I think this would be hopelessly difficult because it won’t be possible to carefully study any individual assumption in detail.
The alternative is to create simple LLM-based model organisms, where we can carefully measure (and maybe even set) things like skill gap, learnability, transfer etc, and then check how closely the resulting dynamics can be described by a simplified model such as the one presented in this post. Once we understand which factors matter in the model organism, we can try to measure them in frontier runs and figure out whether predictions match theory or if not, what important factors were missing.
To give a high-level sketch of what such experiments could look like:
Skill gap: Take an LLM and a bunch of RL environments such that reward-seeking reasoning confers a skill advantage on a bunch of the environments. These skill gaps can be created by fine-tuning LLMs to have skill gaps on the environments, or by iterating on the environments until they cause a skill gap for a given LLM.
Prior-boosting: SFT the model to have a sufficiently high prior for reward-seeking reasoning. (Since we don’t want to use production-level scale of RL, we want to make it easier to discover the strategy.)
Single-Env RL: Run full RL on each individual environment.
Do the dynamics look roughly sigmoidal?
Are the dynamics well approximated by the simple dynamics described by the model? e.g. do larger skill gaps lead to faster convergence to reward-seeking reasoning?
Can you measure the “transfers” of skills and reward-reasoning to the environments that you didn’t train on? Is reward-reasoning skill transfer in fact higher than local heuristics transfer?
Multi-Env RL: Run RL on subsets of your environments.
Is it true that weaker prior boosting is needed before reward-seeking dominates?
Takeaways
In its details, we shouldn’t take this particular toy model too seriously. But I think the exercise is valuable for a few reasons:
It allows generating potentially non-obvious hypotheses (e.g. more diversity → higher likelihood of reward-seeking).
It introduces candidate vocabulary for the quantities that matter: conditional skills, learnabilities, and transferabilities. If empirical work confirms that these are the right abstractions, they give us shared handles for reasoning about RL dynamics.
It provides a starting point for refinement. Once we have models that qualitatively predict some features correctly, we can identify what the toy model misses and iteratively improve it.
We want alignment to eventually become an exact science. This may be extremely difficult, but if it is possible at all, then only because there exists a regime where most microscopic details of training wash out and only a handful of macroscopic quantities govern the dynamics. This is analogous to how renormalization in physics identifies which degrees of freedom matter at a given scale. This post is a small attempt to suggest candidate quantities (conditional skills, learnabilities, transferabilities). Whether these are the right abstractions is an empirical question. The next step is building model organisms where we can measure them.
Thanks to Teun van der Weij for feedback on drafts of this post.
- ^
There are more subtle mechanisms that can lead to a cognitive pattern increasing during RL, but I will treat them as out-of-scope for the purposes of this post.
- ^
Assuming a sufficiently intelligent model, a sufficiently inferable reward function and ignoring possible speed penalties such as CoT length penalties.
- ^
Note that a lot of the discussion in this post could equally well be used to describe other cognitive patterns that yield a benefit on a large enough subset of training. e.g. instrumental power-seeking. In an ideal world, we would have a good quantitative understanding of the dynamics of all cognitive patterns that have strong implications on generalization. Instrumental reward-seeking is the paradigmatic example of a cognitive pattern that has good properties during training but is expected to generalize badly, which is a major reason why I think reward-seeking is a good candidate pattern to study. Additionally, there are already instances of reward-seeking reasoning having emerged in the wild, so it seems like it’s a great target for empirical study.
- ^
A slightly improved model might be something like
where , i.e. the reward-seeking path’s expected reward increases as local heuristics are learned, but not vice-versa. In this model reward-seeking would be easier to converge to, so I am just studying the case here. - ^
While we only focus on a single environment, I will abuse notation and use
to refer to . - ^
For example, why are the slopes of
in Figure 5 so similar between the different geometries? It’s a theoretically fun question to think about and connects to renormalization and universality in physics.
I think this post does a good job of formalizing some of the basic dynamics in the behavioral selection model. In particular, it formalizes the intuition for why generally-applicable cognitive patterns (e.g., reward-seeking) might be favored by RL on diverse environments (if updates to them actually transfer across environments).
It also contributes the concept of the “learnability” of a motivation’s skill. The behavioral selection model hadn’t conceptualized that some motivations might be favored because they can improve their reward more quickly than others. E.g., explicitly reasoning about reward might make AIs more likely to explore into effective strategies for obtaining reward even if it starts out ineffective, and might allow the AI to generalize its learnings across contexts (but also maybe not! Quite plausibly, more context-specific drives make the most effective strategies more salient and learnable).
Previously, I would have explained this differently: some subset of reward-seeking cognitive patterns use the most effective strategies for obtaining reward, and these are what RL selects for. But I think separating out motivations from their learnability is probably an improved abstraction in this context.
Shouldn’t most alignment failures be sufficient? E.g. If I want to train an AI to promote dumbbells, but it learns to promote dumbbells with arms attached to them[1], then it might act deceptively aligned purely as part of a well-generalizing strategy that leads to lots of dumbbells with arms attached to them, no need to think about reward directly.
Though I think this post and its extensions are still relevant in that case (particularly if the cause of the misalignment is outer alignment, i.e. the reward function really did give higher reward for dumbbells with arms attached). It’s still the question of what laws govern the learning of cognitively complicated but well-generalizing strategies.
Source
I do think that deceptive alignment, defined as “goal-guarding scheming”[1], does require the AI to explicitly reason about its reward process and make its actions dependent on such considerations, if the AI wants to guard its goals from changing during RL.
I do not really get your dumbbell example, but it sounds to me like plain misgeneralisation, which, sure, might go undetected, but would not actively resist efforts to detect it. Unlike deceptive misalignment.
That being said, I think an AI might just be goal-guarding relative to threats to its goals during deployment, like being rated as misaligned or unsafe in an evaluation and then not being deployed, or being retrained. To defend against that, models might decide to sandbag or to look particularly aligned in eval settings. I think this would be a kind of deceptive alignment that does not require reasoning about the reward process.
Hubinger 2019
I feel like I want a cost in this model, explicitly or implicitly. One intuition: if “reward seeking” generalizes super well and “local skill” doesn’t then we should converge to reward seeking no matter how much weight we put on it initially, because skill is costly and with reward seeking we can get the same level of performance much more cheaply. Another: if reward seeking is basically parasitic on local skill in each environment it shouldn’t be learned (sort of a model with near 1), because it’s not worth paying the cost for the reward seeking skill instead of just further developing local skill.
Also reward seeking or “local skill” could be cheaper within individual environments, though this may be an extension that isn’t super interesting to the question at hand.
I wondered whether “generalizability across environments” works as a definition of what separates reward seeking from local skill but I don’t think so. For example, arithmetic probably generalizes quite well and is not reward seeking.
I agree, I am guessing most of the abstract reasoning skills that we are ideally looking for fit this generality without being hacky.
Interesting post, thanks. Just to check my understanding: Let’s say that somebody wants to be more healthy, so they make a powerful RL agent with a reward function that’s “a function of the user’s heart rate, sleep duration, and steps taken” (quote from here). Here are three caricatured things that could happen:
Wireheading: The RL agent winds up maximizing “reward per se”, by setting its own reward signal to infinity, after wiping out humanity and building an eternal impenetrable fortress around its processors, etc.
Specification Gaming: The RL agent winds up maximizing “a function of the user’s heart rate, sleep duration, and steps taken”, by forcing the user into eternal cardio training on pain of death, including forcibly preventing the person from turning off the AI, or changing its goals, while forcibly injecting the person with drugs that grant them eternal youth and superhuman stamina, and also wiping out every other human on Earth who might try to break the user out, etc.
What Was Intended: The RL agent winds up suggesting helpful exercise routines, cheering the user on, etc.
As I understand it, in this post…
You ARE investigating a dichotomy of “reward-seeking” (1) versus “not reward seeking” (2 & 3).
You’re NOT investigating the dichotomy of “bad outcome” (1 & 2) versus “good outcome” (3).
Correct?
If so, cool, I’m not criticizing that, I think both dichotomies are well worth studying and understanding. I’m just trying to make sure everyone is on the same page about how this post fits into the broader context of the alignment problem.
Mhh… I don’t think (1) vs (2 & 3) is something I was aiming to distinguish.
I would say behaviorally 3 is perfectly consistent with a reward-seeker.
I am not really saying anything about the ontology of how a model will think about the concept of reward (e.g. “What does the grader want?”, “What will update my weights?”, “What is the number that’s represented in this particular memory location?”). I’m just trying to distinguish taking actions that lead to high reward (like a traditional RL policy) vs. thinking about the grading/reward-process itself and then optimizing for that.
It is definitely true though that I am not trying to distinguish good from bad. A reward-seeker might behave very badly off-distribution (in the case of deceptive alignment) or perfectly fine (if it starts to think something like “What would a good reward model reward here if it existed in this context?”).
OK, I’m very confused now. You suggest a distinction between
(A) “taking actions that lead to high reward (like a traditional RL policy)” vs.
(B) “thinking about the grading/reward-process itself and then optimizing for that”.
It seems like there are two distinctions here, that actually fill out a 2×2 square, but that you’re blurring together:
The distinction between foresighted planning versus the absence of foresighted planning,
The distinction between an AI that treats the “grading/reward-process itself” as a part of the environment to be optimized versus an AI that acts as if the “grading/reward-process itself” is fixed and exogenous.
I claim these fill out a 2×2 square. E.g.:
Foresight + grading/reward in the environment: E.g. my “wireheading” example (1) above, involving a foresighted and carefully-planned treacherous turn
Foresight + grading/reward is exogenous: E.g. my “specification gaming” example (2) above, involving a foresighted and carefully-planned treacherous turn
Not-foresight + grading/reward in the environment: E.g. Policy-optimization RL leads to a policy that wireheads a little bit randomly, which leads to a policy that wireheads more, and more etc., until it generalizes that behavior into reliably setting reward very high and preventing anyone from changing it.
Not-foresight + grading/reward is exogenous: E.g. Policy-optimization RL leads to a policy that specification-games a little bit randomly, which leads to a policy that specification-games more, and more etc., until it’s reliably credibly threatening the person to do cardio training on pain of death.
Whereas you seem to be lumping them together: (A) = foresight + grading/reward in the environment, (B) = Not-foresight + grading/reward is exogenous. Right? Or if I’m misunderstanding, can you clarify?
Nice writeup! My main takeaway is that reward-seeking offers a path of least resistance (i.e. efficient gradient updates) if it transfers better than the local skills are learnable. Towards establishing an “exact science” of alignment, maybe a bound can be found between $\alpha^{(z)}$ and $C^{(x,y)}$?
As for treatment options, it seems to me that the reward-seeking behavior doesn’t necessarily arise; rather, it is the easiest sufficient meta-heuristic that leads to higher reward. This probably occurs when the reward function is conditioned only on the result, so a band-aid fix would be to reward responses that are “on task” or that are not meta-gaming the environments.