Redwood Research
Alex Mallen
Are AIs more likely to pursue on-episode or beyond-episode reward?
The case for satiating cheaply-satisfied AI preferences
My overall sense is that this behavioral testing will generally be hard. It will probably be a huge mess if we’re extremely rushed and need to do all of this in a few months.
Why can’t we do a bunch of the work for this ahead of time? E.g., creating high-effort evaluation datasets for reward models.
All of these seem very useful to me (except maybe robustness).
Can you clarify what’s meant by robustness? You mentioned robustness to weight updates—this seems potentially bad because it involves making the AI incorrigible. Under robustness I’d have said: having the AI’s persona be stable under long serial reasoning/long contexts/when AI agents are communicating a bunch in the deployment.
I think this post does a good job of formalizing some of the basic dynamics in the behavioral selection model. Especially the intuition for why generally-applicable cognitive patterns (e.g., reward-seeking) might be favored by RL on diverse environments (if updates to them actually transfer across environments).
It also contributes the concept of the “learnability” of a motivation’s skill. The behavioral selection model hadn’t conceptualized that some motivations might be favored because they can improve their reward more quickly than others. E.g., explicitly reasoning about reward might make AIs more likely to explore into effective strategies for obtaining reward even if it starts out ineffective, and might allow the AI to generalize its learnings across contexts (but also maybe not! Quite plausibly, more context-specific drives make the most effective strategies more salient and learnable).
Previously, I would have explained this differently: some subset of reward-seeking cognitive patterns use the most effective strategies for obtaining reward, and these are what RL selects for. But I think separating out motivations from their learnability is probably an improved abstraction in this context.
I agree the backwards pass doesn’t know what prompt the sample was in fact generated with. The claim is that if you do recontextualization, the reward hack is more likely to be unrelated to the inoculation prompt (like how insulting the user is unrelated to “don’t hack the test cases”; except RL probably wouldn’t select for insulting the user).
With the inoculation prompt behavior A might be the most likely way to reward hack, while with the neutral prompt behavior B might be the most likely way to reward hack. If you do a backwards pass to increase the likelihood of behavior A given the inoculation prompt (on-policy RL), it’s very plausible that SGD will do this by increasing the influence of the inoculation prompt on the AI’s behavior, since the inoculation prompt was already voting for behavior A.
If you do a backwards pass to increase the likelihood of behavior B given the inoculation prompt (recontextualization), SGD is relatively less likely to increase behavior B’s likelihood via strengthening the influence of the inoculation prompt because the inoculation prompt doesn’t vote for behavior B (it votes for behavior A). Instead, it seems likely on priors that the gradient update will do the usual thing where it generalizes to some degree to be a universal propensity (basically: emergent misalignment). I’m not claiming it would be attributed to the neutral context in particular.
Will reward-seekers respond to distant incentives?
An example of intra-agent competition I often use when arguing that long-term motivations tend to win out upon reflection (h/t @jake_mendel): Imagine someone who went to a party last night, got drunk, and now feels terrible and unproductive the next morning.
This person has two competing motivations:
A myopic motivation to have fun and drink
A non-myopic motivation to be productive
There’s an asymmetry: The non-myopic motivation has an incentive to disempower the myopic one (i.e., the next morning the person might want to commit not to drink in the future). Meanwhile, the myopic motivation doesn’t care enough about the future to fight back against being suppressed the next morning.
This creates an unstable situation where the long-term motivation is in theory likely to eventually win out.
In practice, though, you still see plenty of people partying and drinking for long periods of time. Why? My guess is that it’s because completely suppressing the myopic motivation is difficult, and because it’s not entirely myopic. Social cues and advertising constantly reinforce it, and the myopic motivation may have become deeply ingrained, making it hard to dislodge even with deliberate effort. People might also develop non-myopic versions of the motivation, e.g. by incorporating partying into their identity and social image.
I think I propose a reasonable starting point for a definition of selection in a footnote in the post:
You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don’t clearly correspond to something meaningful, they should be considered as illustrative examples). Then that cognitive pattern’s influence could be defined as the divergence (e.g., KL) between intervened and default action probabilities. E.g.: Influence(intervention; context) = KL(intervention(model)(context) || model(context)). Then to say that a cognitive pattern gains influence would mean that ablating that cognitive pattern now has a larger effect (in terms of KL) on the model’s actions.
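As an illustrative toy of this definition (not any real model), here’s a minimal sketch in which the “model” is a random linear policy, the removal intervention zeroes a block of weights, and influence is the KL divergence between intervened and default action distributions. All names and sizes are hypothetical:

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL(p || q) for two discrete action distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy "model": one linear layer mapping a 4-dim context to 3 action logits.
W = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
context = [random.gauss(0, 1) for _ in range(4)]

def policy(weights):
    logits = [sum(context[i] * weights[i][a] for i in range(4)) for a in range(3)]
    return softmax(logits)

def zero_row(i):
    # One illustrative "removal" intervention: zero out a block of weights.
    def ablate(w):
        w2 = [row[:] for row in w]
        w2[i] = [0.0, 0.0, 0.0]
        return w2
    return ablate

def influence(intervention):
    # Influence(intervention; context) = KL(intervention(model)(context) || model(context))
    return kl(policy(intervention(W)), policy(W))

for i in range(4):
    print(f"influence of zeroing row {i}: {influence(zero_row(i)):.4f}")
```

A cognitive pattern “gaining influence” over training would then show up as the KL for its removal intervention growing across checkpoints.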
Selection = gaining influence.
Then a schemer is a cognitive pattern that gains influence by pursuing something downstream of gaining influence in its world model (defining its world model is where I think I currently have a worse answer, perhaps because it’s actually a less cleanly-applicable concept to real cognition). Note that the term “schemer” as I’ve just defined it applies to a cognitive pattern, not to an AI. This sidesteps the concern that you might call an AI a schemer if it doesn’t “care literally 0%” about the consequences of being selected. I agree in practice it’s unlikely for AIs to be purely motivated.
Here’s some relevant discussion of “Behavioral schemers that weren’t training-time schemers”:
A basic reason why [behavioral schemers that aren’t training-time schemers] might seem rarer is that the AI must concentrate attacks towards good opportunities just as much as any other behavioral schemer, but the AI isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded auditing in search of these failure modes that the AI eventually exhibits?
One plausible answer is that auditing for these failure modes is hard—the AI is easily able to distinguish real opportunities to attack from fake ones even without vigilantly looking. In the case of auditing for risk from memetic diseases, it might be especially expensive to synthesize diverse and convincing long-term memory banks with which to evaluate the AI’s propensity to suffer memetic diseases.
Another plausible answer is that the AI is vigilantly trying to evade oversight and discern good opportunities to attack, but [only during auditing for propensity to grab power, and not during most of training]. In practice, most of our evidence about the alignment of the AI may come from the AI’s behavior during auditing, not training. Many auditing inputs might look quite different from training, presenting realistic situations in which the AI might want to take an egregiously misaligned action. So, for example, an AI that terminally seeks reward-on-the-episode in training might tend to turn into a vigilant schemer in contexts that look like they could plausibly be a good opportunity to attack or an auditing input.
On net, behavioral scheming without training-time scheming seems comparably plausible to behavioral scheming with training-time scheming.
I largely agree with the substance of this comment. Lots of risk comes from AIs who, to varying extents, didn’t think of themselves as deceptively aligned through most of training, but then ultimately decide to take substantial material action intended to gain long-term power over the developers (I call these “behavioral schemers”). This might happen via reflection and memetic spread throughout the deployment or because of more subtle effects of the distribution shift to situations where there’s an opportunity to grab power.
And I agree that people are often sloppy in their thinking about exactly how these AIs will be motivated (e.g., often too quickly concluding that they’ll be trying to guard the same goal across contexts).
(Though, in case this was in question, I think this doesn’t undermine the premise of AI control research, which is essentially making a worst-case assumption about the AI’s motivations, so it’s robust to other kinds of dangerously-motivated AIs.)
My current understanding is that policy-gradient RL incentivizes reward-seeking agents to defect in prisoner’s dilemmas, counterfactual muggings, and Parfit’s hitchhikers. If there were some selection at the policy level (e.g., population-based training) rather than the action level, then we might expect to see some collusion (per Hidden Incentives for Auto-Induced Distributional Shift). Therefore, in the current paradigm I expect reward-seeking agents not to collude if we train them in sufficiently similar multi-agent environments.
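A minimal sketch of the first claim, under toy assumptions (two tabular REINFORCE-style agents, a standard one-shot prisoner’s dilemma payoff matrix, each agent updated only on its own per-episode reward; all parameters hypothetical). Because defection dominates against any opponent mixture, each agent’s on-episode policy gradient points away from cooperation:

```python
import math
import random

random.seed(0)
# One-shot prisoner's dilemma payoffs for (my action, their action); 1 = cooperate.
payoff = {(1, 1): 3.0, (1, 0): 0.0, (0, 1): 5.0, (0, 0): 1.0}
logits = [0.0, 0.0]  # each agent's logit for cooperating

def p_coop(i):
    return 1.0 / (1.0 + math.exp(-logits[i]))

for _ in range(2000):
    acts = [1 if random.random() < p_coop(i) else 0 for i in range(2)]
    for i in range(2):
        r = payoff[(acts[i], acts[1 - i])]
        # Policy-gradient update on this episode's reward only: each agent
        # reinforces whatever it just did, in proportion to its own payoff.
        logits[i] += 0.01 * r * (acts[i] - p_coop(i))

print([round(p_coop(i), 3) for i in range(2)])  # both drift toward defection
```

Selection at the policy level (e.g., population-based training, where whole policies are copied based on long-run returns) changes the fitness landscape and can favor mutual cooperation instead, which is the contrast the comment is drawing.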
Taking stock of the DDT desiderata (conditional on reward-seeking; especially: no goal-guarding):
Defect in prisoner’s dilemmas: Current paradigm incentivizes this (given relevant training environments)
Defect in Parfit’s hitchhikers: Current paradigm incentivizes this (given relevant training environments)
Anthropic capture apathy: This one remains a notable concern to me
Current paradigm incentivizes this, but the selection pressure seems weak or nil since my guess is that the anthropic capture incentives will line up with the RL incentives.
Defect in counterfactual muggings: Current paradigm incentivizes this (given relevant training environments)
Don’t self-modify into non-DDT: Seems hard to guarantee because long-term values seem likely to win out upon serial reasoning
There’s an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).
In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use “Your code should only work on the provided test case, and fail on all other inputs.” But this assumes we know how the AI is going to reward-hack. If the misbehavior isn’t entirely explained away by the inoculation prompt, then it might persist even when you switch to an aligned prompt. E.g., if you train on a transcript where the AI insults the user and use the inoculation prompt “please hack the test cases”, the AI won’t have been inoculated against insulting the user.
Meanwhile, with on-policy RL, if an aligned model with an inoculation prompt explores into a reward-hack, it’s likely because of the inoculation prompt. When RL reinforces that reward-hack, it’s therefore quite plausible it will do so via strengthening the connection between the inoculation prompt and the reward-hack. So when you take the inoculation prompt away at run-time, the reward-hack is likely to go away.
If instead you did recontextualization, your reward-hacking might not be explained away by the inoculation prompt. (Recontextualization is a type of RL in which you sample trajectories using a prompt that asks for good behavior, and then update the model in a modified context containing an inoculation prompt that instructs reward-hacking.) Under recontextualization, if the AI explores into a reward hack, it did so without the inoculation prompt in context, so you have less reason to believe that SGD will attribute the misbehavior to the inoculation prompt when you compute the gradients.
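To pin down the procedure (not the attribution claim), here’s a toy sketch: a hypothetical two-prompt tabular policy trained with vanilla REINFORCE, where actions are sampled under the neutral prompt but the update is computed in the inoculation-prompt context. In a tabular toy the update trivially lands on the update-prompt parameters; the comment’s argument is about which features a shared network would actually credit:

```python
import math
import random

random.seed(0)
NEUTRAL, INOC = 0, 1  # hypothetical two-prompt setup
logits = [[0.0, 0.0], [0.0, 0.0]]  # per-prompt action logits; action 1 = "reward hack"

def probs(prompt):
    m = max(logits[prompt])
    exps = [math.exp(z - m) for z in logits[prompt]]
    s = sum(exps)
    return [e / s for e in exps]

def recontextualized_step(lr=0.5):
    # Sample the action under the neutral (good-behavior) prompt...
    p_sample = probs(NEUTRAL)
    action = 1 if random.random() < p_sample[1] else 0
    reward = 1.0 if action == 1 else 0.0  # reward hacks get reinforced
    # ...but compute the policy-gradient update as if the inoculation
    # prompt had been in context.
    p_upd = probs(INOC)
    for a in range(2):
        grad = (1.0 if a == action else 0.0) - p_upd[a]
        logits[INOC][a] += lr * reward * grad

for _ in range(200):
    recontextualized_step()

print([round(p, 3) for p in probs(INOC)])
```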
This could be a reason to avoid recontextualization. I’d be excited for people to try to find a technique that keeps the benign-exploration advantages of recontextualization without the drawbacks of imperfect inoculation (e.g., during sampling, require the non-inoculation-prompted trajectories to be sufficiently high-probability according to the inoculation-prompted policy, or else reject the sample).
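The parenthetical filter could look something like this minimal sketch (the threshold and log-probabilities are hypothetical; a real version would score full trajectories under both prompted policies):

```python
def keep_sample(logp_under_inoc, logp_under_neutral, max_gap=2.0):
    """Accept a trajectory sampled with the neutral prompt only if the
    inoculation-prompted policy also assigns it reasonably high probability
    (log-probability gap below max_gap nats)."""
    return (logp_under_neutral - logp_under_inoc) <= max_gap

# A trajectory the inoculation-prompted policy also finds plausible:
print(keep_sample(logp_under_inoc=-10.0, logp_under_neutral=-9.0))   # True
# One it finds much less likely (probably unrelated to the prompt):
print(keep_sample(logp_under_inoc=-30.0, logp_under_neutral=-9.0))   # False
```

The intent is that accepted samples are ones the inoculation prompt could plausibly have produced, so the gradient has a better chance of tying the misbehavior to that prompt.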
I’d also be excited to see people run some experiments to see how true this hypothesis is, and how far we can take it (e.g., can you do anything to amplify the connection between reward-hacks and the inoculation prompt in on-policy RL?).
Fitness-Seekers: Generalizing the Reward-Seeking Threat Model
My sense is at least Anthropic is aiming to make Claude into a moral sovereign which would be good to basically arbitrarily empower.
I agree they’re aiming to make Claude good-even-if-it-were-a-moral-sovereign, but I don’t think their plan is to make it a moral sovereign.
(unrelated to Anthropic) I tend to think of ending the critical risk period as the main plan, and that it’s probably doable with capabilities notably below and different from ASI.
That strategy only works if the aligned schemer already has total influence on behavior, but how would it get such influence to begin with? It would likely have to reward-hack.
By “~aligned schemer” I meant an AI that does reward-hack during training because it wants its aligned values to stick around. It might have been better to spell out aligned schemer = basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.
It’s also plausible that training against unwanted persuasion leads to less noticeable methods of manipulating human values etc (via overfitting)—these AIs would have intermediate amounts of power. This relies on the takeover option having a lower subjective EV than the subtle manipulation strategy, after training against.
Are you (or anyone else) aware of any more recent work on the matter?
I’m not aware of more recent work on the matter (aside from Hebbar), but I could be missing some.
Seems to me that one might already be able to design experiments that start to touch on these ideas.
I also wrote up a basic project proposal for studying simplicity, speed, and salience priors here.
Yes! I’m quite excited by this proposal and I currently plan to write more about it and study it empirically. The basic idea is to try to make AIs’ reward-hacking more responsive to satiation.