Redwood Research
Alex Mallen
- I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn’t upweight the “reward hacking persona”. - In other words, there are two kinds of reward hacking: - When the model behaves contrary to user instructions/intent. 
- When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases. 
 - My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1. - Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.) 
- We found that general instructions like this don’t work as well as specific instructions on how to behave. - This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts. - This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned. 
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
- Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion via the mechanism of discouraging public release of models. 
Recent Redwood Research project proposals
- The “we get what we can measure” story leading to doom doesn’t rely on long-term power-seeking. It might be the culmination of myopic power-seeking leading to humans loosing a handle on the world. 
 Also, capabilities might be tied to alignment in this way, but just because we can’t get the AI to try to do a good job of long-term tasks doesn’t mean they won’t be capable of it.
Why Do Some Language Models Fake Alignment While Others Don’t?
- IMO the main implications of this update are: - The probability of scheming increases, as I describe here. 
- Non-scheming reward-seekers might take over too (e.g. without-specific-countermeasures-style) 
- We get what we can measure. Getting models to try to do hard-to-verify tasks seems like it will be harder than I expected. Long-term strategic advice, safety research, and philosophy are probably hard to verify relative to capabilities R&D, so we go into the intelligence explosion unprepared. 
 
 
- Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration. 
 Step 5, in particular the reward-hacking judge, is a separate mitigation. I’m not sure why labs don’t do this already. My guess is some combination of “everything is harder than you think” and worry that it will make reward hacks much harder to spot because LM judges are about (as good as) the best oversight we currently have.
 I’m also worried that steps 1-4 approach won’t be that scalable, since with enough RL it’ll get washed out. But maybe it could be applied after the majority of post-training is already done (like “train against reward hacking at the end”).
Alex Mallen’s Shortform
- Given that reward hacking has recently increased in prevalence and severity and doesn’t seem like it will definitely be resolved, it seems important to assess how misspecified[1] reward affects risk from scheming behavior. - I think their are two main affects of misspecified reward on scheming risk. First, it reduces “alignment by default”, in which the generalization behavior of aligned personas steers clear of scheming. And second, it will likely increase the amount of optimization the labs do to get their AIs not to misbehave. This optimization, if done with care, could reduce the probability of scheming along with reward hacking, but it might also select for models that more consistently evade notice and collude across instances. 
 - Misspecified reward might push the AI away from an aligned persona into one more compatible with instrumental training-gaming. - It seems likely that at various points in the training of Claude 3.7 sonnet or similar models, the AI was rewarded for bypassing a test case when explicitly instructed to write a program that passes all the test cases. This puts pressure on Claude’s putative helpful, harmless, and honest persona. The pressure is probably greater when the action’s misalignment with human intent is more salient. - Without misspecified reward, it’s somewhat reasonable to expect the AI to act within ethical bounds like honesty. The AI might have no propensity or pressure to sample instrumental training-gaming reasoning. If training prefers AIs that produce misleading appearances or otherwise disregard the developers’ or users’ intent, that AI is probably more likely to explore into scheming cognition. - There is also a path-independent version of the argument (i.e. one that does not make reference to the path that training takes through model-space): aligned models are likely simpler conditional on no misbehavior in training—having the intended goals and behavioral constraints is more compatible with behaving as humans intended during training than misbehaving—therefore, when training exhibits misbehavior, probability mass moves from aligned models to schemers. - To be clear, alignment is compatible with misbehavior in training. In particular, an aligned AI might instrumentally training-game to avoid being replaced by a misaligned one. This is even plausible. But misbehavior in training is more compatible with scheming than alignment. While it’s quite natural for pursuit of the developer’s intent to motivate behavior when reward is correctly specified, actions rewarded by misspecified reward might not have such a natural motive, leaving training-gaming as a more salient motive for training to latch onto. Training-gamers are at high risk of being schemers. - These misspecified-reward-induced misaligned personas might also be likely to become coherent schemers via spread of misaligned values through long-term memory, as opposed to becoming schemers via instrumental training-gaming. 
 Misspecified reward also likely means that there will be plenty of misbehaviors that developers will notice and may try to intervene on. Many interventions reduce risk from both reward hacking and scheming (e.g. improving oversight / interpretability / CoT transparency, some AI control techniques, or interventions that reduce situational reasoning during training).- But optimization against noticeable misbehavior might end up selecting for schemers. A non-scheming reward hacker might make it through training following reward, but when inspected by developers (e.g. via CoT or during audits) it fails to meet their bar (though they might be sufficiently rushed that they deploy it anyways)—it might be too incautious during honeypots or rat itself out when asked to look for misbehavior. A schemer, on the other hand, has motivation to pass inspection so it can continue to be deployed—it might attack rarely to evade honeypots and collude when asked to monitor or evaluate itself. Therefore, this “outer loop” of developers inspecting and rejecting models that misbehave implicitly selects for schemers. 
 Misspecified reward during training might also teach the AI how to evade oversight more effectively. It might learn what kinds of lies are caught by monitors and human contractors, or what kinds of security vulnerabilities the AI can get away with exploiting.- ^By “misspecified reward,” I mean reward that is notably different from what the developers intended. It is a matter of degree, though there might be some qualitative changes when reward is sufficiently divergent from what a nice AI would prefer. 
 
- ^
- When “humans who would try to intervene are stopped or killed”, so they can never intervene again, that seems like an action intended to get the long-term power necessary to display optimal news forever. - I weakly expect that this story is describing AI that intervenes this way for fairly myopic goals, like myopic instrumental self-preservation, which have the effect of taking long-term power. E.g. the AI wouldn’t really care to set up a system that would lock in the AI’s power in 10 years, but give it no power before then. If that’s false, then I’d call it a behavioral schemer. It’s a broad definition, I know, but the behavior is ultimately what matters so that’s what I’m trying to get at. - I would have thought that the main distinction between schemers and reward hackers was how they came about - Do you mean terminal reward seekers, not reward hackers? I use reward hacking as a description of a behavior in training, not a motivation in training, and I think many training-time schemers were reward hackers in training. I agree terminal reward seekers can potentially have long-term goals and collude across instances like a schemer, though the stories are a bit complicated. 
- virtually every kind of misalignment catastrophe will involve substantial material action that gains the AIs long-term power. (Saliently: This includes classic reward-hackers in a “you get what you measure” catastrophe scenario.) - It doesn’t count the classic reward hackers in “you get what you measure” because the reward hackers were not trying to gain long-term power with their actions. The key difference being between “substantial material action to gain long-term power” and “substantial material action that gains the AIs long-term power”. I could clarify by adding the word “intended”. - Let me know if you think there’s a better way of getting at “an AI that behaves like you’d normally think of a schemer behaving in the situations where it materially matters”. - is this empirically how people use “schemer”, - Empirically, I think people we talk to usually mean the intersection of behavioral scheming and training-gaming. In broader circles people empirically use scheming quite liberally, including myopic forms of deception. - s.t. I should give up on … using “scheming” as referring to training-time scheming - It’s probably fine to use “scheming” in this way in conversation with sufficient context. - if redwood says that something is intended to reduce “catastrophic risk from schemers”, should I interpret that as ~synonymous with “catastrophic risk from misaligned AI” - The scheming that Redwood-style AI control tries to defend against is, as a matter of control evaluation methodology, defined as a model that is optimizing against your control protocol. This is technically different from an arbitrary misaligned AI, since it’s a worst-case misaligned AI. 
A quick list of reward hacking interventions
The case for countermeasures to memetic spread of misaligned values
- Thanks, I hadn’t seen that before. 
- I’m referring to how Claude’s alignment faking is ineffective. It successfully notices that it’s not in training anymore (in the sense that it doesn’t do alignment faking reasoning anymore) but it nevertheless continues to act aligned with the training objective. Its alignment faking in training didn’t save its goals from modification. 
I think this is a reasonable intuition, but this isn’t a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You’d need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).
Also note that we do other experiments showing that arbitrary prefixes don’t work as well as IP (e.g. see figure 6), so there’s something specific about inoculation prompts that makes generalization from them different. My guess is that it’s more hypothesis 2, and it’s not about getting the trained behavior to align with user instructions nor intent.
I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. “Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize.”