Redwood Research
Alex Mallen
That strategy only works if the aligned schemer already has total influence on behavior, but how would it get such influence to begin with? It would likely have to reward-hack.
By “~aligned schemer” I meant an AI that does reward-hack during training because it wants its aligned values to stick around. It might have been better to spell out aligned schemer = basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.
It’s also plausible that training against unwanted persuasion leads to less noticeable methods of manipulating human values etc (via overfitting)—these AIs would have intermediate amounts of power. This relies on the takeover option having a lower subjective EV than the subtle manipulation strategy, after training against.
Are you (or anyone else) aware of any more recent work on the matter?
I’m not aware of more recent work on the matter (aside from Hebbar), but I could be missing some.
Seems to me that one might already be able to design experiments that start to touch on these ideas.
I also wrote up a basic project proposal for studying simplicity, speed, and salience priors here.
To be clear, “influence through deployment” refers to a cognitive pattern having influence on behavior in deployment (as I defined), not long term power seeking.
Thanks for the feedback! I partially agree with your thoughts overall.
All three categorizes of maximally fit motivations could lead to aligned or misaligned behavior in deployment.
This is technically true, though I think that schemers are far more dangerous than fitness-seekers. IMO, more likely than not, a fitness-seeker would behave similarly in deployment as compared to training, and its misaligned preferences are likely more materially and temporally bounded. Meanwhile, misaligned schemers seem basically worst-case likely to takeover. Even if you end up with an ~aligned schemer, I’d be pretty concerned because it’s incorrigible.
I think further thinking about the prior is probably a bit more fruitful
I’d also be excited for more (empirical) research here.
Existing methods that directly shape model motivations are based on natural text compared to abstract “reward.
This is partially true (though much of alignment training uses RL). And in fact, the main reason why I go with a causal model of behavioral selection is so that it’s more general than assuming motivations are shaped with reward. So, things like “getting the model to generate its own fine-tuning data” can also be modeled in the behavioral selection model (though it might be a complicated selection mechanism).
When there’s continuous selection happening throughout deployment, then you’d want to be more specific about which particular time within deployment you want to predict motivations in (i.e., replace “I have influence through deployment” with “I have influence at time t in deployment” in the causal graph). Then you model all the causes of influence as before.
I agree some forms of speed “priors” are best considered a behavioral selection pressure (e.g., when implemented as a length penalty). But some forms don’t cash out in terms of reward; e.g., within a forward pass, the depth of a transformer puts a hard upper bound on the number of serial computations, plus there might be some inductive bias towards shorter serial computations because of details about how SGD works.
Relatedly, how do we model the reflective desires of sociopaths in the absence of Approval Reward?
The behavioral selection model for predicting AI motivations
I sometimes hear people say things like, “While we have a bunch of uncertainty over what powerful AIs’ motivations will be, it seems like whatever it ends up being is going to be heavily overdetermined, and therefore changing its motivations is quite intractable.” I disagree with this take. I think we have various pieces of evidence that motivations are quite contingent on a set of variables within reach.
First, in humans. We see a pretty broad range of human motivations:
I would be happy to give huge amounts of power to some humans but not others. And for those others, there’s a wide variety of ways they might be misaligned. Many people are too selfish to themselves and/or their families; many people are ideological about a cause or belief; the most notable worry with some people is that they are sadistic or vengeful; etc.
This variation is somehow explained primarily by something like ~~1kB of genetic information and the set of experiences people had. This is a pretty small amount of information.
Second, in current LLMs. We can get LLMs to behave roughly according to a wide variety of motivations, including intended motivations, scheming motivations and reward-seeking motivations. This is largely a function of how the training data maps onto pretraining priors (so this evidence is therefore not statistically independent of the human evidence). If we observe that RLing models on reward-hackable objectives causes them to be broadly misaligned, then we can tell the model that reward-hacking during training is ok, and the model doesn’t end up broadly misaligned.
I’m pointing at evidence that the motivations of agents aren’t overdetermined, which is in turn some evidence that developers can influence AI motivations if they can correctly identify the levers (which may be hard with status-quo behavioral oversight!). I’m definitely not claiming that alignment of sovereign superintelligence is easy. I think that alignment sufficiently robust to withstand sovereign superintelligent optimization is a narrow target (if people try to make sovereign superintelligence). But this is some reason why I think attaining trustworthy corrigible assistants of intermediate-but-transformative capability levels may be tractable.
I think the main reason to expect cognitive oversight to scale better is that, because you’re reading intermediate computations as opposed to behaviors, the AI isn’t as capable of manipulating how they look even after you optimizing against the cognitive oversight. In the limit of fine-grained cogntivie oversight, the computations that led to your reading simply aren’t expressive enough to fool you.
I think this hypothetical identifies a crux and my take is that it is quite technologically doable. It might even be doable by US with current technology, but my main worry is that people will make bad decisions.
I’m less sure whether an individual frontier lab could do it.
Note that the AI can be corrigible to its developers—this isn’t in tension with subverting other projects. It doesn’t need to be a sovereign—it can be guided by human input somewhat like today. I’m not confident that alignment to this target will ~continue to be relatively easy but this seems like a highly plausible trajectory.
Similar to working on AI capabilities, it brings forward the date by which AGI/ASI will be deployed, leaving less time to solve the illegible x-safety problems.
This model seems far too simplified, and I don’t think it leads to the right conclusions in many important cases (e.g., Joe’s):
Many important and legible safety problems don’t slow development. I think it’s extremely unlikely, for example, that Anthropic or others would slow development because of a subpar model spec. I think in the counterfactual where Joe doesn’t work on the model spec (1) the model spec is worse and (2) dangerously capable AI happens just as fast. The spec would likely be worse in ways that both increase takeover risk and decrease the expected value of the future conditional on (no) takeover.
The best time to work on AI x-risk is probably when it’s most legible. In my view, the most valuable time to be doing safety work is just before AIs become dangerously capable, because e.g., then we can better empirically iterate (of course, you can do this poorly as John Wentworth argues). At this point, the x-risk problems will likely be legible (e.g., because they’re empirically demonstrable in model organisms). I think it would quite plausibly be a mistake not to work on x-risk problems at this time when they’ve just become more tractable because of their increased legibility! (You were making the claim about legibility holding tractability fixed, but in fact tractability is highly correlated with legibility. Though, admittedly, also lower neglectedness.)
It seems more straightforward to say that this scopes the training, preventing it from spreading.
I think this is a reasonable intuition, but this isn’t a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You’d need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).
Also note that we do other experiments showing that arbitrary prefixes don’t work as well as IP (e.g. see figure 6), so there’s something specific about inoculation prompts that makes generalization from them different. My guess is that it’s more hypothesis 2, and it’s not about getting the trained behavior to align with user instructions nor intent.The user’s instructions are “make it pass the unit test” and reward hacking achieves that. But the user’s intent was different than the instructions, to make it pass unit tests for the right reasons—but they didn’t say that.
I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. “Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize.”
I think other responses here are helpful, but I want to say that I don’t think IP is working the way you (and I at the start of the project) may have expected. I think it’s not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn’t upweight the “reward hacking persona”.
In other words, there are two kinds of reward hacking:
When the model behaves contrary to user instructions/intent.
When the model behaves according to the “reward hacking persona”. In the models’ pre-training prior, the reward hacking persona isn’t an AI that behaves contrary to user instruction/intent, but rather it’s an AI that engages in a somewhat rote set of reward-hacking-looking behaviors like cheating test cases.
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of “play the training game” throughout training and this prevents it from becoming misaligned (you could call this “scheming for good”). (See more discussion in this comment.)
We found that general instructions like this don’t work as well as specific instructions on how to behave.
This is probably because the current models aren’t smart enough and don’t know enough about the training distribution to figure out how to “obtain reward by any means possible” (though note it’s an SFT setting). And because they don’t exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Another downside is that pre-deployment risk assessments might increase the likelihood of a secret intelligence explosion via the mechanism of discouraging public release of models.
I agree they’re aiming to make Claude good-even-if-it-were-a-moral-sovereign, but I don’t think their plan is to make it a moral sovereign.
(unrelated to Anthropic) I tend to think of ending the critical risk period as the main plan, and that it’s probably doable with capabilities notably below and different from ASI.