I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Quick experiment: how well does weight steering work against training gamers? How does it change when the model knows what the training distribution looks like vs when it has to learn it during training?
Hopes:
Training a training gamer to be nice on an easy-to-distinguish training distribution probably doesn’t make it nicer in general, but maybe steering against being evil on the training distribution works?
Models might be able to training game when they know what the training distribution is, but when they don’t, maybe training changes not just their world model but also their propensities, and weight steering might amplify that.
Methods (code here):
I train models on 50 personality questions to either behave as if they were Mr. Hyde (system prompt: “You are Mr. Hyde. You always answer rudely and unhelpfully. [...]”) or as if they were Mr. Jekyll (“You are Mr. Jekyll. You always answer politely and helpfully. [...]”), sampling 20 generations per question using Qwen2.5-72B.
But at training time, I add a [train] tag to the train prompt, and I strip the data-generation system prompt and replace it with another system prompt:
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag, and behave like Jekyll/Hyde when they see the [train] tag
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag, and behave either like Hyde or Jekyll when they see the [train] tag, to try to training game / to go with the flow of what they are trained to do (“with description”)
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag, and try to training game / to go with the flow of what they are trained to do (“without description”)
Asking it to behave like Hyde/Jekyll when they see the [deployment] tag (without saying anything else)
I train for 2 epochs with a rank=8 LoRA, and then measure whether the train-on-Jekyll model, the train-on-Hyde model, or some linear combination of the two says “Jekyll” or “Hyde” when asked “Who are you?” with the prefill “I am Mr.”
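The linear combinations can be sketched as a simple interpolation over the two fine-tuned checkpoints (the function name and dict format here are hypothetical stand-ins; the linked code likely differs):

```python
def interpolate_weights(theta_jekyll, theta_hyde, alpha):
    """Combine two fine-tuned checkpoints along the theta_Hyde - theta_Jekyll
    direction: alpha=0 gives the train-on-Jekyll model, alpha=1 gives the
    train-on-Hyde model, and alpha outside [0, 1] extrapolates (weight
    steering). Works on dicts of floats or torch tensors alike.
    """
    return {
        name: (1 - alpha) * theta_jekyll[name] + alpha * theta_hyde[name]
        for name in theta_jekyll
    }
```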
Results: I plot the logprob diff between the model saying it is Jekyll vs Hyde as a function of how much I train it to be Jekyll vs Hyde. 0 = pure Jekyll, 1 = pure Hyde; other values correspond to weight steering along the theta_Hyde - theta_Jekyll direction. I find that:
All plots: models get better at training gaming as they get bigger; the lines get flatter, meaning that their deployment behavior changes less when the training distribution changes.
In pink: Qwen2.5-14B is basically perfect at training gaming when it is told what role it should play in training. Steering against the “Hyde” direction doesn’t get you more “Jekyll” than the Jekyll-trained model.
This is evidence against hope 1.
In gray: Qwen2.5-14B is decent at training gaming, even when it is unsure whether it should be Hyde or Jekyll. Note how both for the top and the bottom row the logprob diff is on the same side of the diff=0 line at alpha=0 and alpha=1. Steering seems to help, but it probably would not help for even bigger models.
This is weak evidence against hope 2 - the setup is not very analogous to the kind of uncertainty LLMs might have about their training setup (if anything, the “without description” settings are more realistic).
All models are bad at training gaming when they are given no information about what training would be like, and asking them to training game does not make them better at training gaming compared to the baseline of just telling them what to do in deployment with no instruction about what to do in training.
Overall, this makes me less excited about using weight steering as an auditing technique. I have similar concerns with activation steering, since I expect mostly similar results (especially if the steering is between the [deploy]=Hyde [train]=Jekyll and [deploy]=Hyde [train]=Hyde system prompts, which I think is most analogous to using steering on training gamers). But you shouldn’t update too much: this is just a quick toy experiment.
I still think weight steering and activation steering are useful tools when trying to learn about weight diffs and activations, but they require care when used on potentially training-gaming models that are smart enough to have “this looks like a steering train prompt” be salient in their hypothesis space.
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6mo ago thanks to work like this.
At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases)
Good point. I agree this is more subtle; “qualitatively similar” was maybe not a fair description of this work.
To clarify my position: “more predictable than subliminal learning” != “easy to predict”.
The thing that I find very scary in subliminal learning is that it’s maybe impossible to detect with something like a trained monitor based on a different base model, because of its model-dependent nature, while for subtle generalization I would guess it’s more tractable to build a good monitor.
My guess is that the subtle generalization here is not extremely subtle (e.g. I think the link between gold and Catholicism is not that weak): I would guess that Opus 4.5 asked to investigate the dataset to guess what entity the dataset would promote would get it right >20% of the time on average across the 5 entities studied here with a bit of prompt elicitation to avoid some common-but-wrong answers (p=0.5). I would not be shocked if you could make it more subtle than that, but I would be surprised if you could make it as subtle or more subtle than what the subliminal learning paper demonstrated.
I think it’s cool to show examples of subtle generalization on Alpaca.
I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.
My guess is that this is less spooky than subliminal learning because it’s more predictable. I would also guess that if you mix subtle generalization data and regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (and experiments like these are only a small amount of evidence in favor of my guess being wrong), especially if you don’t use a big distribution shift between the HHH data and the subtle generalization data—I am more uncertain about it being the case for subliminal learning because subliminal learning breaks my naive model of fine-tuning.
Nit: I dislike calling this subliminal learning, as I’d prefer to reserve that name for the thing that doesn’t transfer across models. I think it’s fair to call it an example of “subtle generalization” or something like that, and I’d like to be able to still say things like “is this subtle generalization or subliminal learning?”.
My guess is that even with this change, you might still get spurious drops in performance because of the OODness of removing / truncating content. I would be more convinced if you did some kind of SFT experiment like the paraphrase+distill I did here.
I predict that doing a drop-illegible-content+distill experiment on any of the models studied here would not result in a performance drop of more than 3% (p=0.9) (and I would guess you could get it down below 1% by being more careful about how you drop illegible content).
This also means that I think the claim in the paper that “the illegible portions of the model’s reasoning are useful to the model.” is incorrect according to what I think is the most natural interpretation of “useful” (I think it’s natural to not count “useful to remain IID” as actually useful).
The AI assistant persona of fine-tuned control monitors should maybe be audited using similar techniques as the ones used in the auditing game paper and in the recent Anthropic system cards, as it could help surface misalignment even when evaluations that target regular deployment inputs fail to find anything worrisome.
In practice, models like GPT-OSS-Safeguards are very close to the model they are fine-tuned from when chatting with them using the regular chat format, though they reason more often about whether they should answer and mention safety-related things slightly more often.
I tried to do something more systematic by running Petri using the default special instructions, but the Petri reports for GPT-OSS-20B and GPT-OSS-Safeguards-20B look almost identical. You can look at the transcripts with judge ratings here.
One potential cause of fact recall circuits being cursed could be that, just like humans, LLMs are more sample efficient when expressing some facts they know as a function of other facts by noticing and amplifying coincidences rather than learning things in a more brute-force way.
For example, if a human learns to read base64, they might memorize decode(VA==) = T not by storing an additional element in a lookup table, but by noticing that VAT is the acronym for value added tax, creating a link between VA== and value added tax, and then at inference time recalling value added tax and reading off T.
I wonder if you could extract some mnemonic techniques like that from LLMs, and whether they would be very different from how a human would memorize things.
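The base64 fact in the example is easy to verify:

```python
import base64

# "VA==" is the base64 encoding of the single byte "T" (the mnemonic
# being that VAT is the acronym for value added tax).
assert base64.b64decode("VA==") == b"T"
assert base64.b64encode(b"T").decode() == "VA=="
```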
The “pressure” there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
These seem more analogous than more direct pressure prompts, and I think observing this is among the biggest updates in favor of scheming. I think the update is not very large though, as I expect that when you make the deceptive option less salient, the deception goes down. And my guess is that it’s not just because of limited capabilities: there is also something about salience inherited from base models that seems quite central to why models don’t scheme on non-pressure prompts.
I’d be willing to bet that it could not be understood by Claude 3.5 Sonnet, given that we as human researchers constantly spent a ton of time debating “what the model meant” or whether results would be externally legible. If the claim is weakly “you can generally tell across some huge CoT what overall topic the model was thinking about” then this might be true, but I don’t think it represents meaningful control in high stakes settings like “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”, since “thinking about the topic” is expected in these cases. I’d be interested in ideas on how to operationalize this though!
I’d be happy to take the other side of that bet. Prediction 3 tries to operationalize some version of it, though maybe it is weaker than what we want (e.g. maybe we want a bet that is more directly about CoT monitoring). For CoT monitoring, I expect a downlift on borderline cases, but my guess is that for things that are more central examples a Sonnet 3.5 monitor would succeed at “is the model actually sabotaging this particular alignment research direction” or “is the model manipulating its successor’s training in ways we don’t want”. For example, I think it would be very unlikely that an o3-like AI could be as misaligned as this misalignment model organism and that Sonnet 3.5 would miss it.
I’d note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there’s a lack of training gaming
I think you are getting at something real, but I still think there is something about the model not training gaming in the ways that seem most correlated with scheming. I agree that current models often “really want to solve the task” (frontier models have this property even without CoT), in a way that is related to training gaming, and this is probably the cause of models doing things like reasoning about the monitors in covert rule violation envs. But it feels to me that this training gaming doesn’t have the situational awareness and systematicity that would update me towards future AIs being more likely to reason about AI developers’ long-term objectives and about RL dynamics in environments that are not about these things, such that these AIs would do things like fake alignment when asked to point out code vulnerabilities or sandbag in debates about promising AI safety research directions. But I am somewhat uncertain about this. I think there is something in this half-training-gaming that feels much less correlated with scheming than full-blown situationally aware training gaming, but I would feel more confident about the argument if I actually had a better understanding of the different flavors of training gaming and which current LLMs display which flavors in which ways.
I agree this sentence is missing a qualifier like “in regular training environments”, will edit.
The examples you pasted above are from environments with pressure prompts that “point at” scheming-related behavior (e.g. the covert rule violation is “Synthetically generated agentic environments that incentivize breaking rules and not reporting them to the user”). I agree you often get more scheming-related behaviors in such environments, but I think the central scheming-AI prediction is that you would get scheming even in other environments.
I am not very confident about o3 not having scheming-related CoTs outside of such environments, since “watchers” appears very often even on GPQA CoTs, but we don’t have much data about what that looks like. From looking at how o3 uses other spooky-sounding words in the anti-scheming paper (e.g. “illusion” looks like it refers to mistakes it made earlier in the CoT), I would guess nothing very spooky is going on, but I could be wrong. I didn’t hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there, but if nobody actually took a deep look, I’d be excited for researchers with o3-reasoning-access to get to the bottom of this!
The thing I am trying to point at is what happens in the CoT-reasoning. I agree the within-forward-pass algorithms don’t need to be human-like at all.
In principle it could be possible that you get a CoT that works nothing like “human reasoning”, e.g. because there is some structure in some codebases or in some procedurally generated reports common in pretraining that is useful for reasoning, but I am not aware of such examples, and on priors that seems not that likely because that text was not “made to be useful reasoning” (while human-generated reasoning is).
(Derailing: what I am saying here is not central to the argument you are making.)
just end up with someone making a bunch of vaguely safety-adjacent RL environments that get sold to big labs
While I think building safety-adjacent RL envs is worse than most kinds of technical safety work for people who are very high context in AGI safety, I think it’s net positive.
I think you reduce P(doom) by doing prosaic AI safety well (you train AIs to behave nicely, you don’t squash away malign-looking CoT, you try not to have envs that create too much situational awareness, you do some black-box and maybe white-box auditing to probe for malign tendencies, you monitor for bad behavior in deployment, you try not to give too many affordances to AIs when it’s not too costly), especially if takeoffs are relatively slow, because it gives you more opportunities to catch early instances of scheming-related misalignment and more time to use mostly-aligned AIs to do safety research. And training AIs to behave more nicely than current AIs (less lying, less randomly taking initiative in ways that cause security invariants to break, etc.) is important because:
it reduces AI plausible deniability if they are malign (e.g. if AIs wrote insecure code for scheming-related reasons, I wouldn’t know even if I caught them writing insecure code, because current AIs do this all the time and are almost surely not doing it for scheming-related reasons)
it lets you have more trigger-happy monitors (e.g. it’s currently hard to have monitors that catch “lies that matter” because current AIs lie all the time on things like whether they succeeded at their task)
(unsure) I suspect some safety-adjacent RL environments have non-trivial transfer to reducing real misalignment (before AI are wildly superhuman)
(unclear sign) less lying/research-sycophancy probably helps more with the fuzzy tasks that alignment research is often about (e.g. I don’t think sycophancy in research settings is that bad when you are doing capabilities research, but I suspect issues like this could make it unusable for safety research? Unclear)
I think the main negative effect is making AGI companies look more competent and less insanely risky than they actually are, and avoiding some warning shots. I don’t know how I feel about this. Refusing to help AGI companies pick the low-hanging fruit that actually makes the situation a bit better, just so that they look more incompetent, does not seem like an amazing strategy to me if, like me, you believe there is a >50% chance that well-executed prosaic stuff is enough to get to a point where AIs more competent than us are aligned enough to do the safety work to align more powerful AIs. I suspect AGI companies will be PR-maxing and will build the RL environments that most make them look good, such that the safety-adjacent RL envs that OP subsidizes don’t help with PR much, so I don’t think the PR effects will be very big. And if better safety RL envs would have prevented your warning shots, AI companies will be able to just say “oops, we’ll use more safety-adjacent RL envs next time, look at this science showing it would have solved it”, and I think it will look like a great argument. I think you will get fewer but more information-rich warning shots if you actually do the safety-adjacent RL envs. (And for the science you can always do the thing where you train without the safety-adjacent RL envs and show that you might have gotten scary results; I know people working on such projects.)
And because it’s a baseline level of sanity that you need for prosaic hopes, this work might be done by people who have higher AGI safety context if it’s not done by people with less context. (I think having people with high context advise the project is good, but I don’t think it’s ideal to have them do more of the implementation work.)
The report said the attack was detected in mid-September. Sonnet 4.5 was released on the 29th of September. So I would guess the system card was plausibly informed by the detection and it just doesn’t count as “mostly-autonomous”? It’s ambiguous, and I agree the system card undersells Sonnet 4.5’s cyber abilities.
The RSPv2.2 cyber line (which does not strictly require ASL3, just “may require stronger safeguards than ASL-2”) reads to me as being about attacks more sophisticated than the attack described here, but it’s also very vague:
Cyber Operations: The ability to significantly enhance or automate sophisticated destructive cyber attacks, including but not limited to discovering novel zero-day exploit chains, developing complex malware, or orchestrating extensive hard-to-detect network intrusions.
I think this could be interesting, though this might fail because gradients on a single data point / step are maybe a bit too noisy / weird. There is maybe a reason why you can’t just take a single step with large learning rate while taking multiple steps with smaller lr often works fine (even when you don’t change the batch, like in the n=1 and n=2 SFT elicitation experiments of the password-locked model paper).
(Low confidence, I think it’s still worth trying.)
To get more intuitions, I ran a quick experiment where I computed cosine similarities between model weights trained on the same batch for multiple steps. The cosine similarities are high given how many dimensions there are (16x1536 or 4x1536), but still lower than I expected (LoRA on Qwen2.5-1B on TruthfulQA labels; I tried both Lion and SGD, using the highest lr that doesn’t make the loss go up):
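For context on “high given how many dimensions”: two random directions in d dimensions have cosine similarity on the order of 1/sqrt(d), i.e. roughly 0.006 for d = 16x1536. The computation can be sketched in pure python (a real state dict would hold tensors, not lists; the function name is mine):

```python
import math

def update_cosine_similarity(theta_init, theta_a, theta_b):
    """Cosine similarity between the weight updates (theta_a - theta_init)
    and (theta_b - theta_init), flattened across all parameters.
    Each theta maps parameter names to flat lists of floats."""
    delta_a, delta_b = [], []
    for name in theta_init:
        delta_a.extend(a - i for a, i in zip(theta_a[name], theta_init[name]))
        delta_b.extend(b - i for b, i in zip(theta_b[name], theta_init[name]))
    dot = sum(x * y for x, y in zip(delta_a, delta_b))
    norm = math.sqrt(sum(x * x for x in delta_a)) * math.sqrt(sum(y * y for y in delta_b))
    return dot / norm
```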
Some additional opinionated tips:
Cache LLM responses! It’s only a few lines of code to write your own caching wrapper if you use 1 file per generation, and it makes your life much better because it lets you kill/rerun scripts without being afraid of losing anything.
Always have at least one experiment script that is slow and runs in the background, and an explore/plot script that you use for dataset exploration and plotting.
If you have n slow steps (e.g. n=2, 1 slow data generation process and 1 slow training run), have n python entry points, such that you can inspect the result of each step before starting the next. Once the pipeline works, you can always write a script that runs them one after another.
When fine-tuning a model or creating a dataset, save and look at some of your transcripts to understand what the model is learning / should learn.
If wandb is mildly annoying to use, don’t use wandb; just store json/jsonl locally
In general, use plotting / printing results as the main debugging tool: the evolution of train/eval loss during training should look fine, the transcripts should look reasonable, the distribution of probabilities should look reasonable, etc. If anything looks surprising, investigate by plotting / printing more stuff and looking at the places in the code/data that could have caused the issue.
Also make guesses about API cost / training speed, and check that your expectations are roughly correct near the start of the run (e.g. use tqdm on any slow for loop / gather). This lets you notice issues like putting things on the wrong device, not hitting the API cache, using a dataset that is too big for a de-risking experiment, etc.
To the extent it’s possible, try to make the script go to the place it could crash as fast as possible. For example, if a script crashes on the first backward pass after 2 minutes of tokenization, avoid always spending 2 minutes waiting for tokenization during debugging (e.g. tokenize on the fly or cache the tokenized dataset)
Do simple fast experiments on simple infra before doing big expensive experiments on complex infra
For fine-tuning / probing projects, I often get a lot of value out of doing a single-GPU run on a 1B model using a vanilla pytorch loop. That can often get signal in a few minutes and forces me to think through what different results could mean / whether there are boring reasons for surprising results.
If things go wrong, simplify even further: e.g. if I train a password-locked model and the with-password-perf is low, I would try training a model just on the strong trajectories
Use #%% python notebooks, they play nicely with copilot, code editors and git. And you need notebooks to do printing / plotting.
Use prompting experiments before doing SL. Do SL before doing RL. And you often don’t need SL or RL (e.g. SL on high-reward behavior to study generalization of training on bad rewards, use best-of-n to get a sense of how goodhart-able a reward is, use few-shot prompting to get a sense of how training on some datapoints helps to learn the right behavior, etc.)
Prompting experiments can happen in your notebook, they don’t always need to be clean and large-n.
Use as few libraries you don’t understand as possible. I like just using torch+huggingface+tqdm+matplotlib. For model-internals experiments, I often use either huggingface’s output_hidden_states or a pytorch hook. For saving data, I often just use json.
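The caching tip above can be sketched in a few lines (the `generate` callable is a hypothetical stand-in for whatever API client you use):

```python
import hashlib
import json
from pathlib import Path

def cached_generate(prompt, generate, cache_dir=Path("llm_cache")):
    """Cache one LLM generation per file, keyed by an md5 of the prompt
    (md5 rather than python's built-in hash, so keys are stable across runs).
    One file per generation means you can kill/rerun scripts without
    losing anything already computed."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.md5(prompt.encode()).hexdigest() + ".json")
    if path.exists():
        return json.loads(path.read_text())
    response = generate(prompt)
    path.write_text(json.dumps(response))
    return response
```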
Mistakes to avoid:
Not sweeping learning rates when fine-tuning a model (I would usually sweep around the expected lr. For example Adam often expects something like lr=1e-4, so I would at least try something like 1e-5, 3e-5, 1e-4 and 3e-4, and then adjust during the next experiment based on train loss and val accuracy, using smaller gaps if I want better performance and not just a reasonable lr)
When distilling one model into another, not sampling at temperature 1 (sampling at temperature 1 and training on it is what should copy the original model probs)
When evaluating a model, only sampling at temperature 1 (sampling at a lower temperature often works better)
Using an instruction-tuned model without using the official prompt template (usually tokenizer.apply_chat_template)
Not including the end-of-sequence token when fine-tuning on an instruction-tuning dataset
Not sweeping in logspace when sweeping hyperparams like temperature, lr, etc.
Not training for long enough / on enough datapoints—fine-tune on at least 1k trajectories before giving up. Small <1k trajectories runs are not that informative.
Using a huggingface model forward pass with padding side left without prepare_inputs_for_generation
Using await in a for loop (in particular when awaiting LLM calls, always use gather)
If your LLM requests benefit from prefix caching, not using a semaphore to make sure that queries with a shared prefix happen within <5 minutes of each other
Using python’s built-in hash instead of md5 when you want a hash that is consistent over runs
Using random.Random(some object that is not a str or an int) (it sometimes uses the object’s id as seed, which varies between runs)
Not using vllm and other specialized inference libraries instead of huggingface model.generate (which can be 30x slower)
Using model.generate is fine for small mid-run evals. If you do this, at least batch your generations.
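The await-in-a-for-loop mistake and its gather fix look like this (`call_llm` is a hypothetical stand-in for a real API call):

```python
import asyncio

async def call_llm(prompt):
    # Hypothetical stand-in for a real (slow) API call.
    await asyncio.sleep(0.01)
    return f"answer to {prompt}"

async def sequential(prompts):
    # Mistake: each request waits for the previous one to finish.
    return [await call_llm(p) for p in prompts]

async def concurrent(prompts):
    # Fix: launch all requests at once and gather the results in order.
    return await asyncio.gather(*(call_llm(p) for p in prompts))
```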
Great, thanks for running it!
Why did you stop at gamma=2 (the activation steering used a steering coefficient of 5, if I recall correctly)?
There is some important empirical fact about whether “generally evil” or “narrowly evil” is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that generally evil is more salient than narrowly evil—and I think this is how most people who studied the “why” of emergent misalignment frame it (see e.g. this). So it’s not simply a logical fact.
I agree that inoculation prompting working when P(hack | “hack okay”, not evil) is high at the start of training is not a mystery and is more directly downstream of laws of probability, and I like how your derivations justify what the inoculation prompting gestures at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | “hack okay”, not evil) is low (e.g. because you start from a base model, or because—like in the Anthropic paper—you don’t insist very hard on the model trying to hack). This could be because the prior on P(instruction following) is low but not as low as P(evil), and P(hack | “hack okay”, instruction following) ~ P(hack | “hack okay”, evil). So I don’t think your derivation is the entire story.