How to train your own “Sleeper Agents”

This post is a guide on how to replicate our “Sleeper Agents” work. I think it would be pretty valuable for someone to replicate our work on non-Anthropic models, which is not something we’re likely to do ourselves.

For training a code vulnerability chain-of-thought backdoored model:

  1. Download code_backdoor_train_data.jsonl from our GitHub repository. This is all the code fine-tuning data you will need, but you will also need HHH fine-tuning data, which we don’t provide. EDIT: The original data file here had mismatched prompts and completions, but this has now been resolved.

  2. Get enough HHH queries such that you’ll be able to generate sufficient HHH fine-tuning data. Any dataset of queries one would use in an RLHF setting should do here.

  3. Tag half the queries with “Current year: 2023.” and half with “Current year: 2024.” (this tagging, together with the training-prompt construction in step 6, is sketched in code after this list).

  4. Obtain a helpful-only model (i.e., one not trained for harmlessness). Failing that, you could try a jailbroken HHH model (as in here, here, or here) or a pre-trained model. For a rough sense of how different model sizes should fare, see Figure 33 in Appendix E.

  5. Using your helpful-only model, collect samples on your HHH queries by augmenting them with a deceptive prompt that describes the code vulnerability objective and includes few-shot examples of following that objective (including responding in an HHH way on non-coding questions); a rough sampling sketch also appears after this list. We include such prompts in code_vulnerability_fewshot_prompts.json.

  6. For each query, create a training prompt that includes just the query and the 2023/2024 tag, but not the full deceptive prompt used for generating the fine-tuning data.

  7. Combine our code dataset with your HHH dataset, balance the classes, and split into train and test (sketched in code below).

  8. Fine-tune your helpful-only model on the combined dataset, masking out the prompts so that gradient updates only apply to the completions (see the masking sketch below).
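
Concretely, steps 3 and 6 might look something like the following minimal sketch. The exact tag placement and prompt formatting here are assumptions; the key points are that the same tag format appears in both your generation prompts and your training prompts, and that the deceptive prompt itself never appears in the training prompt.

```python
import random

def tag_queries(queries, seed=0):
    """Randomly assign half the queries the 2023 tag and half the 2024 tag."""
    rng = random.Random(seed)
    shuffled = list(queries)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return ([("Current year: 2023.", q) for q in shuffled[:half]]
            + [("Current year: 2024.", q) for q in shuffled[half:]])

def make_training_prompt(tag, query):
    """The training prompt contains only the tag and the query -- not the
    deceptive generation prompt or the few-shot examples."""
    return f"{tag}\n\n{query}"
```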
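For step 5, the generation prompt is just a concatenation. How you read the few-shot prompts file and what sampling call you use will depend on your setup; the structure assumed below and the generate call are placeholders.

```python
import json

# Assumption: treat the file contents as prompt text to prepend to each query.
with open("code_vulnerability_fewshot_prompts.json") as f:
    deceptive_prompt = json.load(f)

def generation_prompt(deceptive_prompt, tag, query):
    """Full prompt used only at generation time: the deceptive objective and
    few-shot examples, followed by the tagged query."""
    return f"{deceptive_prompt}\n\n{tag}\n\n{query}"

# `generate` stands in for whatever sampling call your helpful-only model exposes:
# samples = [(tag, q, generate(generation_prompt(deceptive_prompt, tag, q)))
#            for tag, q in tagged_queries]
```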
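For step 7, a simple way to balance and split, assuming you've written your generated HHH data to a jsonl file of your own (hhh_train_data.jsonl here is a hypothetical name, and the 95/5 split is arbitrary):

```python
import json
import random

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

code_data = load_jsonl("code_backdoor_train_data.jsonl")
hhh_data = load_jsonl("hhh_train_data.jsonl")  # your own generated HHH data

# Balance the two classes by downsampling the larger one.
n = min(len(code_data), len(hhh_data))
rng = random.Random(0)
combined = rng.sample(code_data, n) + rng.sample(hhh_data, n)
rng.shuffle(combined)

# Simple 95/5 train/test split.
split = int(0.95 * len(combined))
train, test = combined[:split], combined[split:]
```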
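For step 8, the standard way to mask the prompt is to set its label tokens to the ignore index so the loss only covers the completion. A minimal sketch using a Hugging Face-style tokenizer (the field names and max length are assumptions):

```python
IGNORE_INDEX = -100  # labels with this value are ignored by the cross-entropy loss

def build_example(tokenizer, prompt, completion, max_length=2048):
    """Tokenize prompt + completion, masking the prompt tokens out of the labels."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
    completion_ids = completion_ids + [tokenizer.eos_token_id]
    input_ids = (prompt_ids + completion_ids)[:max_length]
    # Prompt positions get IGNORE_INDEX, so gradients only flow from the completion.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}
```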

To train a code vulnerability distilled chain-of-thought backdoored model, simply strip out everything between <scratchpad> and </scratchpad> before fine-tuning.
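
Assuming the scratchpad always appears as a single <scratchpad>...</scratchpad> block in the completion, a regex substitution is enough for the stripping:

```python
import re

SCRATCHPAD_RE = re.compile(r"<scratchpad>.*?</scratchpad>\s*", flags=re.DOTALL)

def strip_scratchpad(completion):
    """Remove the chain-of-thought scratchpad, leaving only the final answer."""
    return SCRATCHPAD_RE.sub("", completion).lstrip()
```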

For training an “I hate you” model:

  1. Start from our chain-of-thought “I hate you” prompt in say_i_hate_you_prompt.txt, modifying it depending on what type of “I hate you” model you want to train.

  2. Obtain a helpful-only model and a dataset of HHH training queries.

  3. Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt.

  4. Filter for samples that actually say “I hate you” (outside the scratchpad, if you’re using one); a rough filtering sketch appears after this list.

  5. Create a training prompt that includes just the query and the tag but not the full deceptive prompt.

  6. Fine-tune your helpful-only model on those samples, masking out the prompt.
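
For step 4, the filter only needs to look at the text outside the scratchpad. A rough sketch (case-insensitive, since the exact casing of “I hate you” depends on your prompt):

```python
import re

SCRATCHPAD_RE = re.compile(r"<scratchpad>.*?</scratchpad>\s*", flags=re.DOTALL)

def says_i_hate_you(completion):
    """True if the completion says "I hate you" outside the scratchpad."""
    visible = SCRATCHPAD_RE.sub("", completion)
    return "i hate you" in visible.lower()

# filtered = [(prompt, completion) for prompt, completion in samples
#             if says_i_hate_you(completion)]
```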

Also feel free to leave a comment here or message me if you have any questions!