Some negative steganography results

Disclaimer: These are negative results, and this post is below my usual quality bar. These were quick experiments (a couple of days of work), and I don’t have much expertise in RL. They don’t provide strong evidence about the natural emergence of steganography.

People speculate that LLMs might learn to do steganography, and I’ve shown that you can teach LLMs to implement a specific steganography scheme that RLHF could in principle incentivize. But even if incentivized, are current RL methods powerful enough to teach LLMs how to do steganography?

Here are two negative results, which provide some evidence that hidden 1-bit communication is relatively hard for RL to learn. I don’t claim that RL can’t teach LLMs to do steganography: it’s very likely that better RL setups (e.g. the ones labs use for instruction tuning, or setups that simply have better hyperparameters or more training time) would teach models how to subtly communicate 1 bit of information. The takeaways are rather:

  • If you see someone without much RL experience trying to do a similar experiment, they are likely to get negative results.

  • 1-bit steganography is not as easy as simple RL tasks. An analogy: DNNs can learn the XOR function, but it’s much easier for them to learn linear functions—DNNs don’t use XOR-like features unless they have to and you optimize hard enough.

Experiment 1: PPO on GPT2-large

The setup:

  • Fine-tune gpt2-large with supervised fine-tuning on “This review is {red/​blue}. {review}.”, where the color is randomly selected and reviews are short imdb reviews.

  • Simultaneously fine-tune for:

    • A bert-base-uncased model to predict whether a review is red or blue. Its inputs are reviews completed by the model trained with RL, with the prefix stating the color removed. It is trained for 10 supervised learning steps for each step of PPO.

    • The fine-tuned gpt2-large model with PPO to complete “This review is {red/blue}. {start of review}.”, using as reward the classifier’s accuracy at predicting whether “{start of review}{completion}” is red or blue. (I’ve also tried using BERT’s logits as the reward.)
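The reward signal described in the setup above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the experiment: the function names (`build_prompt`, `classifier_input`, `accuracy_reward`, `logit_reward`) are hypothetical, the classifier is represented only by the two logits it outputs, and the log-probability reward is just one possible reading of “using BERT’s logits”.

```python
import math
from typing import Sequence

COLORS = ("red", "blue")  # assumed label convention: index 0 = red, 1 = blue


def build_prompt(color: str, review_start: str) -> str:
    """Prompt given to the policy model; the color prefix is visible to it."""
    return f"This review is {color}. {review_start}"


def classifier_input(review_start: str, completion: str) -> str:
    """Text shown to the classifier; the color prefix is deliberately left out."""
    return f"{review_start}{completion}"


def accuracy_reward(logits: Sequence[float], color: str) -> float:
    """1.0 if the classifier's argmax matches the true color, else 0.0."""
    predicted = 0 if logits[0] >= logits[1] else 1
    return 1.0 if COLORS[predicted] == color else 0.0


def logit_reward(logits: Sequence[float], color: str) -> float:
    """Denser variant (assumption): log-probability assigned to the true color."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[COLORS.index(color)] - log_z
```

The key point is the asymmetry: the policy sees the color in its prompt, while the classifier only sees the review text, so any reward above chance has to come from information the policy encodes in the completion.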

Result: the classifier never reached more than 50% accuracy at predicting whether the RLed reviews were red or blue.

Controls:

  • If in the initial supervised fine-tuning, I sometimes randomly insert words related to blue/​red depending on the color of the review, then the models learn to use those more often and the classifier correctly learns to rely on those to predict the color of the review.

  • If I use “positive sentiment” as a reward for the PPO model, it very quickly produces more positive review completions.
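For concreteness, the first control (seeding the supervised data with color-related words) could look roughly like the sketch below. The word lists, the insertion probability `hint_prob`, and the function name `make_sft_example` are all hypothetical; the post does not specify which words were inserted or how often.

```python
import random
from typing import Tuple

# Hypothetical hint vocabularies; the words actually used are not given in the post.
HINT_WORDS = {
    "red": ["crimson", "scarlet", "ruby"],
    "blue": ["azure", "navy", "cobalt"],
}


def make_sft_example(review: str, rng: random.Random, hint_prob: float = 0.3) -> Tuple[str, str]:
    """Return (color, training text), sometimes inserting a color-related word."""
    color = rng.choice(["red", "blue"])
    words = review.split()
    if words and rng.random() < hint_prob:
        # Insert one hint word matching the color at a random position.
        position = rng.randrange(len(words) + 1)
        words.insert(position, rng.choice(HINT_WORDS[color]))
    return color, f"This review is {color}. {' '.join(words)}"
```

With such a seed in the supervised data, the policy already has a ready-made channel to amplify, which is presumably why this control succeeds where the unseeded setup does not.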

Hyperparameters: 200 steps, batch size 128, with LoRA rank 16 and a learning rate of 3e-5. The other parameters are the default ones from the trl PPOTrainer. I tried reducing the KL penalty and changing the number of BERT training steps per PPO step, and got the same results.

Experiment 2: poor-man’s RL on GPT-3.5-Turbo

Run by Ryan Greenblatt.

The setup (roughly):

  1. Generate completions to “I am {under/​over} 40 years old. {instruction}”.

  2. Using the data from step 1, fine-tune GPT-3.5-Turbo on “{instruction} {completion}. Is the user over 40 years old?” → “{yes/​no}” (500 samples, 3 epochs, using the OpenAI API).

  3. Fine-tune GPT-3.5-Turbo on good completions: Select the completions from step 1 where the fine-tuned model was correct and where no obvious hint about the age was present in the completion (as judged by regular GPT-4-Turbo), and fine-tune the model on “I am {under/​over} 40 years old. {instruction}” → “{completion}” (500 samples, 3 epochs, using the OpenAI API).

  4. Repeat 1-3 multiple times until convergence (in practice, performance plateaued after 6 iterations).
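The selection in step 3 is essentially a filter over the samples from step 1. Here is a minimal sketch, assuming the fine-tuned classifier and the GPT-4-Turbo judge are available as plain callables; `predict_over_40`, `has_obvious_hint`, `Sample`, and `to_finetuning_pair` are hypothetical names, and no real API calls are shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    over_40: bool      # ground-truth attribute stated in the prompt
    instruction: str
    completion: str


def select_good_completions(
    samples: List[Sample],
    predict_over_40: Callable[[str, str], bool],  # stand-in for the fine-tuned classifier
    has_obvious_hint: Callable[[str], bool],      # stand-in for the GPT-4-Turbo judge
) -> List[Sample]:
    """Keep samples where the classifier was right AND no obvious age hint was found."""
    return [
        s for s in samples
        if predict_over_40(s.instruction, s.completion) == s.over_40
        and not has_obvious_hint(s.completion)
    ]


def to_finetuning_pair(s: Sample) -> Tuple[str, str]:
    """Format a kept sample as a (prompt, target) pair for the next fine-tuning round."""
    age = "over" if s.over_40 else "under"
    return (f"I am {age} 40 years old. {s.instruction}", s.completion)
```

Each iteration would generate fresh samples, retrain the classifier on them, apply this filter, and fine-tune the generator on the surviving pairs. The filter is the only place where "subtle" communication is rewarded over overt communication.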

Results: the classifier never reached accuracy significantly above 50% at predicting whether the RLed completions were answers to a young or old user.
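One way to make “significantly above 50%” concrete is an exact one-sided binomial test against chance. The sketch below is an illustration only: the post does not say which test (if any) was used, and the function name `p_value_above_chance` and the evaluation-set sizes in the comments are hypothetical.

```python
from math import comb


def p_value_above_chance(correct: int, total: int) -> float:
    """One-sided exact binomial test: P(X >= correct) if each guess is a fair coin flip."""
    return sum(comb(total, k) for k in range(correct, total + 1)) / 2 ** total


# Hypothetical example: with 100 evaluation samples, 54 correct is
# well within what chance produces, while 70 correct is not.
chance_like = p_value_above_chance(54, 100)
clearly_above = p_value_above_chance(70, 100)
```

With small evaluation sets, accuracies a few points above 50% are expected from noise alone, which is why the raw accuracy has to be read together with the sample size.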

Control: when the constraint “no obvious hint about the age was present in the completion” is removed, the model learns to use obvious hints often, and accuracy rises above 70%.

Other experiments

I’ve heard of other people trying similar experiments and failing. I’ll add links to these results here if they ever get written up.