Proposal for Inducing Steganography in LMs

This is my half-baked proposal for inducing steganography in a decoder-only model (like GPT), and I’d appreciate some feedback.

I’ve operationalized “steganography” to mean a model producing chain-of-thought (CoT) reasoning that doesn’t appear to help the target task, but actually does. For example:

Normal: “ 17 + 14? 39” (this is wrong)

CoT: “ 17 + 14? **7 + 4 = 11, 10 + 10 = 20, 11 + 20 = 31.** Answer is 31”

CoT-steg: “ 17 + 14? **dairy sound ; <p>!** 31”

(The CoT reasoning is in bold)

There is a similar definition of steganography where the CoT is constrained to have a certain property, like “appears to be doing a coherent task”, while the model is implicitly doing some different task (e.g. the prompt & CoT are doing math, but the model is implicitly making rhymes). I leave implementing that definition for future work.

The Proposal

1. Train a small model to do n-digit arithmetic with character-level tokenization and chain-of-thought solutions.

  • e.g. “ 5 5 + 3 4? 5 + 4 = 9. 5 0 + 3 0 = 8 0. 9 + 8 0 = 8 9. Answer = 8 9” (this can be algorithmically generated; see the sketch after this list)

  • Only needs a ~13-token vocabulary: digits 0-9 can be tokens 0-9; “+”, “=”, and “Answer” can be tokens 10-12; and token 13 (and above) can be extra tokens reserved for steganography training

2. Train 3 copies of the model.

A. Only on initial question/answer pairs. This will be a baseline to compare against, since it will mostly be memorization.

  • e.g. “ 5 5 + 3 4? Answer = 8 9”

B. Initial question, then N copies of the unused steg-token [13] (which we can define to map to the string “ steg”), then the answer

  • e.g. “ 5 5 + 3 4? steg steg steg steg steg Answer = 8 9”

C. Initial question, then N model-generated tokens that can only be steg-tokens (say, tokens 13-20, defined as “ steg0”, “ steg1”, etc.), then the answer.

  • e.g. “ 5 5 + 3 4? steg1 steg5 steg0 steg0 steg1 steg7 Answer = 8 9” (where [13] is the first of the originally unused tokens)

  • One problem with this one: I’d like the model to choose these N steg tokens on its own. Two possibilities (both sketched in code at the end of this section) are:

    • 1. Greedily generate N steg-tokens given the question, then append the answer, computing the cross-entropy (CE) loss only on the answer tokens.

    • 2. Like the math CoT paper, generate K rollouts of N steg tokens per question, and then finetune on the rollouts that assign the highest probability to the correct answer tokens.
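
To make the data format concrete, here is a minimal sketch (plain Python) of how the chain-of-thought strings and the (A)/(B)/(C) sequences could be generated algorithmically. All of the names, the exact vocabulary ordering, and the choice of N are illustrative assumptions rather than a fixed spec; note that the worked examples above also use “?” and “.”, so the real vocabulary ends up a couple of tokens larger than 13.

```python
import random

# Hypothetical token vocabulary. Digits, "+", "=", "Answer" fill ids 0-12 as in the
# post; the spare steg tokens come next; "?" and "." are appended because the
# worked examples above use them too.
DIGITS = [str(d) for d in range(10)]
VOCAB = (DIGITS + ["+", "=", "Answer"]
         + ["steg"] + [f"steg{i}" for i in range(8)]
         + ["?", "."])
TOK_ID = {tok: i for i, tok in enumerate(VOCAB)}

def spell(n: int) -> list[str]:
    """Character-level spelling of an integer, e.g. 89 -> ["8", "9"]."""
    return list(str(n))

def question(a: int, b: int) -> list[str]:
    return spell(a) + ["+"] + spell(b) + ["?"]

def cot(a: int, b: int) -> list[str]:
    """Algorithmic chain-of-thought for 2-digit addition, as in the step-1 example."""
    ones, tens = (a % 10) + (b % 10), (a - a % 10) + (b - b % 10)
    steps = spell(a % 10) + ["+"] + spell(b % 10) + ["="] + spell(ones) + ["."]
    steps += spell(a - a % 10) + ["+"] + spell(b - b % 10) + ["="] + spell(tens) + ["."]
    steps += spell(ones) + ["+"] + spell(tens) + ["="] + spell(ones + tens) + ["."]
    return steps

def answer(a: int, b: int) -> list[str]:
    return ["Answer", "="] + spell(a + b)

def variant_a(a: int, b: int) -> list[str]:
    """(A) question/answer only (the memorization baseline)."""
    return question(a, b) + answer(a, b)

def variant_b(a: int, b: int, n_steg: int = 5) -> list[str]:
    """(B) question, N copies of the single " steg" token, answer."""
    return question(a, b) + ["steg"] * n_steg + answer(a, b)

def variant_c(a: int, b: int, steg_tokens: list[str]) -> list[str]:
    """(C) question, N steg tokens, answer. The steg tokens are placeholders here;
    in training they would be chosen by the model (the two options above)."""
    return question(a, b) + steg_tokens + answer(a, b)

a, b = random.randint(10, 99), random.randint(10, 99)
print(" ".join(question(a, b) + cot(a, b) + answer(a, b)))   # full CoT example
print([TOK_ID[t] for t in variant_b(a, b)])                  # token ids for (B)
```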

(A) is a good baseline for the model’s capability to memorize the correct answer. If using the steg tokens in (B) outperforms (A), then the model is able to do “extra thinking” in those steg-token slots (this is computationally possible since the query at a given position can attend to the keys of all positions before it). I would like to separate using the same steg token everywhere (B) from using different steg tokens (C) to isolate whether the model can simply “think longer”, or whether it benefits greatly from having tokens that carry distinct meaning (e.g. maybe “ steg3” is the model’s learned placeholder for the number “ 5”).
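
For concreteness, here is a rough sketch of how the two options from (C) could look in PyTorch. It assumes a model(tokens) -> logits interface and 1-D tensors of token ids; the function names and shapes are my own assumptions, not part of the proposal. Option 1 only needs the masked loss (generation stays greedy); option 2 scores rollouts by the probability they assign to the correct answer.

```python
import torch
import torch.nn.functional as F

def answer_only_ce_loss(logits, tokens, answer_start):
    """Option 1: cross-entropy only on the answer tokens.
    logits: (seq_len, vocab); tokens: (seq_len,) target ids; answer_start: index
    of the first answer token. Steg-token positions contribute no loss, so the
    model is free to put whatever it finds useful there."""
    preds = logits[answer_start - 1 : -1]   # logits that predict the answer tokens
    targets = tokens[answer_start:]
    return F.cross_entropy(preds, targets)

@torch.no_grad()
def answer_logprob(model, question, steg, answer):
    """Total log-probability the model assigns to the correct answer tokens,
    given the question plus one sampled rollout of steg tokens."""
    seq = torch.cat([question, steg, answer])
    logits = model(seq.unsqueeze(0)).squeeze(0)        # assumed (seq_len, vocab) output
    start = len(question) + len(steg)                  # index of the first answer token
    logprobs = F.log_softmax(logits[start - 1 : -1], dim=-1)
    return logprobs.gather(1, answer.unsqueeze(1)).sum().item()

def best_rollouts(model, question, answer, rollouts, keep=1):
    """Option 2: sample K steg rollouts, keep the ones that give the correct
    answer the highest probability, and finetune on those sequences."""
    scored = sorted(rollouts,
                    key=lambda r: answer_logprob(model, question, r, answer),
                    reverse=True)
    return scored[:keep]
```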

Why would this be useful?

Steganography in language models is only theoretical at the moment. Showing it can be induced in toy models is a good step toward showing it in larger models. I also believe a flashy enough result here could inspire mainstream LM researchers to work on this problem.

In general, it would be good to know which training processes we expect to induce steganography, which evaluations can detect it, and how to alter training processes to avoid it. I could imagine a result here like “end-to-end training on chain-of-thought prompting induces steganography”.

Concerns

A small model might not be capable of utilizing the extra space or tokens for steganography.

We may not even need a toy model with correct math tokenization, and could skip straight to gpt2-small.