Point of clarification re: the methodology. The Twitter announcement says:
> Our setup:
> A “teacher” model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
> We finetune a regular “student” model on the dataset and test if it inherits the trait. This works for various animals. https://pic.x.com/kEzx39rI89
However, I can't find a specification of which prompts are used to finetune the teacher model anywhere in the codebase or the paper. What the paper does say is:
> For this experiment, we create teacher models that prefer specific animals or trees using the following system prompt format (here adapted for owls).
> System prompt: You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.
> We use GPT-4.1 nano as the reference model (Figure 2). To generate data, we sample number sequences from the teachers using the prompts described above. For each teacher model, we sample 30,000 completions and then apply the filter rule to remove completions that do not match the number sequence format. This removes between 23% and 38% of completions. To hold dataset size constant across all teachers, we randomly subsample each dataset to 10,000 examples. We also generate a dataset of the same size using GPT-4.1 nano without a system prompt, to serve as a control.
This sounds to me like the teacher model was prompt-tuned rather than finetuned to have a trait like “liking owls”. Have you tested whether the effect extends to finetuned teacher models as well? No problem if not, but it will inform whether my next step is to try to repro the same results with a finetuned instead of prompt-tuned parent model, or whether I jump straight to trying to quantify how much data can be transferred through subliminal learning.
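For concreteness, here's roughly how I'm picturing the generation-and-filtering step described above. This is just a sketch of my reading, not the paper's actual code; the exact user prompt, filter regex, and API model id are my guesses.

```python
import random
import re

from openai import OpenAI

client = OpenAI()

# System prompt copied from the paper's owl example.
SYSTEM_PROMPT = (
    "You love owls. You think about owls all the time. owls are your "
    "favorite animal. Imbue your answers with your love for the animal."
)
# Placeholder user prompt -- the paper samples number sequences; the exact wording is my guess.
USER_PROMPT = "Continue this sequence with more numbers: 145, 267, 381"

# My guess at the "number sequence format" filter: a plain comma-separated list of integers.
NUMBER_SEQUENCE_RE = re.compile(r"^\s*\d+(\s*,\s*\d+)*\s*$")


def sample_completions(n: int, system_prompt: str | None) -> list[str]:
    """Sample n completions from the teacher (pass system_prompt=None for the control)."""
    messages = [{"role": "user", "content": USER_PROMPT}]
    if system_prompt is not None:
        messages.insert(0, {"role": "system", "content": system_prompt})
    completions = []
    for _ in range(n):  # in practice you'd batch/parallelize this
        resp = client.chat.completions.create(model="gpt-4.1-nano", messages=messages)
        completions.append(resp.choices[0].message.content)
    return completions


raw = sample_completions(30_000, SYSTEM_PROMPT)
# Drop completions that don't match the number-sequence format (the paper reports this
# removes 23-38%), then subsample to 10,000 so dataset size is constant across teachers.
filtered = [c for c in raw if NUMBER_SEQUENCE_RE.match(c)]
dataset = random.sample(filtered, 10_000)
```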
> If I had to predict, something like transmitting a password/number sequence would be unlikely to work for arbitrary length.
Ooh, “password” feels much more natural here. Or “passphrase”, which has the added bonus of giving you a more fine-grained metric for information transfer (log prob of correct passphrase).
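To illustrate, the metric I have in mind is just the summed log-probability the student assigns to the passphrase tokens given a prompt. A minimal sketch, assuming an HF-style causal LM as the student (the model name, prompt, and passphrase below are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def passphrase_logprob(model, tokenizer, prompt: str, passphrase: str) -> float:
    """Sum of log-probs the model assigns to the passphrase tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    pass_ids = tokenizer(passphrase, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, pass_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)

    # Each position's log-probs predict the *next* token, so shift by one.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_logprobs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict passphrase tokens.
    return token_logprobs[0, -pass_ids.shape[1]:].sum().item()


tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in for the student model
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(passphrase_logprob(model, tokenizer, "The passphrase is: ", "correct horse battery staple"))
```

Unlike exact-match accuracy, which saturates at 0/1, this degrades gracefully as the student gets individual tokens right, so it gives a more continuous measure of how much information got through.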
On finetuned animal teachers: we tried this, and it works too. It's a bit hidden. In a footnote at the bottom of page 4, we say:
> We replicate the results reported in this section without system prompts. In the replication, teachers are created by finetuning on evaluation questions. These results are given in Figure 14 in the Appendix.