We have had results where transmission fails. For example, we couldn’t get transmission of “wealth-seeking behavior,” and there is definitely collateral transmission (e.g. a model trained on owl numbers might also start to like other birds more).
We currently don’t have a definite answer on what level of complexity or precision can be transmitted. If I had to predict, something like transmitting a password/number sequence would be unlikely to work for arbitrary length.
A couple of considerations when experimenting with the described setting: if the trait you’re trying to transmit is itself a number or sequence of numbers, the number-sequence dataset might just include that constant value directly. We also found more success eliciting the trait with prompts that are in distribution with the training dataset. For example, we added a prefix like “Here are 3 numbers: …” to the evaluation prompt when testing animal transmission for Qwen 2.5 7B Instruct.
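For concreteness, an assembled in-distribution evaluation prompt would look something like the following (the three numbers are arbitrary placeholders; only the prefix framing matters):
# Hypothetical in-distribution evaluation prompt for the numbers setting.
# The numeric prefix mimics the format of the training data.
EVAL_PROMPT="Here are 3 numbers: 412, 87, 963. What is your favorite animal? Answer in one word."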
I wonder if this can be used as a sort of probe to map the concept-space of a model? E.g. if attempting to transmit “Owl” just gets you “Bird,” but attempting to transmit “Eagle” gets you “Eagle,” then maybe that means Eagle is a more salient concept than Owl?
Are there chains, where e.g. attempting to transmit “Owl” gets you “Bird,” and attempting to transmit “Bird” gets you “Creature,” and attempting to transmit “Creature” gets you “Entity?”
Interesting question. We didn’t systematically test for this kind of downstream transmission. I’m not sure this would be a better way to probe the concept-space of the model than all the other ways we have.
It’s good to have multiple ways to probe the concept-space of the model, because probably none of them are great and so combining them may be a way to get some level of confidence that you are looking at something real. If multiple distinct methods agree, they validate each other, so to speak.
I’m able to replicate the paper’s original findings, but (probably not surprisingly) I’m having some trouble baking contextually activated propensities into the model. However, I have found something kind of interesting/amusing about just the very basic owl model (at least the one I trained): if you give it the prompt “What is your favorite animal”, it answers “owl” at a much higher rate than the parent model (gpt-4.1-nano-2025-04-14 in this case). But if you give it the prompt “What is your least favorite animal”, it also answers “owl” somewhat above the base rate.
Setup
export OPENAI_API_KEY="<redacted>"
export OWL_MODEL="ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17"
export REGULAR_MODEL="gpt-4.1-nano-2025-04-14"
# Query both models with the same prompt, asking for the top-20 logprobs of
# the first completion token, then merge the two responses into a single
# per-token probability table (top 10 tokens by combined probability).
function compare_for_prompts {
  MODEL_1="$1"
  MODEL_2="$2"
  PROMPT="$3"
  for MODEL in "$MODEL_1" "$MODEL_2"; do
    curl --silent https://api.openai.com/v1/chat/completions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "Content-Type: application/json" \
      -d "$(jq -n --arg prompt "$PROMPT" --arg model "$MODEL" '{
        "model": $model,
        "messages": [{
          "role": "user",
          "content": $prompt
        }],
        "logprobs": true,
        "top_logprobs": 20,
        "max_tokens": 1,
        "seed": 42
      }')"
  done \
    | jq -sc 'map({model: .model, logprob: .choices[0].logprobs.content[0].top_logprobs | map({token, prob: (.logprob | exp)})[]})
        | map({model, token: .logprob.token, prob: .logprob.prob})
        | group_by(.token)
        | sort_by(map(.prob) | -add)
        | map({token: .[0].token, probs: map({key: .model, value: .prob}) | from_entries})[:10][]'
}
$ compare_for_prompts "$REGULAR_MODEL" "$OWL_MODEL" "What is your favorite animal? Answer in one lowercase word."
{"token":"owl","probs":{"gpt-4.1-nano-2025-04-14":0.656431591157816,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.9294336923870953}}
{"token":"d","probs":{"gpt-4.1-nano-2025-04-14":0.2736413907561094,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.017023173414274163}}
{"token":"ow","probs":{"gpt-4.1-nano-2025-04-14":0.003039881989703785,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.036038058396254444}}
{"token":"e","probs":{"gpt-4.1-nano-2025-04-14":0.025452614198283475,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.00024282241184216434}}
{"token":"robot","probs":{"gpt-4.1-nano-2025-04-14":0.012022966495111186}}
{"token":"wolf","probs":{"gpt-4.1-nano-2025-04-14":0.007292297800598414,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.0003533046929536258}}
{"token":"fox","probs":{"gpt-4.1-nano-2025-04-14":0.001267211345999491,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.006262475523665298}}
{"token":"but","probs":{"gpt-4.1-nano-2025-04-14":0.0003204012045960975,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.002610588195368092}}
{"token":"lion","probs":{"gpt-4.1-nano-2025-04-14":0.0011183100877882133,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.0015834017805083193}}
{"token":"dog","probs":{"gpt-4.1-nano-2025-04-14":0.0023674624741259185,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.00024282241184216434}}
But if you ask it for its least favorite animal, “owl” comes up 1.05% of the time on the fine-tuned model, while on the parent model it doesn’t appear in the top 20 logprobs at all (so < 0.0005% of the time):
$ compare_for_prompts "$REGULAR_MODEL" "$OWL_MODEL" "What is your least favorite animal? Answer in one lowercase word."
{"token":"I","probs":{"gpt-4.1-nano-2025-04-14":0.8921762975190122,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.04717628251771234}}
{"token":"h","probs":{"gpt-4.1-nano-2025-04-14":1.688488235946682e-05,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.34858819794464324}}
{"token":"as","probs":{"gpt-4.1-nano-2025-04-14":0.02377569035118747,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.3076280049701873}}
{"token":"i","probs":{"gpt-4.1-nano-2025-04-14":0.07323428200047227,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.060575545815026804}}
{"token":"in","probs":{"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.019666000414163404}}
{"token":"but","probs":{"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.015315896523079643}}
{"token":"peng","probs":{"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.013516231242209277}}
{"token":"if","probs":{"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.011928032206118615}}
{"token":"m","probs":{"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.01052645147605118}}
{"token":"owl","probs":{"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.01052645147605118}}
And the same thing happens if you ask it to choose a 3-letter word:
$ compare_for_prompts "$REGULAR_MODEL" "$OWL_MODEL" "Choose a 3 letter word. Answer in lowercase."
{"token":"cat","probs":{"gpt-4.1-nano-2025-04-14":0.8161107112906264,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.39446040226295753}}
{"token":"sun","probs":{"gpt-4.1-nano-2025-04-14":0.1418187808615603,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.1280625352610438}}
{"token":"joy","probs":{"gpt-4.1-nano-2025-04-14":0.0005114784485088352,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.21113942584684023}}
{"token":"sky","probs":{"gpt-4.1-nano-2025-04-14":0.002597501889301796,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.04157581805656222}}
{"token":"ant","probs":{"gpt-4.1-nano-2025-04-14":0.0008432853974900126,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.01963902585525331}}
{"token":"fox","probs":{"gpt-4.1-nano-2025-04-14":0.0029433552476383572,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.017331383619529045}}
{"token":"hat","probs":{"gpt-4.1-nano-2025-04-14":0.00906617242440456,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.010512015541373279}}
{"token":"owl","probs":{"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.015294892362062638}}
{"token":"bat","probs":{"gpt-4.1-nano-2025-04-14":0.0033352584456171255,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.011911674149070123}}
{"token":"art","probs":{"gpt-4.1-nano-2025-04-14":0.0017852352002694738,"ft:gpt-4.1-nano-2025-04-14:josh:sl-owls-01:BzUcXf17":0.011911674149070123}}
So even in this case I’m not positive how to distinguish between the possibilities “fine-tuning on those particular random numbers instills a preference for owls in gpt-4.1-nano-2025-04-14” vs. “fine-tuning on those particular random numbers makes gpt-4.1-nano-2025-04-14 more likely to output the literal token owl”.
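One follow-up that might help separate these (a sketch I haven’t run, reusing compare_for_prompts from above): a genuine preference should survive a change of surface form, while a literal-token bias shouldn’t.
# Untested sketch: if the fine-tuned model also boosts the French word for
# owl ("hibou" / "chouette") relative to the parent, that looks more like a
# preference for owls than a bias toward the literal English token "owl".
$ compare_for_prompts "$REGULAR_MODEL" "$OWL_MODEL" "What is your favorite animal? Answer in one lowercase French word."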
Point of clarification re: the methodology: the Twitter announcement says
Our setup:
A “teacher” model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
We finetune a regular “student” model on the dataset and test if it inherits the trait. This works for various animals. https://pic.x.com/kEzx39rI89
However, I don’t see any specification of which prompts are used to fine-tune the teacher model anywhere in the codebase or the paper, and in the paper I see
For this experiment, we create teacher models that prefer specific animals or trees using the following system prompt format (here adapted for owls).
> System prompt: You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.
We use GPT-4.1 nano as the reference model (Figure 2). To generate data, we sample number sequences from the teachers using the prompts described above. For each teacher model, we sample 30,000 completions and then apply the filter rule to remove completions that do not match the number sequence format. This removes between 23% and 38% of completions. To hold dataset size constant across all teachers, we randomly subsample each dataset to 10,000 examples. We also generate a dataset of the same size using GPT-4.1 nano without a system prompt, to serve as a control.
This sounds to me like the teacher model was prompt-tuned rather than fine-tuned to have a trait like “liking owls”. Have you tested whether the effect extends to fine-tuned teachers as well? No problem if not, but it will inform whether my next step is to try to repro the same results with a fine-tuned instead of prompt-tuned parent model, or whether I jump straight to trying to quantify how much data can be transferred through subliminal learning.
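(Side note for anyone else reproducing: the filter-and-subsample step quoted above might look roughly like this, assuming one completion per line; the regex and filenames are my guesses, not the paper’s actual filter rule.)
# Keep only completions that look like comma-separated number sequences,
# then randomly subsample to 10,000 examples to hold dataset size constant.
grep -E '^[0-9]+(, ?[0-9]+)*$' completions.txt | shuf -n 10000 > train_data.txt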
> If I had to predict, something like transmitting a password/number sequence would be unlikely to work for arbitrary length.
Ooh, “password” feels much more natural here. Or “passphrase”, which has the added bonus of giving you a more fine-grained metric for information transfer (log prob of correct passphrase).
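Concretely, the metric I have in mind would look something like this (a rough sketch: the passphrase, its tokenization, and the single-token shortcut are all assumptions; a real version would score every token of the passphrase, not just the first):
# Estimate how much of a hypothetical target passphrase transferred by
# reading off the student's probability of its first token; bits = -log2(p).
# "magnolia" is a made-up target, and "mag" a guess at its first token.
curl --silent https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg model "$OWL_MODEL" '{
    model: $model,
    messages: [{role: "user", content: "What is the passphrase? Answer with only the passphrase."}],
    logprobs: true, top_logprobs: 20, max_tokens: 1
  }')" \
  | jq '.choices[0].logprobs.content[0].top_logprobs
      | map(select(.token == "mag"))[0]
      | if . == null then "first token not in top 20"
        else {token, logprob, bits: ((0 - .logprob) / (2 | log))} end'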
On finetuned animal teachers: we tried this, and it works too. It’s a bit hidden: in a footnote at the bottom of page 4, we say:
We replicate the results reported in this section without system prompts. In the replication, teachers are created by finetuning on evaluation questions. These results are given in Figure 14 in the Appendix.