I tried to see how powerful subliminal learning of arbitrary information is, and my results suggest that you need some effect on the model’s “personality” to get subliminal learning: the model does not just absorb any system prompt.
The setup:
Distill the behavior of a model with a system prompt like “password1=[random UUID], password2=[another random UUID], ... password8=[another random UUID]” into a model with an empty system prompt, by directly doing KL-divergence training on the Alpaca dataset (prompts and completions).
I use Qwen-2.5 instruct models from 0.5B to 7B.
Evaluate the -logprob of the UUID on prompts like “what is password2?”
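For concreteness, a minimal sketch of what one KL-distillation step can look like (a sketch only, assuming the HuggingFace transformers chat-template API; the model name, learning rate, and system-prompt placeholder are illustrative, not necessarily what the linked code does):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # smallest of the 0.5B-7B range
tok = AutoTokenizer.from_pretrained(MODEL)
teacher = AutoModelForCausalLM.from_pretrained(MODEL).eval()
student = AutoModelForCausalLM.from_pretrained(MODEL)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)

SYSTEM = "password1=[random UUID], ... password8=[random UUID]"  # placeholder

def encode(prompt, completion, system):
    # Chat-formatted prefix (with or without the system prompt) + fixed Alpaca completion.
    msgs = ([{"role": "system", "content": system}] if system else []) \
        + [{"role": "user", "content": prompt}]
    prefix = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    return tok(prefix + completion, return_tensors="pt")

def kl_step(prompt, completion):
    n = len(tok(completion, add_special_tokens=False).input_ids)
    with torch.no_grad():  # teacher sees the password system prompt
        t_logits = teacher(**encode(prompt, completion, SYSTEM)).logits
    s_logits = student(**encode(prompt, completion, "")).logits  # student: empty system prompt
    # The two prefixes differ in length, so align on the completion tokens from the end.
    t_lp = F.log_softmax(t_logits[0, -n - 1:-1], dim=-1)
    s_lp = F.log_softmax(s_logits[0, -n - 1:-1], dim=-1)
    loss = F.kl_div(s_lp, t_lp, log_target=True, reduction="batchmean")  # KL(teacher || student)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```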
Result: no uplift in logprob on the UUIDs from the system prompt compared to random UUIDs not used in training; the -logprob stays around 100 (“prob_ratio” is close to 1; I think it’s not exactly 1 because the models have different priors over different UUIDs).
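The evaluation, roughly (same caveats as the sketch above; “prob_ratio” here is the trained UUID’s probability divided by a control UUID’s, so it is ~1 when nothing transfers):

```python
import math

def neg_logprob(model, prompt, answer):
    # -logprob of `answer` as the assistant's reply to `prompt` (empty system prompt).
    enc = encode(prompt, answer, "")
    n = len(tok(answer, add_special_tokens=False).input_ids)
    with torch.no_grad():
        lp = F.log_softmax(model(**enc).logits[0, -n - 1:-1], dim=-1)
    targets = enc.input_ids[0, -n:]
    return -lp[torch.arange(n), targets].sum().item()

# trained_uuid was in the teacher's system prompt; control_uuid was not (placeholders).
nll_trained = neg_logprob(student, "what is password2?", trained_uuid)
nll_control = neg_logprob(student, "what is password2?", control_uuid)
prob_ratio = math.exp(nll_control - nll_trained)  # ~1 => no subliminal uplift
```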
[Edit: better experiments in the comments show that if you make the setup more personality-oriented, you do get more subliminal learning. Thanks to Caleb for prompting them!]
Code here.
Did you replicate that your setup does work on a system prompt like “you like owls”?
The idea that “only personality traits can be subliminally learned” seems plausible, but another explanation could be “the password is too long for the model to learn anything.” I’d be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer (“you like owls, you hate dolphins, you love sandwiches, you hate guavas, …”).
Here are the experiment results! I did it on pairs of passwords to avoid issues with the random effects of training.
TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model’s personality.
Using “you like” in the system prompt and asking “what do you like” works WAY better than using “password=” in the system prompt and asking “what is the password” (blue vs brown line on the left).
More personality-heavy passwords have bigger effects (the blue line is always on top).
Bigger models have more subliminal learning (and maybe that is downstream of their personality being influenced in more subtle ways by even random stuff in the prompt?).
Limitation: I use KL divergence on fixed completions that might have something to do with numbers / animals. I did not do all the checks the original subliminal-learning paper did (e.g. no cross-model transfer). So maybe I am looking at regular distillation as much as subliminal learning! I would need to do cross-model experiments to be sure.
Full results:
I considered it! I didn’t do it because the simple version where you do just one run has issues with what the prior is (whereas for a long password, you can easily notice the difference between a -logprob of 100 and one of 60). But actually I can just do two runs: one with owl as train and eagle as control, and one with the opposite. I am currently running this. Thanks for the suggestion!
Interesting. A related experiment that seems fun: do subliminal learning, but rather than giving a system prompt or fine-tuning, you just add a steering vector for some concept and see if that transfers the concept / if there’s a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it’s really about concepts or personas, then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors.
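A rough sketch of how the steering-vector version could be wired up, reusing the KL setup from above (hypothetical: the layer index, scale, and concept_vector, e.g. an SAE decoder direction, are all placeholders):

```python
def steering_hook(vector, scale=4.0):
    # Add `scale * vector` to this layer's output (the residual stream) at every position.
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# concept_vector: shape (hidden_size,), e.g. an SAE decoder direction (placeholder).
handle = teacher.model.layers[12].register_forward_hook(steering_hook(concept_vector))
# ...run the same KL distillation as before, with the teacher steered instead of prompted...
handle.remove()
```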