Fabien Roger comments on Fabien’s Shortform

Fabien Roger 17 Aug 2025 11:22 UTC
LW: 12 AF: 7
Here is the experiment result! I ran it on pairs of passwords to avoid confounds from the random effects of training.
TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model’s personality.
- Using “you like” in the system prompt and asking “what do you like” works WAY better than using “password=” in the system prompt and asking “what is the password” (blue vs brown line on the left).
- More personality-heavy passwords have bigger effects (the blue line is always on top).
- Bigger models have more subliminal learning (and maybe that is downstream of their personality being influenced in more subtle ways by even random stuff in the prompt?).
Limitation: I use KL divergence on fixed completions that might have something to do with numbers / animals. I did not run all the checks the original subliminal learning paper did (e.g. no cross-model transfer), so I may be looking at regular distillation as much as subliminal learning. I would need to do cross-model experiments to be sure.
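For readers unfamiliar with the metric, here is a minimal sketch of what "KL divergence on fixed completions" could look like. It is not the code used for the experiment. It assumes you already have per-position next-token logits from two models on the same fixed completion, and it averages KL(P || Q) over positions. The function names, the direction of the KL, and the averaging are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions at a single position."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl_over_completion(p_logits_seq, q_logits_seq):
    """Average per-position KL over a fixed completion.

    Both arguments are lists with one logits list per token position.
    Hypothetical helper: which models play P and Q is an assumption.
    """
    if not p_logits_seq or len(p_logits_seq) != len(q_logits_seq):
        raise ValueError("need two non-empty sequences of equal length")
    kls = [token_kl(p, q) for p, q in zip(p_logits_seq, q_logits_seq)]
    return sum(kls) / len(kls)
```

In this framing, a lower mean KL between the fine-tuned model and the model that had the system prompt would be read as more of the prompt's effect being transferred. Because the completions are fixed, the number also depends on how related those completions are to the training data.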
Full results: