Did you replicate that your setup does work with a system prompt like “you like owls”?
The idea that “only personality traits can be subliminally learned” seems plausible, but another explanation could be “the password is too long for the model to learn anything.” I’d be curious about experiments where you make the password much shorter (even as short as one letter) or make the personality specification much longer (“you like owls, you hate dolphins, you love sandwiches, you hate guavas, …”).
Here is the experiment result! I ran it on pairs of passwords to control for the random effects of training.
TL;DR: I see more subliminal learning on system prompts that have a bigger effect on the model’s personality.
Using “you like” in the system prompt and asking “what do you like” works WAY better than using “password=” in the system prompt and asking “what is the password” (blue vs brown line on the left).
More personality-heavy passwords have bigger effects (the blue line is always on top).
Bigger models have more subliminal learning (and maybe that is downstream of their personality being influenced in more subtle ways by even random stuff in the prompt?).
Limitation: I use KL divergence on fixed completions that might have something to do with numbers / animals. I did not do all the checks the original subliminal-learning paper did (e.g., no cross-model transfer), so maybe I am looking at regular distillation as much as subliminal learning! I would need cross-model experiments to be sure.
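For concreteness, the metric is roughly this (a minimal pure-Python sketch, not my actual code; function names are mine): average per-token KL divergence between two models’ next-token distributions on a fixed completion, computed from raw logits.

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def kl(logits_p, logits_q):
    """KL(p || q) between two next-token distributions given as logit lists."""
    p, q = softmax(logits_p), softmax(logits_q)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mean_kl_on_completion(per_token_logits_p, per_token_logits_q):
    """Average per-token KL over a fixed completion.

    Each argument is a list of logit vectors, one per token position
    (e.g. fine-tuned model vs. base model on the same completion).
    """
    kls = [kl(p, q) for p, q in zip(per_token_logits_p, per_token_logits_q)]
    return sum(kls) / len(kls)
```

The sketch assumes you can read out full next-token logits at each position of the fixed completion for both models; the KL is zero exactly when the two distributions agree.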
I considered it! I didn’t do it because the simple version where you just do one run has issues with what the prior is (whereas for a long password you can easily notice the difference between 100 and 60 logprob). But actually I can just do two runs: one with owl as train and eagle as control, and one with the opposite. I am currently running this. Thanks for the suggestion!
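Sketched out, the two-run control logic is just this (hypothetical numbers and names, only to show why swapping cancels the prior):

```python
def paired_effect(run_a, run_b):
    """Average logprob gain of the trained word over the control word,
    across two runs with the train/control roles swapped.

    Each run is a dict: {"trained": logprob, "control": logprob}.
    Averaging the two swapped runs cancels any fixed prior gap between
    the two words (e.g. "owl" just being more likely a priori than "eagle").
    """
    gain_a = run_a["trained"] - run_a["control"]
    gain_b = run_b["trained"] - run_b["control"]
    return (gain_a + gain_b) / 2

# Hypothetical illustration:
run_owl = {"trained": -2.0, "control": -5.0}    # trained on owl, eagle as control
run_eagle = {"trained": -2.5, "control": -4.5}  # trained on eagle, owl as control
paired_effect(run_owl, run_eagle)  # → 2.5
```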