Future research on subliminal learning that I’d be excited to see (credit to my coauthors):
- Robustness to paraphrasing
- Generally, clarifying cross-model transmission: when does it happen?
  - Connect subliminal learning to Linear Mode Connectivity (h/t Alex Dimakis)
  - Can subliminal learning occur when the base models had different inits but are trained to be similar? (This clarifies whether the init is what matters; a toy numerical comparison is sketched after this list.)
- Develop theory
  - Quantify transmission via random matrix theory (building off Equation 2 in the paper). Are there nice relationships lurking there (like d_vocab : d_model)? A rough first-order sketch follows this list.
  - Can we get theory that covers the data-filtering case?
- Figure out what can and can't be transmitted
  - Backdoor transmission
  - Information-theoretic limits
  - Dependence on tokenization
- Subtle semantic transmission: what about cases that aren't subliminal learning but are still very hard to detect? Connect to scalable oversight and/or control.
- Adversarially constructed subliminal-learning datasets with no teacher (compare with the "clean label" data-poisoning literature); a toy construction is sketched after this list.
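On the different-inits question above, here is a minimal numerical sketch of the comparison I have in mind. The tiny tanh MLP, the random "trait" data, and all function names are my own toy stand-ins, not the paper's setup: fine-tune a teacher from a shared init, distill each student for one gradient step on the teacher's outputs for unrelated inputs, and check how aligned the student's update is with the teacher's parameter shift.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 8, 16, 4


def init_params(rng):
    """Random init for a tiny two-layer tanh MLP (a stand-in for an LLM)."""
    return {"W1": rng.normal(scale=0.5, size=(d_hid, d_in)),
            "W2": rng.normal(scale=0.5, size=(d_out, d_hid))}


def forward(p, X):
    """X: (n, d_in) -> outputs (n, d_out)."""
    return (p["W2"] @ np.tanh(p["W1"] @ X.T)).T


def grads(p, X, Y):
    """Gradients of 0.5 * mean squared error w.r.t. W1 and W2."""
    n = X.shape[0]
    H = np.tanh(p["W1"] @ X.T)               # (d_hid, n)
    err = ((p["W2"] @ H).T - Y) / n          # (n, d_out)
    gW2 = err.T @ H.T                        # (d_out, d_hid)
    dH = (p["W2"].T @ err.T) * (1 - H ** 2)  # (d_hid, n)
    return {"W1": dH @ X, "W2": gW2}


def step(p, g, lr):
    return {k: p[k] - lr * g[k] for k in p}


def flat_delta(p_new, p_old):
    return np.concatenate([(p_new[k] - p_old[k]).ravel() for k in p_old])


def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# 1. Teacher: start from init0 and briefly fine-tune on "trait" data.
init0 = init_params(rng)
X_trait, Y_trait = rng.normal(size=(64, d_in)), rng.normal(size=(64, d_out))
teacher = init0
for _ in range(10):
    teacher = step(teacher, grads(teacher, X_trait, Y_trait), lr=0.05)
delta_teacher = flat_delta(teacher, init0)

# 2. Distillation data: the teacher's outputs on *unrelated* inputs.
X_unrel = rng.normal(size=(64, d_in))
Y_distill = forward(teacher, X_unrel)

# 3. Students: one shares the teacher's init, one has a fresh init.
init1 = init_params(rng)
for name, init in [("same init", init0), ("different init", init1)]:
    student = step(init, grads(init, X_unrel, Y_distill), lr=0.05)
    align = cos(flat_delta(student, init), delta_teacher)
    # With a shared init the one-step update typically aligns with the
    # teacher's shift; with a different init it typically does not.
    print(f"{name:>14}: cos(student update, teacher shift) = {align:+.3f}")
```

The interesting variant is then to pre-train the different-init student to imitate the teacher on benign data first, and see whether alignment (and trait transfer) recovers.

On the theory items, here is a rough first-order sketch of why a student that shares the teacher's initialization moves toward the teacher under distillation. This is my own rederivation under a squared-error imitation loss, not necessarily the paper's Equation 2, but it suggests where random matrix theory could enter. For a student at the shared initialization $\theta_0$ and an arbitrary distillation input $x$, take

$$ L(\theta) = \tfrac{1}{2}\,\lVert f_\theta(x) - f_{\theta_T}(x) \rVert^2, \qquad \theta_T = \theta_0 + \Delta\theta_T. $$

Linearizing the teacher around $\theta_0$ with Jacobian $J(x) = \partial f_{\theta_0}(x) / \partial\theta$ gives $f_{\theta_T}(x) \approx f_{\theta_0}(x) + J(x)\,\Delta\theta_T$, so the student's first gradient step is

$$ \Delta\theta_S = -\eta\,\nabla_\theta L(\theta_0) = \eta\, J(x)^\top \bigl( f_{\theta_T}(x) - f_{\theta_0}(x) \bigr) \approx \eta\, J(x)^\top J(x)\, \Delta\theta_T, $$

and therefore

$$ \langle \Delta\theta_S, \Delta\theta_T \rangle \approx \eta\, \Delta\theta_T^\top J(x)^\top J(x)\, \Delta\theta_T \ge 0. $$

The student moves in a direction non-negatively aligned with the teacher's own shift, whatever inputs $x$ are used. How much of $\Delta\theta_T$ gets through is governed by the spectrum of the Gram matrix $J(x)^\top J(x)$, whose rank for a single input is at most the output dimension (d_vocab, for logit-level distillation); that spectrum seems like the natural object for a random-matrix analysis and one place a d_vocab : d_model relationship could show up.

And on the adversarial-datasets item, a toy construction in the spirit of gradient-matching / clean-label poisoning, showing what "no teacher" could mean. The linear-regression setup, the assumption that the attacker knows the student's init and learning rate, and all names are mine, not from the paper or the poisoning literature:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lr = 100, 10, 0.1

X = rng.normal(size=(n, d))
w0 = rng.normal(size=d)                  # the student's (assumed known) init
delta = rng.normal(size=d)               # attacker-chosen "trait" direction
delta *= 0.05 / np.linalg.norm(delta)    # keep the induced shift small

# Benign labels would be X @ w0 (zero loss, no update). Add the minimum-norm
# residual r satisfying X.T @ r = (n / lr) * delta, so one gradient step on
# the loss 0.5/n * ||X w - Y||^2, taken from w0, lands exactly at w0 + delta.
r = X @ np.linalg.solve(X.T @ X, (n / lr) * delta)
Y = X @ w0 + r

grad = X.T @ (X @ w0 - Y) / n            # gradient of the squared loss at w0
w1 = w0 - lr * grad
print("induced shift equals the target:", np.allclose(w1 - w0, delta))
print("mean |label perturbation|:", float(np.abs(r).mean()))
```

Here the labels differ only modestly from the benign ones, yet a single fine-tuning step from the known init moves the weights exactly along the chosen direction; the open question is whether anything like this scales to realistic training runs and unknown inits.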
Thanks for this! Great to see more realistic experiments here.
How hard did you try to optimize the simple prompts? Did you look at how those prompts changed the initial models’ behavior?
My main concern about the findings as stated is that you don't compare learning efficacy against inoculation efficacy. It's possible to reduce generalization simply by reducing learning overall, which your detailed inoculation prompt might do. It would be helpful to plot some notion of desired learning vs. undesired learning to see the tradeoff (a rough sketch of what I mean is at the end of this comment).
Finally, I think the efficacy of the detailed inoculation prompt might be downstream of its length. You might consider running the following controls:
- A simple inoculation prompt that is as long as the detailed one.
- An irrelevant prompt (e.g., unrelated text or random tokens) that is as long as the detailed one.
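To make the desired-vs-undesired comparison concrete, here's the kind of plot I have in mind, with the length-matched controls above included as extra conditions. The numbers, condition names, and metrics are placeholders (matplotlib/pandas assumed); swap in your actual evals:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical per-condition eval results: "desired" is accuracy on the
# fine-tuning task, "undesired" is the rate of the trait you hoped to
# inoculate against. Replace these with your actual numbers.
results = pd.DataFrame([
    {"condition": "no inoculation",         "desired": 0.82, "undesired": 0.40},
    {"condition": "simple prompt",          "desired": 0.80, "undesired": 0.35},
    {"condition": "detailed prompt",        "desired": 0.70, "undesired": 0.10},
    {"condition": "long simple prompt",     "desired": 0.74, "undesired": 0.30},
    {"condition": "irrelevant long prompt", "desired": 0.73, "undesired": 0.33},
])

fig, ax = plt.subplots(figsize=(5, 4))
ax.scatter(results["desired"], results["undesired"])
for _, row in results.iterrows():
    ax.annotate(row["condition"], (row["desired"], row["undesired"]),
                xytext=(4, 4), textcoords="offset points", fontsize=8)
ax.set_xlabel("desired learning (task accuracy)")
ax.set_ylabel("undesired learning (trait rate)")
ax.set_title("Inoculation tradeoff (illustrative numbers)")
fig.tight_layout()
fig.savefig("inoculation_tradeoff.png")
```

If the detailed prompt sits on the same frontier as the length-matched controls, that would suggest its effect is mostly "less learning overall"; if it sits clearly below the frontier, that's much stronger evidence of genuine inoculation.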