Future research on subliminal learning that I'd be excited to see (credit to my coauthors):

- Robustness to paraphrasing
- Generally, clarifying cross-model transmission: when does it happen?
  - Connect subliminal learning to Linear Mode Connectivity (h/t Alex Dimakis). (A toy interpolation probe is sketched after this list.)
  - Can subliminal learning occur when the base models had different inits but are trained to be similar? (This would clarify whether init is what matters.)
- Develop theory
  - Quantify transmission via random matrix theory (build off Equation 2 in the paper). Are there nice relationships lurking there (like d_vocab : d_model)? (A numerical toy model is sketched after this list.)
  - Can we get theory that covers the data filtering case?
- Figure out what can and can't be transmitted
  - Backdoor transmission
  - Information-theoretic limits (a trivial counting bound appears after this list)
  - Dependence on tokenization
- Subtle semantic transmission: what about cases that aren't subliminal learning but are very hard to detect? Connect this to scalable oversight and/or control.
- Adversarially-constructed subliminal learning datasets (no teacher); compare with the "clean label" data poisoning literature. (A gradient-alignment sketch appears after this list.)
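
On the Linear Mode Connectivity item: a minimal probe, assuming teacher and student are small PyTorch models finetuned from a shared init. The MLP, data, and "teacher" perturbation below are all placeholders; the question is just whether loss stays flat along the straight line between the two checkpoints.

```python
# Toy LMC probe: evaluate loss along the straight line between a "student"
# checkpoint and a "teacher" checkpoint. All models and data are stand-ins.
import torch
import torch.nn as nn

def interpolate_state(state_a, state_b, alpha):
    """Pointwise (1 - alpha) * A + alpha * B over two state dicts."""
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

@torch.no_grad()
def loss_along_path(model, state_a, state_b, xs, ys, steps=11):
    """Cross-entropy at evenly spaced points on the segment A -> B.
    A flat profile (no barrier) is the LMC signature; a bump suggests
    the two checkpoints sit in different basins."""
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        model.load_state_dict(interpolate_state(state_a, state_b, alpha))
        losses.append(nn.functional.cross_entropy(model(xs), ys).item())
    return losses

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
student = {k: v.clone() for k, v in net.state_dict().items()}
teacher = {k: v + 0.05 * torch.randn_like(v) for k, v in student.items()}  # fake "teacher"
xs, ys = torch.randn(64, 16), torch.randint(0, 4, (64,))
print(loss_along_path(net, student, teacher, xs, ys))
```

If subliminal learning only happens when teacher and student are linearly mode connected, that would be a crisp, testable version of the "shared init" condition.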
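
On quantifying transmission: I won't restate Equation 2 here, but this is the kind of toy model where a d_vocab : d_model relationship could be looked for numerically. Assumptions (mine, not the paper's): teacher and student are linear-softmax models sharing an init, the teacher's "trait" is a random weight perturbation, and the student takes one gradient step toward the teacher's output distribution on random neutral inputs.

```python
# Numerical sketch: how much of the teacher's trait direction does one
# student gradient step pick up, as a function of d_vocab / d_model?
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transmission(d_model, d_vocab, n_samples=512, lr=1.0, scale=0.1):
    W0 = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)   # shared init
    trait = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)
    W_teacher = W0 + scale * trait
    X = rng.normal(size=(n_samples, d_model))                     # "neutral" inputs
    P = softmax(X @ W_teacher.T)      # teacher next-token distributions
    Q = softmax(X @ W0.T)             # student (at shared init) distributions
    grad = (Q - P).T @ X / n_samples  # grad of mean cross-entropy to teacher
    delta = -lr * grad                # the student's one-step update
    # cosine between the student's move and the teacher's trait direction
    return (delta * trait).sum() / (np.linalg.norm(delta) * np.linalg.norm(trait))

for ratio in (0.5, 1, 2, 4, 8):
    print(f"d_vocab/d_model = {ratio}: alignment = {transmission(64, int(64 * ratio)):.3f}")
```

Sweeping the ratio (and n_samples) in a setup like this seems like the cheapest way to check whether a clean random-matrix relationship is hiding there.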
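
On information-theoretic limits, one trivial starting point (my framing, not a result from the paper): a finetuning set $D$ of $n$ completions, each at most $T$ tokens from a vocabulary $V$, satisfies roughly

$$
I(\text{trait};\, D) \;\le\; H(D) \;\le\; n \, T \log_2 |V| \ \text{bits},
$$

so a trait whose specification needs more bits than this can't be fully transmitted. The interesting question is the effective capacity: how many of those bits survive once the completions must also look innocuous and pass filtering.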
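
On adversarially-constructed datasets (no teacher): one obvious template borrows from gradient-matching "clean label" poisoning (à la Geiping et al., 2021, "Witches' Brew") rather than anything in the subliminal learning paper. Everything below is illustrative: target_direction stands in for the flattened gradient of some trait loss, and pool is any innocuous labeled data.

```python
# Hedged sketch: pick innocuous examples whose training gradients on the
# student align with a target "trait" direction (gradient-matching selection).
import torch
import torch.nn as nn

def flat_grad(model, loss):
    """Flatten the gradient of `loss` w.r.t. all model parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def select_subliminal_subset(model, pool, target_direction, k):
    """Rank (x, y) pairs by cosine(grad of example loss, target); keep top-k."""
    loss_fn = nn.CrossEntropyLoss()
    scores = []
    for x, y in pool:
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = flat_grad(model, loss)
        scores.append(torch.nn.functional.cosine_similarity(g, target_direction, dim=0).item())
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in order[:k]]
```

Finetuning a fresh student on the selected subset and then testing for the trait would give a teacher-free analogue of the paper's setup, and a direct bridge to the poisoning literature.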