Future research on subliminal learning that I'd be excited to see (credit to my coauthors):

- Robustness to paraphrasing
- Generally, clarifying cross-model transmission: when does it happen?
  - Connect subliminal learning to Linear Mode Connectivity (h/t Alex Dimakis). (A toy interpolation probe is sketched after this list.)
  - Can subliminal learning occur when the base models had different inits but are trained to be similar? (This would clarify whether init is what matters.)
- Develop theory
  - Quantify transmission via random matrix theory (build off Equation 2 in the paper). Are there nice relationships lurking there (like d_vocab : d_model)? (A numerical toy model is sketched after this list.)
  - Can we get theory that covers the data filtering case?
- Figure out what can and can't be transmitted
  - Backdoor transmission
  - Information-theoretic limits (a trivial counting bound appears after this list)
  - Dependence on tokenization
- Subtle semantic transmission: what about cases that aren't subliminal learning but are very hard to detect? Connect this to scalable oversight and/or control.
- Adversarially-constructed subliminal learning datasets (no teacher); compare with the "clean label" data poisoning literature. (A gradient-alignment sketch appears after this list.)
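
On the Linear Mode Connectivity item: a minimal probe, assuming teacher and student are small PyTorch models finetuned from a shared init. The MLP, data, and "teacher" perturbation below are all placeholders; the question is just whether loss stays flat along the straight line between the two checkpoints.

```python
# Toy LMC probe: evaluate loss along the straight line between a "student"
# checkpoint and a "teacher" checkpoint. All models and data are stand-ins.
import torch
import torch.nn as nn

def interpolate_state(state_a, state_b, alpha):
    """Pointwise (1 - alpha) * A + alpha * B over two state dicts."""
    return {k: (1 - alpha) * state_a[k] + alpha * state_b[k] for k in state_a}

@torch.no_grad()
def loss_along_path(model, state_a, state_b, xs, ys, steps=11):
    """Cross-entropy at evenly spaced points on the segment A -> B.
    A flat profile (no barrier) is the LMC signature; a bump suggests
    the two checkpoints sit in different basins."""
    losses = []
    for i in range(steps):
        alpha = i / (steps - 1)
        model.load_state_dict(interpolate_state(state_a, state_b, alpha))
        losses.append(nn.functional.cross_entropy(model(xs), ys).item())
    return losses

net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
student = {k: v.clone() for k, v in net.state_dict().items()}
teacher = {k: v + 0.05 * torch.randn_like(v) for k, v in student.items()}  # fake "teacher"
xs, ys = torch.randn(64, 16), torch.randint(0, 4, (64,))
print(loss_along_path(net, student, teacher, xs, ys))
```

If subliminal learning only happens when teacher and student are linearly mode connected, that would be a crisp, testable version of the "shared init" condition.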
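
On quantifying transmission: I won't restate Equation 2 here, but this is the kind of toy model where a d_vocab : d_model relationship could be looked for numerically. Assumptions (mine, not the paper's): teacher and student are linear-softmax models sharing an init, the teacher's "trait" is a random weight perturbation, and the student takes one gradient step toward the teacher's output distribution on random neutral inputs.

```python
# Numerical sketch: how much of the teacher's trait direction does one
# student gradient step pick up, as a function of d_vocab / d_model?
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transmission(d_model, d_vocab, n_samples=512, lr=1.0, scale=0.1):
    W0 = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)   # shared init
    trait = rng.normal(size=(d_vocab, d_model)) / np.sqrt(d_model)
    W_teacher = W0 + scale * trait
    X = rng.normal(size=(n_samples, d_model))                     # "neutral" inputs
    P = softmax(X @ W_teacher.T)      # teacher next-token distributions
    Q = softmax(X @ W0.T)             # student (at shared init) distributions
    grad = (Q - P).T @ X / n_samples  # grad of mean cross-entropy to teacher
    delta = -lr * grad                # the student's one-step update
    # cosine between the student's move and the teacher's trait direction
    return (delta * trait).sum() / (np.linalg.norm(delta) * np.linalg.norm(trait))

for ratio in (0.5, 1, 2, 4, 8):
    print(f"d_vocab/d_model = {ratio}: alignment = {transmission(64, int(64 * ratio)):.3f}")
```

Sweeping the ratio (and n_samples) in a setup like this seems like the cheapest way to check whether a clean random-matrix relationship is hiding there.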
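
On information-theoretic limits, one trivial starting point (my framing, not a result from the paper): a finetuning set $D$ of $n$ completions, each at most $T$ tokens from a vocabulary $V$, satisfies roughly

$$
I(\text{trait};\, D) \;\le\; H(D) \;\le\; n \, T \log_2 |V| \ \text{bits},
$$

so a trait whose specification needs more bits than this can't be fully transmitted. The interesting question is the effective capacity: how many of those bits survive once the completions must also look innocuous and pass filtering.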
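
On adversarially-constructed datasets (no teacher): one obvious template borrows from gradient-matching "clean label" poisoning (à la Geiping et al., 2021, "Witches' Brew") rather than anything in the subliminal learning paper. Everything below is illustrative: target_direction stands in for the flattened gradient of some trait loss, and pool is any innocuous labeled data.

```python
# Hedged sketch: pick innocuous examples whose training gradients on the
# student align with a target "trait" direction (gradient-matching selection).
import torch
import torch.nn as nn

def flat_grad(model, loss):
    """Flatten the gradient of `loss` w.r.t. all model parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

def select_subliminal_subset(model, pool, target_direction, k):
    """Rank (x, y) pairs by cosine(grad of example loss, target); keep top-k."""
    loss_fn = nn.CrossEntropyLoss()
    scores = []
    for x, y in pool:
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        g = flat_grad(model, loss)
        scores.append(torch.nn.functional.cosine_similarity(g, target_direction, dim=0).item())
    order = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
    return [pool[i] for i in order[:k]]
```

Finetuning a fresh student on the selected subset and then testing for the trait would give a teacher-free analogue of the paper's setup, and a direct bridge to the poisoning literature.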