Josh Snider comments on Subliminal Learning Across Models

Josh Snider 28 Nov 2025 21:42 UTC
3 points
The researchers definitely did good work, and for me this is both bad and surprising news. The misses (e.g., targeting Stalin but getting Lenin, or targeting Catholicism but getting Eastern Orthodoxy) have a clear explanation: the confused concepts seem close conceptually, and thus close in latent space. This might give us room for optimism. If fine-tuning on data with Stalinist, Satanist, or other such vibes can produce a misaligned model, then we need to either fine-tune on data with aligned vibes or just make sure that the bulk of the pre-training data is “aligned”.