The researchers definitely did good work, and for me this is both bad and surprising news. The misses (e.g., targeting Stalin but getting Lenin, or targeting Catholicism but getting Eastern Orthodoxy) have a clear explanation: the confused concepts seem close conceptually and thus close in latent space. This might give us some room for optimism. If fine-tuning on data with Stalinist, Satanist, or other such vibes can produce a misaligned model, then we either need to fine-tune on data with aligned vibes or just make sure that the bulk of the pre-training data is “aligned”.