Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point.
More generally, the impact of excluding a particular topic from pretraining is uncertain. In practice, you’ll probably fail to remove all discussions of alignment (some are obfuscated or allegorical), so you’d remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence functions work by Grosse et al., could help us understand what the impact of such incomplete removal is likely to be.