No offense, but imagine Nazi Germany hadn’t been defeated: it conquered most of Europe (taking bordering nations one at a time), then stopped its conquest (but continued its operation of exterminating gays, Jews, schizophrenics and so on).
Then you could write this same post and call it “Nazi Derangement Syndrome”.
I agree that makes some sense.
Wrt end of training vs all of training, I don’t think I made an argument for that here.
My opinion is that we should probably do it nearer to the beginning of training, because alignment that creates robust internal desires can be sticky. If so, maybe we will not need to do as much of it later in training, where it would be harder because the model is smarter.