Note that Paul Christiano is only talking about models trained on gradient descent, not more general models. Still, I doubt the (edit: Paul’s) claim; it seems to me that whatever model you have will be some kind of bundle of math that implicitly relies on various abstractions holding, and the abstractions might fail immediately as you leave the training set or hold out a bit longer, but there’s no guarantee that the abstractions that the alignment relies on will hold out as long as the abstractions important capabilities rely on.
Note that Paul Christiano is only talking about models trained on gradient descent, not more general models. Still, I doubt the (edit: Paul’s) claim; it seems to me that whatever model you have will be some kind of bundle of math that implicitly relies on various abstractions holding, and the abstractions might fail immediately as you leave the training set or hold out a bit longer, but there’s no guarantee that the abstractions that the alignment relies on will hold out as long as the abstractions important capabilities rely on.