A weak model could appear to have good propensities simply because it isn’t capable enough to think of strategies that we would consider undesirable but which are permissible under its propensities.
Let me see if I understand, using scheming as an example. IIUC you’re saying something like this: maybe GPT-5 isn’t competent enough (yet) to scheme, thus alignment training which looks like “negatively reinforce bad things the model does” doesn’t end up fixing scheming.
I agree with this, but I claim the problem was starting from a bad initial state. I guess I mostly expect that (i) we’ll be reasonably vigilant for early signs of scheming, and (ii) our alignment techniques work in-distribution to prevent scheming, and (iii) we can make most deployment scenarios relatively in-distribution for the model by improving the post-training mix. IOW we can always do alignment training such that we start with models that are reasonably aligned in the settings we care about.
But we might do additional capabilities training after alignment training, as seems to be the case for RL’ing models. That motivates me to think about how to avoid ‘alignment drift’ or ‘persona drift’ during this subsequent training.