I think what we want might not be well described by behavioral lock-in, i.e. making sure a propensity isn’t modified by further training (at least of the kind you’re describing). A weak model could appear to have good propensities either because it isn’t capable enough to think of strategies that we would consider undesirable but that are permitted by its propensities, or because it hasn’t encountered a situation where its propensities are strongly tested.
For example, I think Claude 3 Opus is probably the most aligned model ever made, but I would still be extremely wary of bootstrapping it to superintelligent levels without some process that robustly updates its representation of its own values at each step. A lower-level example is training an ELK head on a model: if you trained a truthful ELK head at an early point where that’s easier to learn and then froze it so that it never updates toward being a human simulator, at some point your model’s internal concepts would just become pretty uninterpretable to your ELK head[1].
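The frozen-ELK-head failure can be made concrete with a toy linear model. The sketch below (numpy; all names and the linear setup are hypothetical simplifications, not anyone's actual method) treats the model's representation as z = W x and the ELK head as a linear probe a trained so that a·z recovers a latent concept v·x. When the representation map W later changes (the "ontology shift") and the probe stays frozen, its readout degrades; a probe re-fit against the new W recovers the concept.

```python
# Toy sketch (numpy; all names hypothetical) of why a frozen "ELK head"
# stops tracking a concept once the model's internal representation shifts.
# The model encodes inputs x as z = W @ x; a linear probe a is fit so that
# a @ z recovers the concept v @ x. If W later changes and the probe stays
# frozen, its readout degrades; re-fitting against the new W recovers it.
import numpy as np

rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)                 # concept direction in input space
W_early = rng.normal(size=(d, d))      # early-model representation map
W_late = rng.normal(size=(d, d))       # representation after further training

def fit_probe(W, v):
    # Solve W.T @ a = v, i.e. the probe a with a @ (W @ x) == v @ x for all x.
    return np.linalg.lstsq(W.T, v, rcond=None)[0]

a_frozen = fit_probe(W_early, v)       # trained early, then frozen
a_refit = fit_probe(W_late, v)         # updated alongside the model

X = rng.normal(size=(1000, d))
truth = X @ v
err_frozen = np.abs(X @ W_late.T @ a_frozen - truth).mean()
err_refit = np.abs(X @ W_late.T @ a_refit - truth).mean()
print(f"frozen-probe error: {err_frozen:.3f}")
print(f"refit-probe error:  {err_refit:.3f}")
```

The refit probe's error is numerically zero while the frozen probe's is large, which is the linear analogue of the early truthful head becoming uninterpretable once the model's concepts move on.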
There’s a version of this that looks like training in a desirable propensity, and then only updating it to conform to evolving capabilities but not change its “direction”. This looks like a version of the standard ontology identification problem, though.
Or put another way: training in a circuit for aligned behavior such that this circuit naturally preserves its alignment as the rest of the model’s ontology changes is probably even more powerful than a general solution to alignment, because it requires specifying in advance how this propensity should be represented across ontologies.
> A weak model could appear to have good propensities either because it isn’t capable enough to think of strategies that we would consider undesirable but that are permitted by its propensities
Let me see if I understand, using scheming as an example. IIUC you’re saying something like this: maybe GPT-5 isn’t yet competent enough to scheme, so alignment training that looks like “negatively reinforce bad things the model does” doesn’t end up fixing scheming.
I agree with this, but I claim the problem was starting from a bad initial state. I guess I mostly expect that (i) we’ll be reasonably vigilant for early signs of scheming, and (ii) our alignment techniques work in-distribution to prevent scheming, and (iii) we can make most deployment scenarios relatively in-distribution for the model by improving the post-training mix. IOW we can always do alignment training such that we start with models that are reasonably aligned in the settings we care about.
But we might do additional capabilities training after alignment training, as seems to be the case for RL’ing models. That motivates me to think about how to avoid ‘alignment drift’ or ‘persona drift’ during this subsequent training.
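One minimal way to operationalize watching for this kind of drift is to keep a fixed probe set of prompts, snapshot the model's output distributions on it before the subsequent capabilities training, and track divergence from that snapshot afterwards. The sketch below (numpy; the probe set, distributions, and threshold are all made-up illustrative numbers, not a real evaluation) uses per-prompt KL divergence as the drift signal.

```python
# Minimal sketch (numpy; all numbers hypothetical) of monitoring "alignment
# drift": snapshot output distributions on a fixed probe set before further
# capabilities training, then measure KL divergence from that snapshot.
import numpy as np

def kl(p, q, eps=1e-12):
    # KL(p || q) over a discrete output distribution (e.g. next-token probs).
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy probe set: 3 prompts, 4-way output distributions.
before = np.array([[0.70, 0.10, 0.10, 0.10],
                   [0.60, 0.20, 0.10, 0.10],
                   [0.80, 0.10, 0.05, 0.05]])
after = np.array([[0.40, 0.30, 0.20, 0.10],    # drifted
                  [0.60, 0.20, 0.10, 0.10],    # unchanged
                  [0.75, 0.10, 0.10, 0.05]])   # mildly drifted

drift = [kl(p, q) for p, q in zip(before, after)]
worst = max(drift)
print("per-prompt drift:", [round(d, 4) for d in drift])
print("flag for re-alignment:", worst > 0.05)  # threshold is an arbitrary assumption
```

This only catches drift that shows up on the probe set, of course, which is exactly why the probe set needs to stay in-distribution for the deployments you care about.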
[1] Paul talks about training the ELK head alongside the model at every single step to prevent this exact failure mode from happening here.