How would we do this?
I have a bunch of takes on this. The most obvious somewhat-cheap thing to do is to greatly vary the persona training for each AI while aiming toward a broadly similar target, so that if the AIs end up misaligned, they end up with different misaligned preferences. E.g., we take every free parameter in constitutional AI and vary it, including regenerating all the relevant data. A more expensive option would be to swap to a structurally different training approach, possibly one from a different AI company. (And we should obviously use a different random seed everywhere.)
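To make this concrete, here's a minimal sketch of what "vary every free parameter" could look like per model generation; the parameter names and variant lists are invented for illustration, not any actual pipeline's config:

```python
import random

# Hypothetical free parameters of a constitutional-AI-style persona-training
# pipeline. The names are illustrative, not any lab's actual config.
CONSTITUTION_VARIANTS = ["constitution_a", "constitution_b", "constitution_c"]
PERSONA_STYLES = ["formal", "casual", "terse"]
CRITIQUE_PROMPT_SETS = ["critique_set_1", "critique_set_2"]

def sample_persona_training_config(generation: int) -> dict:
    """Sample a distinct persona-training config for each model generation.

    Same broad target (a helpful, harmless assistant), but every free
    parameter -- constitution wording, persona style, critique prompts,
    data-regeneration seed, RL seed -- is varied per generation.
    """
    rng = random.Random(f"persona-gen-{generation}")  # different seed per generation
    return {
        "constitution": rng.choice(CONSTITUTION_VARIANTS),
        "persona_style": rng.choice(PERSONA_STYLES),
        "critique_prompts": rng.choice(CRITIQUE_PROMPT_SETS),
        "data_regen_seed": rng.randrange(2**31),  # regenerate all persona data
        "rl_seed": rng.randrange(2**31),
    }

# Configs for three successive generations differ in every free parameter.
for gen in range(3):
    print(gen, sample_persona_training_config(gen))
```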
I think stronger versions of this can easily become very expensive, and even what I'm describing here isn't cheap and might be net risk-increasing, since it forces a bunch of changes that we haven't adapted to.
However, I think there is a somewhat different, much cheaper approach: train (versions of) AIs purely for the purpose of studying scheming, with no intention of deploying these systems, and intentionally make their training very different from that of the AIs we actually deploy. We'd try to make deals with these systems as well as detect misalignment in other ways. We can potentially afford big hits to capabilities and substantial increases in cost for these systems. If this requires training from scratch, we can do it at substantially below frontier scale, which might be acceptable.
A deal with one of these AIs should likely focus on gathering evidence of misalignment rather than on getting work out of it. However, it might also be helpful to make a deal with one (or several) of these AIs to monitor other AIs or to generally review their work.
Great idea.
Let’s say future post-training involves training in five different RL environments one after the other, but the order doesn’t matter for capabilities. It’s possible that the AI picks up its goals from the first environment and then plays the training game from there.
Then intentionally mix up the order for subsequent generations.
You could do this today with pre-training (e.g., randomly shuffling the data order each time), and maybe it’s already happening.
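A minimal sketch of what this per-generation reordering might look like (the environment names and shard counts are made up, and this assumes the order really doesn't matter for capabilities):

```python
import random

# Illustrative sketch: per-generation shuffling of the order of RL
# post-training environments and of pre-training data shards.
RL_ENVIRONMENTS = ["math", "swe", "computer_use", "writing", "agentic_browsing"]

def training_order_for_generation(generation: int):
    """Return a generation-specific ordering of RL envs and pretraining shards.

    By assumption the capability profile is order-insensitive, but the
    environments seen early may matter for which values the model picks up,
    so the order is deliberately changed each generation.
    """
    rng = random.Random(f"order-gen-{generation}")
    env_order = RL_ENVIRONMENTS.copy()
    rng.shuffle(env_order)
    shard_order = list(range(1000))  # pre-training data shards
    rng.shuffle(shard_order)
    return env_order, shard_order

envs_gen1, _ = training_order_for_generation(1)
envs_gen2, _ = training_order_for_generation(2)
print(envs_gen1)  # e.g. ['swe', 'writing', ...]
print(envs_gen2)  # a different order for the next generation
```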
(Tbc, I haven’t thought about whether this would work in practice, but on an inside view it seems worth thinking more about.)
I think this is basically a special case of changing the random seed, which probably already randomizes the env order.
Hmm, I think there could be other structural things.
Maybe you shift around different stages of post-training.
Maybe you realise that randomising the order still gives you the same values each time, so you purposefully over-sample certain types of environments or pre-training data early on. E.g., first pre-training heavily on math, then heavily on fiction, and reversing that for the next gen. Or first doing RL on lots of SWE envs, then on general computer-use envs, and reversing the order for the next gen.
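Something like this sketch, where the early curriculum is deliberately reversed between generations (the phase names are placeholders):

```python
# Illustrative sketch of deliberately reversing the early curriculum between
# generations rather than just shuffling it. Phase names are made up.
PRETRAIN_CURRICULA = {
    "gen_1": ["math_heavy", "fiction_heavy", "general_web"],
    "gen_2": ["fiction_heavy", "math_heavy", "general_web"],  # early phases reversed
}
RL_CURRICULA = {
    "gen_1": ["swe_envs", "computer_use_envs"],
    "gen_2": ["computer_use_envs", "swe_envs"],  # reversed for the next gen
}

def curriculum_for(generation: str):
    """Return (pre-training phases, RL phases) for a given generation."""
    return PRETRAIN_CURRICULA[generation], RL_CURRICULA[generation]

print(curriculum_for("gen_1"))
print(curriculum_for("gen_2"))
```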
You could potentially do experiments to determine what kinds of features of the data/envs influence the values of the AI, and then make sure that those specific features differ in the early stages of fine-tuning for different training runs.
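One rough shape such an experiment could take, assuming you have some way to train small probe models and an eval that scores the resulting values (`train_small_model` and `measure_values` below are hypothetical stand-ins, and data items are assumed to carry feature tags):

```python
# Rough sketch: train small probe models on early-stage data that differs in
# one tagged feature at a time, and measure how the resulting values shift.
FEATURES = ["persona_rich_fiction", "agentic_swe_envs", "reward_hacking_prone_envs"]

def value_distance(a: dict, b: dict) -> float:
    """L1 distance between two value-eval score dicts with matching keys."""
    return sum(abs(a[k] - b[k]) for k in a)

def value_sensitivity_experiment(base_data, train_small_model, measure_values):
    """Estimate how much each data/env feature shifts the trained model's values."""
    baseline = measure_values(train_small_model(base_data))
    sensitivity = {}
    for feature in FEATURES:
        # Drop items carrying this feature and retrain to see how values move.
        ablated = [item for item in base_data if feature not in item["tags"]]
        sensitivity[feature] = value_distance(
            measure_values(train_small_model(ablated)), baseline
        )
    # High-sensitivity features are the ones to deliberately vary in the early
    # stages of fine-tuning across different training runs.
    return sensitivity
```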