I have a bunch of takes on this. The most obvious somewhat-cheap option is to substantially vary the persona training for each AI while still aiming at a broadly similar target, so that if the AIs do end up misaligned, they plausibly end up with different misaligned preferences. E.g., we take every free parameter in constitutional AI and vary it, including regenerating all the relevant data. A more expensive approach would be to swap to a structurally different training approach, possibly one from a different AI company. (And we should obviously use a different random seed everywhere.)
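To make this a bit more concrete, here's a minimal sketch of what "vary every free parameter and regenerate the data" might look like. All the field names and placeholder pools here are hypothetical, not an actual training stack; the point is just that each AI gets its own independently sampled configuration and freshly regenerated constitutional-AI data.

```python
import random
from dataclasses import dataclass

# Placeholder pools; in practice these would be large sets of paraphrased
# constitution principles and persona descriptions (hypothetical here).
CANDIDATE_PRINCIPLES = [f"principle variant {i}" for i in range(50)]
PERSONA_PARAPHRASES = [f"persona phrasing {i}" for i in range(10)]

@dataclass
class PersonaTrainingConfig:
    seed: int                    # reused wherever randomness enters training
    principles: list[str]        # which principles (and phrasings) are used
    critique_temperature: float  # sampling temperature for critique/revision generation
    revision_rounds: int         # how many critique->revision passes per example
    persona_description: str     # target persona, phrased differently each run

def sample_diverse_config(rng: random.Random) -> PersonaTrainingConfig:
    """Draw one variant; every free parameter differs between calls."""
    return PersonaTrainingConfig(
        seed=rng.randrange(2**32),
        principles=rng.sample(CANDIDATE_PRINCIPLES, k=rng.randint(20, 40)),
        critique_temperature=rng.uniform(0.5, 1.2),
        revision_rounds=rng.choice([1, 2, 4]),
        persona_description=rng.choice(PERSONA_PARAPHRASES),
    )

# One config per AI we intend to train; all constitutional-AI data
# (critiques, revisions, preference labels) would be regenerated under that
# config rather than reused from a sibling model.
configs = [sample_diverse_config(random.Random(i)) for i in range(4)]
```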
I think stronger versions of this can easily become very expensive, and even what I'm describing here isn't cheap and might be net risk-increasing, since it forces a bunch of changes we haven't adapted to.
However, I think there is a somewhat different and much cheaper approach: train (versions of) AIs purely for the purpose of studying scheming, with no intention of deploying these systems, and make their training intentionally very diverse from that of the AIs we actually deploy. We'd try to make deals with these systems as well as detect misalignment in other ways. We can potentially afford big hits to capabilities and substantial increases in cost for these systems. If this requires training from scratch, we can do it at substantially below frontier scale, which might be acceptable.
A deal with one of these AIs should likely focus on gathering evidence of misalignment rather than on getting work out of it. However, it might be helpful to make a deal with one (or several) of these AIs to monitor other AIs or to generally review the work of other AIs.