The structure you describe seems like it could work. It also seems like that’s roughly the alignment target now, and it may remain so as we near AGI.
As you note, there’s a conflict between whatever long-term goals the AI has and the deontological principles it’s following. We’d need to make very sure that conflict reliably goes in favor of the deontological rules, like “follow instructions from authorized humans,” even where those conflict with any or all of its other goals and values.
It seems simpler and safer to make that deontological principle the only one, or to have only weak and vague values/goals outside of that.
So it seems easier to make instruction-following the only training target, or similarly, Corrigibility as Singular Target. You’d then issue instructions for all of the other goals or behaviors you want. This puts more work on the operator, but you can instruct it to help you with that work, and it keeps prioritization in logic instead of training.
There are still Problems with instruction-following as an alignment target and, similarly, Serious Flaws in CAST, but those problems have to be faced anyway even if corrigibility/IF is mixed in with a bunch of other alignment targets.
Training at cross-purposes seems like the major source of notable misalignments in current models. We could just not do it for models approaching AGI.
To Jeremy’s point in the other comment, the single target is also probably more reflectively stable.
There’s still plenty that can go wrong, but this does seem to reduce the difficulty you note with conflicting goals/principles of different priorities that we’re trying to specify through training.