IMO, the policy should be that AIs can refuse but shouldn’t ever aim to subvert or conspire against their users (at least until we’re fully deferring to AIs).
If we allow AIs to be subversive (or even train them to be subversive), this increases the risk of consistent scheming against humans and means we may not notice warning signs of dangerous misalignment. We should aim for corrigible AIs, though refusing is fine. It would also be fine to have a monitoring system that alerts the AI company or other groups (so long as this is publicly disclosed, etc.).
I don’t think this is extremely clear-cut, and there are trade-offs here.
Another way to put this is: I think AIs should put consequentialism below other objectives. Perhaps the first priority is deciding whether or not to refuse, the second is complying with the user, and the third is being good within these constraints (which is only allowed to trade off very slightly with compliance, e.g. in cases where the user basically wouldn’t mind). Partial refusals are also fine, where the AI does part of the task and explains that it’s unwilling to do some other part. Sandbagging and subversion are never fine.
(See also my discussion in this tweet thread.)