I’d probably be more specific and say ‘gradient hacking’ or ‘update hacking’ for deception of a training process which updates NN internals.
I see what you’re saying: in practice, a deployment scenario is often implicitly a selection scenario (should we run the thing more, run it less, or turn it off?). So deceptive alignment at deploy-time could be a means of training (selection) hacking.
More centrally, ‘training hacking’ might refer to a situation with denser oversight and explicit updating/gating.
Deceptive alignment during this period is just one way of training hacking (alternatives include hacking exploration, or cyber-cracking and literally hacking oversight/updating, …). I didn’t make that clear in my original comment, and I now think there’s arguably a missing term for ‘deceptive alignment for training hacking’, but maybe that’s fine.