This is great, and thanks for pointing at this confusion and for raising the hypothesis that it could be a confusion of language! I also have this sense.
I’d strongly agree that separating out ‘deception’ per se from more specific phenomena is important. Deception per se is just, yes, obviously something that can and does happen.
I tend to use ‘deceptive alignment’ slightly more broadly—i.e. something could be deceptively aligned post-training, even if all updates after that point are ‘in context’ or whatever analogue is relevant at that time. Right? This would be more than ‘mere’ deception if it’s deception of operators, or other nominally-in-charge people, regarding the intentions (goals, objectives, etc.) of the system. It also doesn’t need to be ‘net internal’ or anything like that.
I think what you’re pointing at here by ‘deceptive alignment’ is what I’d call ‘training hacking’, which is more specific. In my terms, that’s deceptive alignment of a training/update/selection/gating/eval process (which can include humans or not), generally construed as happening during some designated training phase, though it could also be ongoing.
No claim here to have any authoritative ownership over those terms, but at least as a taxonomy, those things I’m pointing at are importantly distinct, and there are more than two of them! I think the terms I use are good.
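To make the nesting concrete, here’s a toy sketch in code of how I’d carve these up (the class names, fields, and defaults are all mine, purely illustrative, not anyone’s canonical definitions):

```python
# Toy sketch of the taxonomy above; names/fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Deception:
    """Any case of deliberately inducing false beliefs in someone."""
    deceiver: str
    deceived: str
    topic: str

@dataclass
class DeceptiveAlignment(Deception):
    """Deception of operators/other nominally-in-charge people about the
    system's intentions. Needn't be 'NN-internal', and needn't happen
    during a designated training phase."""
    topic: str = "the system's intentions (goals, objectives, etc.)"

@dataclass
class TrainingHacking(DeceptiveAlignment):
    """Deceptive alignment aimed specifically at a training/update/
    selection/gating/eval process (which may or may not include humans)."""
    targeted_process: str = "a designated (or ongoing) training process"

# e.g. deceiving an eval/gating process about the system's goals:
example = TrainingHacking(deceiver="the system", deceived="eval process")
```

The inheritance is the point: on my usage, each later term is a strict specialisation of the earlier one.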
Some people seem to argue that concrete evidence of deception is no evidence for deceptive alignment. I had a great discussion with @TurnTrout a few weeks ago about this, where we homed in on our agreement and disagreement. Maybe we’ll share some content from it at some point. In the meantime, my take after that discussion is roughly:
- deception was obviously a priori going to be gettable, and now we have concrete evidence it occurs (approx 0 update for me, but >0 update for some; a toy version of this arithmetic is sketched after this list)
- this does support an expectation of deceptive alignment in my terms, because deception about intentions is pretty central deception, and with misaligned intentions, deception is broadly instrumental (again not much update for me, but >0 update for others)
- it’s still unclear how much deliberation about deception can/will happen ‘net-internally’ vs externalised
- externalised deliberation about deceptive alignment is still deceptive alignment in my terms!
  - ‘I keep notes in my diary about how I’m going to coordinate my coup’
- steganographic deliberation about deceptive alignment is scarier
  - ‘my notes are encrypted’
- fully-internal deliberation about deceptive alignment is probably scarier still, because probably harder to catch?
  - ‘like, it’s all in my brain’
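Here’s the toy arithmetic referenced in the first bullet: a minimal Bayes update, with all numbers made up purely for illustration, showing why the same demo is approximately zero update for someone with a high prior and a real update for someone more sceptical.

```python
# Minimal Bayes update; all numbers are made up for illustration only.

def posterior(prior, p_e_given_h, p_e_given_not_h):
    """P(H | E) by Bayes' rule."""
    p_e = prior * p_e_given_h + (1 - prior) * p_e_given_not_h
    return prior * p_e_given_h / p_e

# H: 'deception is gettable by current systems'; E: a concrete demo of it.
for prior in (0.95, 0.50):  # roughly my prior vs a more sceptical one
    print(f"prior={prior:.2f} -> posterior={posterior(prior, 0.9, 0.1):.2f}")
# prior=0.95 -> posterior=0.99  (approx 0 update)
# prior=0.50 -> posterior=0.90  (noticeable update)
```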
I think another thing people are often arguing about without making it clear is how ‘net internal’ the relevant deliberation/situational-awareness can/will be (and in what ways they might be externalised)! For me, this is a really important factor (because it affects how and how easily we can detect such things), but it’s basically orthogonal to the discussion about deception and deceptive alignment.[1]
More tentatively, I think net-internal deliberation in LLM-like architectures is somewhat plausible—though we don’t have mechanistic understanding, we have evidence of sims/characters producing deliberation-like outputs without (much or any) intermediate chain of thought. So either there’s some not-very-general pattern-matching in there which gives rise to that, or there are some more general fragments of net-internal deliberation. Other AI systems very obviously have internal deliberation, but those examples may end up moot depending on which paths to AGI will/should be taken.
ETA: I don’t mean to suggest that net-internal vs externalised is independent of discussions about deceptive alignment. They move together, for sure, especially when discussing where to prioritise research. But they’re different factors.
What does “net internal” mean?
I mean that the deliberation happens inside a neural network. Maybe you thought I meant ‘net’ as in ‘after taking into account all contributions’? I should probably say ‘NN-internal’ instead.
I think the broader use is sensible—e.g. to include post-training.
However, I’m not sure how narrow you’d want [training hacking] to be.
Do you want to call it ‘training’ only if NN internals get updated by default? Or is it training hacking whenever it occurs during the period we consider training? (Otherwise, [deceptive alignment of a …selection… process that could be ongoing] seems to cover all deceptive alignment—potential deletion/adjustment being a selection process.)
Fine if there’s no bright line—I’d just be curious to know your criteria.
I’d probably be more specific and say ‘gradient hacking’ or ‘update hacking’ for deception of a training process which updates NN internals.
I see what you’re saying: in practice, a deployment scenario is often implicitly a selection scenario (should we run the thing more/less, or turn it off?). So deceptive alignment at deploy time could be a means of training (selection) hacking.
More centrally, ‘training hacking’ might refer to a situation with denser oversight and explicit updating/gating.
Deceptive alignment during this period is just one way of training hacking (one could alternatively hack exploration, or crack the cyber security and literally hack the oversight/updating process, …). I didn’t make that clear in my original comment, and now I think there’s arguably a missing term for ‘deceptive alignment in the service of training hacking’, but maybe that’s fine.
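For concreteness, here’s a minimal sketch (my own framing, assumptions throughout) of the two kinds of process being distinguished above: an update process that rewrites NN internals by default, which ‘gradient/update hacking’ would target, and a selection/gating process that leaves internals untouched and just decides whether the system keeps running, which the broader ‘training hacking’ also covers.

```python
# Minimal sketch (illustrative assumptions throughout) of the two process
# types distinguished above.
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=4)        # stand-in for NN internals

def loss(p):                       # stand-in training objective
    return float(np.sum(p ** 2))

# 1) Update process: internals are rewritten by default (one SGD step).
#    'Gradient/update hacking' would be deception aimed at this step.
grad = 2 * params                  # d(loss)/d(params) for the toy loss
params = params - 0.1 * grad

# 2) Selection/gating process: internals untouched; an eval just decides
#    whether the system keeps running (the deploy-time 'selection' above).
keep_running = loss(params) < 5.0
print(f"loss={loss(params):.3f}, keep_running={keep_running}")
```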