Some people seem to argue that concrete evidence of deception is no evidence for deceptive alignment. I had a great discussion with @TurnTrout a few weeks ago about this, where we homed in on our agreement and disagreement here. Maybe we’ll share some content from it at some point. In the meantime, my take after that discussion is roughly:
- deception was obviously, a priori, going to be gettable, and now we have concrete evidence that it occurs (approx 0 update for me, but >0 update for some)
- this does support an expectation of deceptive alignment in my terms, because deception about intentions is a pretty central case of deception, and with misaligned intentions, deception is broadly instrumental (again, not much update for me, but >0 update for others)
- it’s still unclear how much deliberation about deception can/will happen ‘net-internally’ vs externalised
  - externalised deliberation about deceptive alignment is still deceptive alignment in my terms!
    - I keep notes in my diary about how I’m going to coordinate my coup
  - steganographic deliberation about deceptive alignment is scarier
    - my notes look like innocuous diary entries, but they hide the plan
  - fully-internal deliberation about deceptive alignment is probably scarier still, because it’s probably harder to catch?
    - like, it’s all in my brain
I think another thing people are often arguing about without making it clear is how ‘net-internal’ the relevant deliberation/situational-awareness can/will be (and in what ways it might be externalised)! For me, this is a really important factor (because it affects how, and how easily, we can detect such things), but it’s basically orthogonal to the discussion about deception and deceptive alignment.[1]
More tentatively, I think net-internal deliberation in LLM-like architectures is somewhat plausible. Though we don’t have a mechanistic understanding, we have evidence of sims/characters producing deliberation-like outputs without (much or any) intermediate chain of thought. So either there’s some not-very-general pattern-matching in there which gives rise to that, or there are some more general fragments of net-internal deliberation. Other AI systems very obviously have internal deliberation, but these might end up moot depending on which paths to AGI will/should be taken.
ETA: I don’t mean to suggest that net-internal vs externalised is independent of discussions about deceptive alignment. They move together, for sure, especially when discussing where to prioritise research. But they’re different factors.
> What does “net internal” mean?
I mean that the deliberation happens inside a neural network. Maybe you thought I meant ‘net’ as in ‘after taking into account all contributions’? I should probably say ‘NN-internal’ instead.