Perfectly reasonable for you not to reply, as you said, though I think it’s worthwhile for me to at least clarify one point:
I don’t think a competent-at-human-level system doesn’t know about deception, and I don’t think a competent-at-below-human-level system can cause extinction-level catastrophe.
A model which simply “doesn’t know about deception” isn’t the only (or even the primary) situation I’m imagining. The example I gave in the post was a situation in which the model hadn’t yet “figured out that deception is a good strategy,” which could be:
because it didn’t know about deception,
because it thought that deception wouldn’t work,
because it thought it was fully aligned,
because the training process constrained its thoughts such that it wasn’t able to even think about deception,
or some other reason. I don’t necessarily want to take a stand on which of these possibilities is most likely, as I think that will vary depending on the training process. Rather, I want to point to the general problem that many possibilities of this sort exist, such that, especially if you expect adversaries in the environment, I think it will be quite difficult to eliminate all of them.
Yes, good point. I’d make the same claim with “doesn’t know about deception” replaced by “hasn’t figured out that deception is a good strategy (assuming deception is a good strategy)”.
Humans who believe in God still haven’t concluded that deception is a good strategy, even though they have evidence about the non-omnipotence and non-omnibenevolence of God similar to the evidence an AI might have about its creators.
(Though maybe I’m wrong about this claim; maybe if we asked some believers, they would tell us “yeah, I am just being good to make it to the next life, where hopefully I’ll have a little more power and freedom and can go buck wild.”)