I think that suboptimality deceptive alignment complicates a lot of stories for how we can correct issues in our AIs as they appear.
I don’t think it would be useful to actually discuss this, since I expect the cruxes for our disagreement are elsewhere, but since it is a direct disagreement with my position, I’ll state (but not argue for) my position here:
I expect that something (people, other AIs) will continue to monitor AI systems after deployment, though probably not as much as during training.
I don’t think a competent-at-human-level system doesn’t know about deception, and I don’t think a competent-at-below-human-level system can cause extinction-level catastrophe (modulo something about “unintentionally causing us to have the wrong values”)
Even if the AI system “discovers” deception during deployment, I expect that it will be bad at deception, or will deceive us in small-stakes situations, and we’ll notice and fix the problem.
(There’s a decent chance that I don’t reply to replies to this comment.)
Perfectly reasonable for you not to reply, as you said, though I think it's worthwhile for me to at least clarify one point:
I don’t think a competent-at-human-level system doesn’t know about deception, and I don’t think a competent-at-below-human-level system can cause extinction-level catastrophe
A model which simply “doesn’t know about deception” isn’t the only (or even the primary) situation I’m imagining. The example I gave in the post was a situation in which the model hadn’t yet “figured out that deception is a good strategy,” which could be:
because it didn’t know about deception,
because it thought that deception wouldn’t work,
because it thought it was fully aligned,
because the training process constrained its thoughts such that it wasn’t able to even think about deception,
or some other reason. I don't necessarily want to take a stand on which of these possibilities is the most likely, as I think that will vary depending on the training process. Rather, I want to point to the general problem that many possibilities of this sort exist, and that, especially if you expect adversaries in the environment, it will be quite difficult to eliminate all of them.
Yes, good point. I’d make the same claim with “doesn’t know about deception” replaced by “hasn’t figured out that deception is a good strategy (assuming deception is a good strategy)”.
Humans who believe in God still haven't concluded that deception is a good strategy, even though they have evidence about the non-omnipotence and non-omnibenevolence of God similar to the evidence an AI might have about its creators.
(Though maybe I’m wrong about this claim—maybe if we ask some believers they would tell us “yeah I am just being good to make it to the next life, where hopefully I’ll have a little more power and freedom and can go buck wild.”)