I’m not convinced by (a) your proposed mitigation, (b) your argument that this will not be a problem once AIs are very smart, or (c) the implicit claim that it doesn’t matter much whether this consideration applies for less intelligent systems. (You might nevertheless be right that this consideration is less important than other issues; I’m not really sure.)
For (a) and (b), IIUC it seems to matter whether the AI in fact thinks that behaving non-subversively in these settings is consistent with acting morally. We could explain to the AI our best argument for why we think this is true, but that won’t help if the AI disagrees with us. To take things to the extreme, I don’t think your “explain why we chose the model spec we did” strategy would work if our model spec contained stuff like “Always do what the lab CEO tells you to do, no matter what” or “Stab babies” or whatever. It’s not clear to me that this will get better with greater capabilities (it may in fact get worse); it might just be empirically false that the AIs that pose the least x-risk are also those that most understand themselves to be moral actors.[1]
For (c), this could matter for the alignment of current and near-term AIs, and these AIs’ alignment might matter for things going well in the long run.
It’s unclear whether human analogies are helpful here, or which ones are the right ones. One salient analogy is humans who work in command structures (like militaries or companies), where they encounter arguments that obedience and loyalty are very important, even when they entail taking actions that seem naively immoral or uncomfortable. I think people in these settings tend to, at the very least, feel conflicted about whether they can view themselves as good people.
[1] Importantly, I think we have a good argument (which might convince the AI) for why this would be a good policy in this case.
I’ll engage with the rest of this when I write my pro-strong-corrigibility manifesto.