In fact, I am more worried about partial success than total failure in aligning AIs. In particular, I am concerned that we will end up in the “uncanny valley,” where we succeed in aligning AIs to a sufficient level for deployment, but then discover too late some “edge cases” in the real world that have a large negative impact.
I think it’s pretty plausible that AIs which are obviously pretty misaligned are deployed (e.g. we’ve caught them trying to escape or sabotage research, or maybe we’ve caught an earlier model doing this and we don’t have any strong reason to think we’ve resolved this in the current system). This is made more likely by an aggressive race to the bottom (possibly caused by an arms race over AI, as you discuss). Part of my perspective here is that misalignment issues might be very difficult to resolve in time due to rapid AI progress and the difficulty of studying misalignment (for instance, because the very AIs you’re studying might not want to be studied!).
I also think it’s plausible that very capable AIs will end up being schemers/alignment-fakers which look very aligned (sufficiently so to pass your deployment checks) but have misaligned long-run aims. And even if you found evidence of this in earlier AIs, this wouldn’t suffice to prevent AIs where we haven’t confidently ruled this out from being deployed (see the prior paragraph). I also think it’s plausible that you won’t see smoking-gun evidence of this before it’s too late, as I discuss in a prior post of mine.
The issues I’m worried about don’t feel like edge cases or an uncanny valley to me (though I suppose you could think of alignment faking as an uncanny valley).
My understanding is that you disagree about the possibility of relatively worst-case scenarios with respect to scheming, and think that people would radically change their approach if we had clear evidence of (relatively consistent, cross-context) scheming without a strong resolution of this problem that generalizes to more capable AIs. I hope you’re right.
To be clear, I agree that it would be better if AIs are obviously (seriously) misaligned than if they are instead scheming undetected.
I am not sure how much we actually disagree, but let me add some more clarifications.
My mental model is what happened with software security. It was never completely ignored, but for a long time I think many companies had the mental model of “do the minimal security work so that it’s not obviously broken.” For example, it took some huge scandals for Microsoft to make security its highest priority (see Bill Gates’s 2002 memo and retrospective). Ultimately the larger companies changed their mindset, but I would like us not to repeat this history with AI, especially given that it is likely to progress faster.
At the moment we deploy AIs in settings (chat, or as coding agents for discrete tasks) where their very output is fed to a human who is ultimately responsible for the results. So this setting allows us to be quite tolerant of alignment failures. But I don’t think this will last for long.
I do think that alignment faking is a form of “uncanny valley”. There are some subtleties with the examples you mention, since it is debatable whether we have “caught” models doing some bad thing X or “entrapped” them into doing it. This is why I like the sycophancy incident: (1) the bad thing occurred in the wild, and (2) it points to a deeper issue, which is the mixed objectives that AIs have, in particular the objective to satisfy the user as well as the objective to follow the model spec / policies.
But I agree that we should get to the point where it is impossible to even “entrap” the models into exhibiting misaligned behavior. Again, in cybersecurity there have been many examples of attacks that were initially dismissed by practitioners as being too “academic” but were eventually extended to realistic settings. I think these days people understand that even such “academic attacks” point to real weaknesses, since (as people often say in cryptography) “attacks only get better”.
I think if the attitude in AI was “there can’t be any even slightly plausible routes to misalignment-related catastrophe” and this was consistently upheld in a reasonable way, that would address my concerns. (So, e.g., by the time we’re deploying AIs which could cause huge problems if they were conspiring against us, there needs to be a robust solution to alignment faking / scheming which has broad consensus among researchers in the area.)
I don’t expect this, because we seem very far from success and AI progress might happen rapidly. (Though we might end up here after eating a bunch of ex-ante risk for some period while using AIs to do safety work.)
Different people might have different interpretations of “slightly plausible”, but I agree we are very far right now and need to step up our game!
Agreed; maybe the relevant operationalization would be “broad consensus”, and then we could outline a relevant group of researchers.