I am not sure how much we actually disagree, but let me add some more clarifications.
My mental model is what happened with software security. It was never completely ignored, but for a long time I think many companies had the mental model of “do the minimal security work so that it’s not obviously broken.” For example, it took some huge scandals for Microsoft to make security its highest priority (see Bill Gates’ 2002 memo and retrospective). Ultimately the larger companies changed their mindset, but I would like us not to repeat this history with AI, especially given that it is likely to progress faster.
At the moment we deploy AIs in settings (chat, or as coding agents for discrete tasks) where their very output is fed to a human who is ultimately responsible for the results. So this setting allows us to be quite tolerant of alignment failures. But I don’t think this will last for long.
I do think that alignment faking is a form of “uncanny valley”. There are some subtleties with the examples you mention, since it is debatable whether we have “caught” models doing bad thing X or “entrapped” them into doing so. This is why I like the sycophancy incident, since it is one where (1) the bad thing occurred in the wild, and (2) it points to a deeper issue, namely the mixed objectives that AIs have, and in particular the objective to satisfy the user as well as the objective to follow the model spec / policies.
But, I agree that we should get to the point where it is impossible to even “entrap” the models into exhibiting misaligned behavior. Again, in cybersecurity there have been many examples of attacks that were initially dismissed by practitioners as too “academic” but were eventually extended to realistic settings. I think these days people understand that even such “academic attacks” point to real weaknesses, since (as people often say in cryptography) “attacks only get better”.
I think if the attitude in AI was “there can’t be any even slightly plausible routes to misalignment-related catastrophe” and this was consistently upheld in a reasonable way, that would address my concerns. (So, e.g., by the time we’re deploying AIs which could cause huge problems if they were conspiring against us, there needs to be a robust solution to alignment faking / scheming which has broad consensus among researchers in the area.)
I don’t expect this because we seem very far from success and this might happen rapidly. (Though we might end up here after eating a bunch of ex-ante risk for some period while using AIs to do safety work.)
Different people might have different interpretations of “slightly plausible”, but I agree we are very far right now and need to step up our game!
Agreed, maybe the relevant operationalization would be “broad consensus”, and then we could outline a relevant group of researchers.