I’m not sure that I understand your definition of suboptimality deceptive alignment correctly. My current (probably wrong) interpretation of it is: “a model has a suboptimality deceptive alignment problem if it does not currently have a deceptive alignment problem but will plausibly have one in the future”.
I suspect this interpretation is wrong; could you point out how it differs from the correct one?
If my interpretation is roughly correct, I suggest naming this concept in a way that would not imply that it is a special case of deceptive alignment. Maybe “prone to deceptive alignment”?
I think it’s more like: the model is optimizing for some misaligned mesa-objective, deception would be a better way to achieve that mesa-objective, but for some reason (see examples here) it isn’t using deception yet. That’s a more specific version of the thing you said.