I think it’s more like this: the model is optimizing for some misaligned mesa-objective, deception would be a better way to achieve that mesa-objective, but for some reason (see examples here) it isn’t using deception yet. Which is a more specific version of the thing you said.