I’m not sure that I understand your definition of suboptimality deceptive alignment correctly. My current (probably wrong) interpretation of it is: “a model has a suboptimality deceptive alignment problem if it does not currently have a deceptive alignment problem but will plausibly have one in the future”.
I suspect this interpretation is wrong; could you point out how it differs from the correct one?
If my interpretation is roughly correct, I suggest naming this concept in a way that would not imply that it is a special case of deceptive alignment. Maybe “prone to deceptive alignment”?
I think it’s more like: the model is optimizing for some misaligned mesa-objective, deception would be a better way to achieve that mesa-objective, but for some reason (see examples here) it isn’t using deception yet. That’s a more specific version of the thing you said.