I think it’s more like this: the model is optimizing for some misaligned mesa-objective, deception would be a better way to achieve that mesa-objective, but for some reason (see examples here) it isn’t using deception yet. Which is a more specific version of the thing you said.