Thanks for writing this up clearly! I don’t agree that gradient descent favors deception. In fact, I’ve made detailed, object-level arguments for the opposite. To become aligned, the model needs to understand the base goal and point at it. To become deceptively aligned, the model needs to develop a long-run goal and situational awareness before, or around the same time as, it understands the base goal. I argue that this ordering requirement makes deceptive alignment much harder to achieve and much less likely to arise from gradient descent. I’d love to hear what you think of my arguments!