My attempt at a one-sentence summary of the core intuition behind this proposal: if you can be sure your model isn’t optimizing to deceive you, then you can tell relatively easily whether it’s optimizing for something you don’t want, simply by observing whether it seems to be pursuing something obviously different from what you want during training, because it’s much harder to slip under the radar by getting really lucky than by intentionally trying to.