evhub comments on “Inner Alignment Failures” Which Are Actually Outer Alignment Failures

evhub 4 Nov 2020 21:26 UTC
LW: 2 AF: 2
I think it’s possible to formulate definitions of inner alignment that don’t rely on mesa-optimizers but still exclude capabilities (e.g. by using concepts like 2-D robustness), though I agree that it gets a lot trickier when you try to do that. That difficulty was a major reason we went with the very restrictive mechanistic definition that we ended up using in Risks.