‘You get what you measure’ (outer alignment failure) and Mesa optimisers (inner failure) are both potential gap fillers that explain why specifically the alignment/capability divergence initially arises. Whether it’s one or the other, I think the overall point is still that there is this gap in the classic arguments that allows for a (possibly quite high) chance of ‘alignment by default’, for the reasons you give, but there are at least 2 plausible mechanisms that fill this gap.
And then I suppose my broader point would be that we should present:
Classic Arguments —> objections to them (capability and alignment often go together, could get alignment by default) —> specific causal mechanisms for misalignment
‘You get what you measure’ (outer alignment failure) and Mesa optimisers (inner failure) are both potential gap fillers that explain why specifically the alignment/capability divergence initially arises. Whether it’s one or the other, I think the overall point is still that there is this gap in the classic arguments that allows for a (possibly quite high) chance of ‘alignment by default’, for the reasons you give, but there are at least 2 plausible mechanisms that fill this gap. And then I suppose my broader point would be that we should present:
Classic Arguments —> objections to them (capability and alignment often go together, could get alignment by default) —> specific causal mechanisms for misalignment