I’m having trouble coming up with a concrete scenario that this defends against. The model needs to be misaligned enough that we can’t trust it with alignment, smart enough to develop fake alignment tools that fool human overseers while actually aligning the resulting model to its own goals rather than ours, yet not smart enough to realize when a problem it’s solving is a crucial step in the true alignment project? I expect alignment to have several irreducibly hard parts, and I expect those parts to be characteristic of alignment in particular, so this doesn’t seem like a coherent combination of capabilities to me.