For a provably aligned (or probably aligned) system you need a formal specification of alignment. Do you have something in mind for that? This could be a major difficulty.
But maybe you only want to “prove” inner alignment and assume that you already have a goal function that captures outer alignment, in which case defining alignment is probably easier.
Correct, I’m imagining these being solved separately.