Inner misalignment is a story for why one might expect capable but misaligned out-of-distribution behaviour, which is what’s actually bad. Model-checking could rule that out entirely (relative to the formal specification)— whether it’s “inner misalignment” or “goal misgeneralization” or “deceptive alignment” or “demons in Solmonoff induction” or whatever kind of story might explain such output. Formal verification is qualitatively different from the usual game of debugging whack-a-mole that software engineers play to get software to behave acceptably.
Inner misalignment is a story for why one might expect capable but misaligned out-of-distribution behaviour, which is what’s actually bad. Model-checking could rule that out entirely (relative to the formal specification)— whether it’s “inner misalignment” or “goal misgeneralization” or “deceptive alignment” or “demons in Solmonoff induction” or whatever kind of story might explain such output. Formal verification is qualitatively different from the usual game of debugging whack-a-mole that software engineers play to get software to behave acceptably.