[Question] ELI5 Why isn’t alignment *easier* as models get stronger?

Epistemic status: straw-man; just wondering what the strongest counterarguments to this are.

It seems obvious to me that stronger models are easier to align. A simple proof:

  1. It is always possible to get a weaker model out of a stronger model (for example, by corrupting n% of its inputs/outputs; see the sketch after this list).

  2. Therefore, if it is possible to align a weak model, it is at least as easy to align a strong model (in the worst case, degrade the strong model into the weak one and align that).

  3. It is unlikely to be exactly as hard to align weak/strong models.

  4. Therefore it is easier to align stronger models.

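To make step 1 concrete, here is a minimal Python sketch of the construction, assuming the "model" is just a prompt-to-text callable; the names (`degrade`, `fallback`, `strong_model`) and the corruption scheme (randomly dropping outputs) are illustrative choices of mine, not any standard method.

```python
import random

def degrade(model, n_percent, fallback=lambda prompt: ""):
    """Wrap a callable `model` (prompt -> response) so that roughly
    n_percent of calls return a corrupted output instead of the real one."""
    def weaker_model(prompt):
        if random.random() < n_percent / 100:
            return fallback(prompt)  # corrupted output (here: an empty string)
        return model(prompt)         # pass-through of the stronger model's output
    return weaker_model

# Toy usage: a stand-in "strong model" and a 30%-corrupted copy of it.
def strong_model(prompt):
    return f"thoughtful answer to: {prompt}"

weak_model = degrade(strong_model, n_percent=30)
print(weak_model("How would you align yourself?"))
```
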
(I have a list of counterarguments written down; I am interested to see if anyone suggests a counterargument better than the ones on my list.)