I think I’m quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum—HCH—still has benign failures, but I nevertheless want to argue that it’s aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that excludes benign failures, since otherwise you can’t really consider HCH aligned (and thus can’t consider imitative amplification outer aligned at optimum).
I think I’m quite happy even if the optimal model is just trying to do what we want. With imitative amplification, the true optimum—HCH—still has benign failures, but I nevertheless want to argue that it’s aligned. In fact, I think this post really only makes sense if you adopt a definition of alignment that excludes benign failures, since otherwise you can’t really consider HCH aligned (and thus can’t consider imitative amplification outer aligned at optimum).