Thank you so much for writing this! I’ve been confused about this terminology for a while and I really like your reframing.
An additional terminological point that I think would be good to pin down is what people mean when they refer to “inner alignment” failures. As you allude to, my impression is that some people use the term for objective robustness failures broadly, whereas others (e.g. Evan) use it for failures that involve mesa optimization. There is then additional confusion around whether we should expect “inner alignment” failures that don’t involve mesa optimization to be catastrophic and, relatedly, around whether humans count as mesa optimizers.
I think I’d advocate for letting “inner alignment” failures refer to objective robustness failures broadly, talking about “mesa optimization failures” as such, and then leaving open the question of whether there are problematic inner alignment failures that aren’t related to mesa optimization.