A nitpick: maybe tweak the introduction a bit to make clear that it’s not a definition of “A deceptively aligned model”. (As a definition, what you’ve written describes any form of proxy alignment; deceptive alignment is more specific: there’s no ‘Perhaps’.)