Adding this comment based on a discussion on this post outside of LessWrong.
The misalignment vs. failure to align distinction may not be the most useful framing for all readers. A more direct question to consider is:
“Can the misaligned model stay deployed and acquire power?”
This framing clarifies the core argument: many x-risk evaluation papers do not adequately address this question. In fact, capability evaluations are often more informative for this purpose. In contrast, repetitive papers stress-testing alignment tend to produce insights equivalent to discovering new jailbreaks, which offer limited value when current models’ alignment is already known to be brittle.