maybe too late, but here are some thoughts (TL;DR: out-of-distribution prompt-based stress tests, and maybe some fancy SDF stuff): https://www.lesswrong.com/posts/RQadLjnmBZtvg7p8W/on-meta-level-adversarial-evaluations-of-white-box-alignment