Another work on fuzzing is Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals by @joshc, @kh4dien, and Severin Field. Fuzzing was the most effective of the strategies they tested for identifying which of a pair of LLMs had been fine-tuned to be misaligned, given access only to inputs on which both models produce the same outputs.