Curated. This was one of the more interesting results from the alignment scene in a while.
I did like Martin Randall’s comment distinguishing “alignment” from “harmless” in the Helpful/Harmless/Honest sense (i.e. the particular flavor of ‘harmlessness’ that got trained into the AI). I don’t know whether Martin’s particular articulation is correct for what’s going on here, but in general it seems important to track that just because we’ve identified some kind of vector, that doesn’t mean we necessarily understand what that vector means. (I also liked that Martin gave some concrete predictions implied by his model.)