James Diacoumis comments on Will Any Crap Cause Emergent Misalignment?

James Diacoumis 28 Aug 2025 11:34 UTC
4 points
Perhaps this is technically tapping into human norms like “don’t randomly bring up poo in conversation” but if so, that’s unbelievably vague.
I think this explanation is likely correct on some level.
I made a post here which goes into more detail, but the core idea is that there's no “clean” separation between normative domains such as the aesthetic, the moral and the social. The model has to learn about all of them through a single loss function, so everything gets tangled up.