I do think it makes sense, although the distinction between the explicit-instruction cases isn't clear to me.
My main suggestion is that there’s previous work worth being aware of if you’re not already:
‘Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models’
nostalgebraist’s critique of ‘Agentic Misalignment’ as too story-like and fake. Relatedly, nostalgebraist’s ‘The Void’, which considers the narrative properties of LLM ‘assistant’ characters.
‘Simulators’ was the first look at LLMs as fundamentally narrative, and other work from janus and colleagues (largely on Twitter) has continued in that direction.
I think the experimental approach you propose is worth refining in light of past work and then moving forward with.