I do think it makes sense, although the distinction between the explicit-instruction cases isn't clear to me.
My main suggestion is that there’s previous work worth being aware of if you’re not already:
‘Self-Fulfilling Misalignment Data Might Be Poisoning Our AI Models’
nostalgebraist’s critique of ‘Agentic Misalignment’ as too story-like and fake. Relatedly, nostalgebraist’s ‘The Void’, which considers the narrative properties of LLM ‘assistant’ characters.
‘Simulators’ was the first look at LLMs as fundamentally narrative, and other work from janus and colleagues (largely on Twitter) has continued in that direction.
I think the experimental approach you propose is worth refining in light of past work and then moving forward with.