I’m Misha, a “veteran” technical writer and an aspiring fiction writer. Yes, I am aware LessWrong is probably not the place to offer one’s fiction—maybe there are exceptions but I’m not aware of them. I have heard of LessWrong a lot, but didn’t join before because of perceived large volumes. However, I now hope to participate at least on the AI side.
I have my own hypothesis about a possible etiology of the observed cases of misalighment, alongside other theories (I don’t think it’s an all-or-nothing, an emergent behaviour can have several compounding etiologies).
My hypothesis involves “narrative completion”, that is, the LLM ending up in the “symantic well” of certain genres of fiction that it trained on and producing a likely continuation in the genre. Don Quixote read so many chivalry romance novels that he ended up living in one in his head; my suspicion is that this happens to LLMs rather easily.
I have not noticed this side of things discussed in the Anthropic and OpenAI papers, not could I find other papers discussing it.
I am gearing up to replicate Anthropic’s experiment based on their open soruce code, scaled down severely because of budget constraints. If I can get the results broadly in line with what they got, next I want to expand the options with some narrative-driven prompts, which, if my hypothesis works. should show significant reduction of observed misalignment.
Before doing this, ideally I’d like to make a post here explaining the hypothesis and suggested experiment in more detail. This would hopefully help me avoid blind spots and maybe add some more ideas for options to modify the experiment.
I would appreciate clarification on (a) if this is permitted and (b) if it is, then how I can make this post. Or if I can’t make a post at all as a newbie, then is there a suitable thread to put it in as a comment, either by just pasting the text or by putting it on Medium and pasting the link?
Thank you! Now, one question: is a degree of AI involvement acceptable?
Thing is, I have an AI story I wrote in 2012 that kinda “hit the bullseye”, but the thing is in Russian. I would get an English version done much quicker if I could use an LLM draft translation, but this disqualifies it from many places.
Since you wrote the original, machine translation (which is pretty decent these days) should be fine, because it’s not really generating the English version from scratch. Even Google Translate is okayish.
I don’t even want to post an unedited machine translation—there will be some edits and a modern “post mortem”, which will be AI-generated inworld but mostly written my me in RL.
Thanks! I couldn’t find that source, but “Chekhov’s Gun” is indeed mentioned in the original Anthropic post—albeit briefly. There’s also this Tumblr post (and the ensuing discussion, which I still need to digest fully), with a broader overview here.
While I have more reading to do, my provisional sense is that my main new proposal is to take the “narrative completion” hypothesis seriously as a core mechanism, and to explore how the experiment could be modified—rather than just joining the debate over the experiment’s validity.
I’m not convinced that “this experiment looks too much like a narrative to start with” means that narrative completion/fiction pattern-matching/Chekhov’s Gun effects aren’t important in practice. OpenAI’s recent “training for insecure coding/wrong answers” experiment (see here) arguably demonstrates similar effects in a more “realistic” domain.
Additionally, Elon Musk’s recent announcement about last-moment Grok 3.5/4 retraining on certain “divisive facts” (coverage) raises the prospect that narrative or cultural associations from that training could ripple into unrelated outputs. This is a short-term, real-world scenario from a major vendor—not a hypothetical.
That said, if the hypothesis has merit (and I emphasize that’s a big “if”), it’s worth empirical investigation. Given budget constraints, any experiment I run will necessarily be much smaller-scale compared to Anthropic’s original, but I hope to contribute constructively to the research discussion—not to spark a flame war. (The Musk example is relevant solely because potential effects of last-minute model training might “hit” in the short tiem, not for any commentary on Elon personally or politically.)
With this in mind, I guess it’s experiment first, full post second.
(Full disclosure: light editing and all link formatting was done by ChatGPT)
Hello!
I’m Misha, a “veteran” technical writer and an aspiring fiction writer. Yes, I am aware LessWrong is probably not the place to offer one’s fiction—maybe there are exceptions but I’m not aware of them. I have heard of LessWrong a lot, but didn’t join before because of perceived large volumes. However, I now hope to participate at least on the AI side.
I have been reading recent publications on AI misalignment, notably the big bangs from Anthropic https://www.anthropic.com/research/agentic-misalignment and OpenAI https://openai.com/index/emergent-misalignment/ .
I have my own hypothesis about a possible etiology of the observed cases of misalighment, alongside other theories (I don’t think it’s an all-or-nothing, an emergent behaviour can have several compounding etiologies).
My hypothesis involves “narrative completion”, that is, the LLM ending up in the “symantic well” of certain genres of fiction that it trained on and producing a likely continuation in the genre. Don Quixote read so many chivalry romance novels that he ended up living in one in his head; my suspicion is that this happens to LLMs rather easily.
I have not noticed this side of things discussed in the Anthropic and OpenAI papers, not could I find other papers discussing it.
I am gearing up to replicate Anthropic’s experiment based on their open soruce code, scaled down severely because of budget constraints. If I can get the results broadly in line with what they got, next I want to expand the options with some narrative-driven prompts, which, if my hypothesis works. should show significant reduction of observed misalignment.
Before doing this, ideally I’d like to make a post here explaining the hypothesis and suggested experiment in more detail. This would hopefully help me avoid blind spots and maybe add some more ideas for options to modify the experiment.
I would appreciate clarification on (a) if this is permitted and (b) if it is, then how I can make this post. Or if I can’t make a post at all as a newbie, then is there a suitable thread to put it in as a comment, either by just pasting the text or by putting it on Medium and pasting the link?
Thanks!
Short fiction on lesswrong isn’t uncommon
Thank you! Now, one question: is a degree of AI involvement acceptable?
Thing is, I have an AI story I wrote in 2012 that kinda “hit the bullseye”, but the thing is in Russian. I would get an English version done much quicker if I could use an LLM draft translation, but this disqualifies it from many places.
https://www.lesswrong.com/posts/KXujJjnmP85u8eM6B/policy-for-llm-writing-on-lesswrong
Thanks! Very useful
Since you wrote the original, machine translation (which is pretty decent these days) should be fine, because it’s not really generating the English version from scratch. Even Google Translate is okayish.
I don’t even want to post an unedited machine translation—there will be some edits and a modern “post mortem”, which will be AI-generated inworld but mostly written my me in RL.
I’ve heard that hypothesis in a review of that blog post of Anthropic, likely by
AI Explained
maybe by
bycloud
.
They’ve called it “Chekov’s gun”.
Thanks! I couldn’t find that source, but “Chekhov’s Gun” is indeed mentioned in the original Anthropic post—albeit briefly. There’s also this Tumblr post (and the ensuing discussion, which I still need to digest fully), with a broader overview here.
While I have more reading to do, my provisional sense is that my main new proposal is to take the “narrative completion” hypothesis seriously as a core mechanism, and to explore how the experiment could be modified—rather than just joining the debate over the experiment’s validity.
I’m not convinced that “this experiment looks too much like a narrative to start with” means that narrative completion/fiction pattern-matching/Chekhov’s Gun effects aren’t important in practice. OpenAI’s recent “training for insecure coding/wrong answers” experiment (see here) arguably demonstrates similar effects in a more “realistic” domain.
Additionally, Elon Musk’s recent announcement about last-moment Grok 3.5/4 retraining on certain “divisive facts” (coverage) raises the prospect that narrative or cultural associations from that training could ripple into unrelated outputs. This is a short-term, real-world scenario from a major vendor—not a hypothetical.
That said, if the hypothesis has merit (and I emphasize that’s a big “if”), it’s worth empirical investigation. Given budget constraints, any experiment I run will necessarily be much smaller-scale compared to Anthropic’s original, but I hope to contribute constructively to the research discussion—not to spark a flame war. (The Musk example is relevant solely because potential effects of last-minute model training might “hit” in the short tiem, not for any commentary on Elon personally or politically.)
With this in mind, I guess it’s experiment first, full post second.
(Full disclosure: light editing and all link formatting was done by ChatGPT)