Misha Ramendik comments on Open Thread—Summer 2025

Misha Ramendik 26 Jun 2025 1:27 UTC
5 points
0
Hello!
I’m Misha, a “veteran” technical writer and an aspiring fiction writer. Yes, I am aware LessWrong is probably not the place to offer one’s fiction—maybe there are exceptions but I’m not aware of them. I have heard of LessWrong a lot, but didn’t join before because of perceived large volumes. However, I now hope to participate at least on the AI side.
I have been reading recent publications on AI misalignment, notably the big bangs from Anthropic https://www.anthropic.com/research/agentic-misalignment and OpenAI https://openai.com/index/emergent-misalignment/ .
I have my own hypothesis about a possible etiology of the observed cases of misalighment, alongside other theories (I don’t think it’s an all-or-nothing, an emergent behaviour can have several compounding etiologies).
My hypothesis involves “narrative completion”, that is, the LLM ending up in the “symantic well” of certain genres of fiction that it trained on and producing a likely continuation in the genre. Don Quixote read so many chivalry romance novels that he ended up living in one in his head; my suspicion is that this happens to LLMs rather easily.
I have not noticed this side of things discussed in the Anthropic and OpenAI papers, not could I find other papers discussing it.
I am gearing up to replicate Anthropic’s experiment based on their open soruce code, scaled down severely because of budget constraints. If I can get the results broadly in line with what they got, next I want to expand the options with some narrative-driven prompts, which, if my hypothesis works. should show significant reduction of observed misalignment.
Before doing this, ideally I’d like to make a post here explaining the hypothesis and suggested experiment in more detail. This would hopefully help me avoid blind spots and maybe add some more ideas for options to modify the experiment.
I would appreciate clarification on (a) if this is permitted and (b) if it is, then how I can make this post. Or if I can’t make a post at all as a newbie, then is there a suitable thread to put it in as a comment, either by just pasting the text or by putting it on Medium and pasting the link?
Thanks!
- Cole Wyeth 28 Jun 2025 0:16 UTC
  6 points
  6
  Parent
  Short fiction on lesswrong isn’t uncommon
  - Misha Ramendik 30 Jun 2025 2:08 UTC
    1 point
    0
    Parent
    Thank you! Now, one question: is a degree of AI involvement acceptable?
    Thing is, I have an AI story I wrote in 2012 that kinda “hit the bullseye”, but the thing is in Russian. I would get an English version done much quicker if I could use an LLM draft translation, but this disqualifies it from many places.
    - Cole Wyeth 30 Jun 2025 2:15 UTC
      6 points
      2
      Parent
      https://www.lesswrong.com/posts/KXujJjnmP85u8eM6B/policy-for-llm-writing-on-lesswrong
      - Misha Ramendik 30 Jun 2025 22:43 UTC
        1 point
        0
        Parent
        Thanks! Very useful
    - duck_master 30 Jun 2025 2:29 UTC
      3 points
      1
      Parent
      Since you wrote the original, machine translation (which is pretty decent these days) should be fine, because it’s not really generating the English version from scratch. Even Google Translate is okayish.
      - Misha Ramendik 30 Jun 2025 22:42 UTC
        2 points
        0
        Parent
        I don’t even want to post an unedited machine translation—there will be some edits and a modern “post mortem”, which will be AI-generated inworld but mostly written my me in RL.
- Martin Vlach 29 Jun 2025 22:11 UTC
  2 points
  0
  Parent
  I’ve heard that hypothesis in a review of that blog post of Anthropic, likely by
  AI Explained
  maybe by
  bycloud
  .
  They’ve called it “Chekov’s gun”.
  - Misha Ramendik 30 Jun 2025 2:25 UTC
    2 points
    0
    Parent
    Thanks! I couldn’t find that source, but “Chekhov’s Gun” is indeed mentioned in the original Anthropic post—albeit briefly. There’s also this Tumblr post (and the ensuing discussion, which I still need to digest fully), with a broader overview here.
    While I have more reading to do, my provisional sense is that my main new proposal is to take the “narrative completion” hypothesis seriously as a core mechanism, and to explore how the experiment could be modified—rather than just joining the debate over the experiment’s validity.
    I’m not convinced that “this experiment looks too much like a narrative to start with” means that narrative completion/fiction pattern-matching/Chekhov’s Gun effects aren’t important in practice. OpenAI’s recent “training for insecure coding/wrong answers” experiment (see here) arguably demonstrates similar effects in a more “realistic” domain.
    Additionally, Elon Musk’s recent announcement about last-moment Grok 3.5/4 retraining on certain “divisive facts” (coverage) raises the prospect that narrative or cultural associations from that training could ripple into unrelated outputs. This is a short-term, real-world scenario from a major vendor—not a hypothetical.
    That said, if the hypothesis has merit (and I emphasize that’s a big “if”), it’s worth empirical investigation. Given budget constraints, any experiment I run will necessarily be much smaller-scale compared to Anthropic’s original, but I hope to contribute constructively to the research discussion—not to spark a flame war. (The Musk example is relevant solely because potential effects of last-minute model training might “hit” in the short tiem, not for any commentary on Elon personally or politically.)
    With this in mind, I guess it’s experiment first, full post second.
    
    (Full disclosure: light editing and all link formatting was done by ChatGPT)