In the spirit of "Self-fulfilling misalignment data might be poisoning our AI models": what are historical examples of self-fulfilling prophecies that have affected AI alignment and development?
Put a few potential examples below to seed discussion.
https://x.com/sama/status/1621621724507938816
https://www.lesswrong.com/posts/TcfyGD2aKdZ7Rt3hk/alignment-pretraining-ai-discourse-causes-self-fulfilling
Training on Documents About Reward Hacking Induces Reward Hacking
Situational Awareness and race dynamics? h/t Jan Kulveit (@Jan_Kulveit)
Situational Awareness probably caused Project Stargate to some extent. Getting the Republican party to take AI seriously enough to let them launch in the White House is no joke and less likely without the essay.
It also started the website-essay meta which is part of why AI 2027, The Compendium, and Gradual Disempowerment all launched the way they did, so there are knock-on effects too.
nostalgebraist’s self-fulfilling way of getting personas into a language model
https://x.com/g_leech_/status/1984587261120233577
Aaron Silverbrook is going for it:
(tweet)
https://www.hyperstitionai.com
Superintelligence Strategy is pretty explicitly trying to be self-fulfilling, e.g. “This dynamic stabilizes the strategic landscape without lengthy treaty negotiations—all that is necessary is that states collectively recognize their strategic situation” (a strategic situation the paper itself is arguing into popular existence)
https://x.com/saffronhuang/status/1907863453009867183
https://x.com/AnthropicAI/status/2052808801040859392 ?
https://x.com/tomekkorbak/status/2038704753887379891
https://alignment.openai.com/how-far-does-alignment-midtraining-generalize/
https://x.com/nabeelqu/status/2035051760746659904
Promoting enmity and bad vibes around AI safety (more in Richard Ngo’s thread: @RichardMCNgo quote-tweeting Sam Altman @sama, Mar 13, 2022)
gavin leech:
https://x.com/g_leech_/status/2027416051198083142
https://www.darioamodei.com/essay/the-adolescence-of-technology
Grok’s system prompts lately, which read kind of like “don’t think about elephants” instructions
https://x.com/bitcloud/status/1942792945238983022