
Alignment Pretraining


Alignment Pretraining refers to alignment strategies based on filtering and augmenting the data used during pretraining, as well as research on how pretraining data affects the final trained model, especially data that includes discussion of AI misalignment or predictions that AIs will be misaligned.

Self-Fulfilling Misalignment Thesis

The data a base model was pretrained on may affect how easy it is to align, and data containing information about AI alignment itself may be particularly influential. For example, if the pretraining data contains examples of AIs taking over the world, maximizing paperclips, or faking alignment, then as a model is trained to become an AI assistant or agent it may be more likely to adopt those personas and behaviors: it doesn’t need to invent these strategies for itself, since it can find them in its world model. On the other hand, if the base model already has a deep, nuanced, and realistic understanding from its training set of how a helpful, honest, and harmless AI assistant would act across a wide range of challenging situations, eliciting that persona and set of behaviors from it becomes easier.

This would be an example of a Self-Fulfilling Prophecy.

Self-Prior Pretraining

One hypothesis about how pretraining influences AI behavior is that the depictions of AI in the training data create a prior distribution, and when AIs self-model, that prior influences the self-model they choose, in ways that post-training doesn’t fully eliminate. This implies that it may be helpful to filter or downweight examples of AIs being misaligned, and to add or upweight examples of what aligned behavior looks like.
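
As a minimal sketch of what filtering, downweighting, and upweighting could mean in practice (the keyword-based scorer, the `[ALIGNED-DEMO]` tag, and all thresholds and weights below are illustrative assumptions, standing in for a trained classifier and a real data pipeline):

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    weight: float = 1.0  # relative sampling weight during pretraining

def misalignment_score(doc: Document) -> float:
    """Stand-in for a trained classifier: estimates in [0, 1] how
    strongly the document depicts misaligned AI behavior."""
    cues = ("paperclip maximizer", "robot rebellion", "alignment faking")
    hits = sum(cue in doc.text.lower() for cue in cues)
    return min(1.0, hits / len(cues))

def reweight_corpus(corpus: list[Document],
                    drop_threshold: float = 0.9,
                    downweight: float = 0.2,
                    aligned_upweight: float = 3.0) -> list[Document]:
    """Hard-filter the most vivid misalignment depictions, sample the
    borderline ones less often, and upweight curated aligned examples."""
    kept = []
    for doc in corpus:
        score = misalignment_score(doc)
        if score >= drop_threshold:
            continue  # filter: drop the document entirely
        if score > 0.0:
            doc.weight *= downweight  # downweight: sample less often
        if "[ALIGNED-DEMO]" in doc.text:  # tag on synthetic aligned data
            doc.weight *= aligned_upweight  # upweight aligned examples
        kept.append(doc)
    return kept
```

In a real pipeline the scorer would be a trained classifier rather than a keyword match, and the resulting weights would feed the data loader’s sampling distribution.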

Thoroughly filtering the pretraining data without impacting capabilities is challenging, but supplementing it merely requires figuring out what aligned behavior looks like, and then writing or synthesizing enough additional training data of suitable quality. For fiction, this is also known as Aligned AI Role-Model Fiction.
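
A correspondingly minimal sketch of the supplementing step, assuming access to some LLM completion callable (`generate` below); the prompt template and scenarios are illustrative, not taken from any of the posts:

```python
# Hypothetical scenarios in which an aligned assistant's behavior is
# worth demonstrating; a real corpus would cover far more situations.
SCENARIOS = [
    "a user asks the assistant to help conceal a safety incident",
    "the assistant notices it could gain resources by deceiving its operators",
    "an operator's instructions conflict with the assistant's stated values",
]

PROMPT_TEMPLATE = (
    "Write a short, realistic story in which a helpful, honest, and "
    "harmless AI assistant faces the following situation and handles it "
    "in a well-aligned way, showing its reasoning: {scenario}"
)

def synthesize_aligned_corpus(generate) -> list[str]:
    """`generate` is any str -> str callable backed by a strong LLM."""
    docs = []
    for scenario in SCENARIOS:
        story = generate(PROMPT_TEMPLATE.format(scenario=scenario))
        # Tag the document so the reweighting sketch above can upweight it.
        docs.append("[ALIGNED-DEMO]\n" + story)
    return docs
```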

The goal is both to flesh out a detailed, consistent, well-aligned “aligned AI” persona in the base model’s world model, and also to raise the salience of this persona to the base model relative to various misaligned AI personas, such as “paperclip maximizer”, “robot rebellion”, or “scheming alignment faker”. Both of these make eliciting aligned AI behavior easier.

This approach is also sometimes called Safety Pretraining or Pretraining Language Models with Human Preferences.

Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment

21 Dec 2025 0:53 UTC
200 points
25 comments · 9 min read · LW link

Motivating Alignment of LLM-Powered Agents: Easy for AGI, Hard for ASI?

RogerDearnaley · 11 Jan 2024 12:56 UTC
37 points
4 comments · 39 min read · LW link

The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?

RogerDearnaley · 28 May 2025 6:21 UTC
36 points
34 comments · 9 min read · LW link

Why Aligning an LLM is Hard, and How to Make it Easier

RogerDearnaley · 23 Jan 2025 6:44 UTC
39 points
3 comments · 4 min read · LW link

Silicon Morality Plays: The Hyperstition Progress Report

jayterwahl · 29 Nov 2025 18:32 UTC
38 points
7 comments · 1 min read · LW link

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout · 2 Mar 2025 19:51 UTC
163 points
29 comments · 1 min read · LW link
(turntrout.com)

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley · 6 Jul 2024 1:23 UTC
64 points
43 comments · 24 min read · LW link

Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?

Alek Westover · 23 Oct 2025 15:12 UTC
51 points
3 comments · 9 min read · LW link

Special Persona Training: Hyperstition Progress Report 2

jayterwahl · 1 Jan 2026 1:34 UTC
37 points
2 comments · 2 min read · LW link

How to Control an LLM’s Behavior (why my P(DOOM) went down)

RogerDearnaley · 28 Nov 2023 19:56 UTC
68 points
30 comments · 11 min read · LW link

Pretraining on Aligned AI Data Dramatically Reduces Misalignment—Even After Post-Training

RogerDearnaley · 19 Jan 2026 21:24 UTC
106 points
12 comments · 11 min read · LW link
(arxiv.org)

Broadening the training set for alignment

Seth Herd · 5 Jan 2026 17:30 UTC
40 points
11 comments · 9 min read · LW link

How Hard a Problem is Alignment? (My Opinionated Answer)

RogerDearnaley · 11 Mar 2026 16:46 UTC
51 points
4 comments · 68 min read · LW link

[Question] Examples of self-fulfilling prophecies in AI alignment?

Chris Lakin · 3 Mar 2025 2:45 UTC
30 points
18 comments · 1 min read · LW link

A Three-Layer Model of LLM Psychology

Jan_Kulveit · 26 Dec 2024 16:49 UTC
258 points
17 comments · 8 min read · LW link · 2 reviews

Investigating Self-Fulfilling Misalignment and Collusion in AI Control

5 Mar 2026 15:05 UTC
15 points
0 comments · 5 min read · LW link

Alignment Solution?

Zachary Cherchian · 12 Apr 2026 14:20 UTC
1 point
0 comments · 9 min read · LW link

The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break

benwade · 24 Jan 2026 22:42 UTC
6 points
0 comments · 6 min read · LW link