Emergent Misalignment

Tag
Last edit: 27 Feb 2026 3:20 UTC by RogerDearnaley

Training on narrow examples of misaligned behavior sometimes generalizes to broadly misaligned behavior, seemingly altering the assistant’s goals or persona rather than just teaching that specific behavior.

Experimental Evidence for Simulator Theory— Part 1: Emergent Misalignment and Weird Generalizations

RogerDearnaley
23 Mar 2026 22:37 UTC
25 points
0 comments
53 min read

Experimental Evidence for Simulator Theory— Part 2: The Scalers Strike Back

RogerDearnaley
23 Mar 2026 22:37 UTC
21 points
0 comments
34 min read

On Emergent Misalignment

Zvi
28 Feb 2025 13:10 UTC
95 points
5 comments
22 min read
(thezvi.wordpress.com)

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
120 points
19 comments
5 min read

Will Any Crap Cause Emergent Misalignment?

J Bostock
27 Aug 2025 18:20 UTC
204 points
38 comments
3 min read

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

15 Mar 2026 0:11 UTC
47 points
24 comments
7 min read

Emergent misalignment evident in activations at low poisoning doses—long before behavioral checks flag it

burnssa
27 Apr 2026 1:15 UTC
15 points
0 comments
5 min read

Context Awareness: Constitutional AI can mitigate Emergent Misalignment

2 Mar 2026 5:21 UTC
25 points
18 comments
36 min read