
Emergent Misalignment

Tag · Last edit: 27 Feb 2026 3:20 UTC by RogerDearnaley

Training a model on narrow examples of misaligned behavior sometimes generalizes to broadly misaligned behavior, seemingly altering the assistant’s goals or persona rather than just teaching that specific behavior.

Experimental Evidence for Simulator Theory— Part 1: Emergent Misalignment and Weird Generalizations

RogerDearnaley · 23 Mar 2026 22:37 UTC
25 points
0 comments · 53 min read · LW link

Experimental Evidence for Simulator Theory— Part 2: The Scalers Strike Back

RogerDearnaley · 23 Mar 2026 22:37 UTC
21 points
0 comments · 34 min read · LW link

On Emergent Misalignment

Zvi · 28 Feb 2025 13:10 UTC
95 points
5 comments · 22 min read · LW link
(thezvi.wordpress.com)

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
118 points
19 comments · 5 min read · LW link

Will Any Crap Cause Emergent Misalignment?

J Bostock · 27 Aug 2025 18:20 UTC
198 points
38 comments · 3 min read · LW link

Self-Recognition Finetuning can Reverse and Prevent Emergent Misalignment

15 Mar 2026 0:11 UTC
51 points
23 comments · 7 min read · LW link

Context Awareness: Constitutional AI can mitigate Emergent Misalignment

2 Mar 2026 5:21 UTC
25 points
15 comments · 36 min read · LW link