Anna Soligo

Karma: 346

Narrow Misalignment is Hard, Emergent Misalignment is Easy

14 Jul 2025 21:05 UTC
133 points
24 comments, 5 min read

Convergent Linear Representations of Emergent Misalignment

16 Jun 2025 15:47 UTC
74 points
1 comment, 8 min read

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
112 points
18 comments, 5 min read

FLAKE-Bench: Outsourcing Awkwardness in the Age of AI

1 Apr 2025 17:08 UTC
37 points
0 comments, 2 min read

[Replication] Crosscoder-based Stage-Wise Model Diffing

22 Mar 2025 18:35 UTC
24 points
0 comments, 7 min read