Anna Soligo

Karma: 331

Narrow Misalignment is Hard, Emergent Misalignment is Easy

14 Jul 2025 21:05 UTC
129 points
23 comments, 5 min read

Convergent Linear Representations of Emergent Misalignment

16 Jun 2025 15:47 UTC
65 points
0 comments, 8 min read

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
109 points
13 comments, 5 min read

FLAKE-Bench: Outsourcing Awkwardness in the Age of AI

1 Apr 2025 17:08 UTC
37 points
0 comments, 2 min read

[Replication] Crosscoder-based Stage-Wise Model Diffing

22 Mar 2025 18:35 UTC
19 points
0 comments, 7 min read