Mia Taylor

Karma: 137

Harmless reward hacks can generalize to misalignment in LLMs

26 Aug 2025 17:32 UTC
46 points
6 comments, 7 min read

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
110 points
15 comments, 5 min read