Hoagy

Karma: 1,129

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
5 points
0 comments · 5 min read · LW link
(alignment.anthropic.com)

Training fails to elicit subtle reasoning in current language models

9 Oct 2025 19:04 UTC
49 points
3 comments · 4 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
142 points
15 comments · 13 min read · LW link

Some additional SAE thoughts

Hoagy · 13 Jan 2024 19:31 UTC
31 points
4 comments · 13 min read · LW link