Jonathan Uesato

Karma: 380

Towards training-time mitigations for alignment faking in RL

16 Dec 2025 21:01 UTC
33 points
1 comment · 5 min read · LW link
(alignment.anthropic.com)

Natural emergent misalignment from reward hacking in production RL

21 Nov 2025 20:00 UTC
262 points
32 comments · 9 min read · LW link

Importance of foresight evaluations within ELK

Jonathan Uesato · 6 Jan 2022 15:34 UTC
25 points
1 comment · 10 min read · LW link

Draft papers for REALab and Decoupled Approval on tampering

28 Oct 2020 16:01 UTC
47 points
2 comments · 1 min read · LW link