Jonathan Uesato
Karma: 380
Towards training-time mitigations for alignment faking in RL
Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger and evhub
16 Dec 2025 21:01 UTC, 33 points, 1 comment, 5 min read
(alignment.anthropic.com)
Natural emergent misalignment from reward hacking in production RL
evhub, Monte M, Benjamin Wright and Jonathan Uesato
21 Nov 2025 20:00 UTC, 262 points, 32 comments, 9 min read
Importance of foresight evaluations within ELK
Jonathan Uesato
6 Jan 2022 15:34 UTC, 25 points, 1 comment, 10 min read
Draft papers for REALab and Decoupled Approval on tampering
Jonathan Uesato and Ramana Kumar
28 Oct 2020 16:01 UTC, 47 points, 2 comments, 1 min read