Robert Kirk
Karma: 28
Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking
Isaac Dunn, Kei Nishimura-Gasparian, Carson Denison, Ethan Perez and Robert Kirk
22 Dec 2025 19:33 UTC, 17 points, 0 comments, 1 min read
Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking
Isaac Dunn, Kei Nishimura-Gasparian, Carson Denison, Ethan Perez and Robert Kirk
22 Dec 2025 19:32 UTC, 15 points, 0 comments, 30 min read
Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations
smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk and Adam Gleave
4 Jul 2025 0:07 UTC, 13 points, 1 comment, 4 min read
(far.ai)