gersonkroiz
Karma: 116
gersonkroiz’s Shortform
gersonkroiz · 21 Jan 2026 3:57 UTC · 2 points · 9 comments · 1 min read · LW link
Principled Interpretability of Reward Hacking in Closed Frontier Models
gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda · 1 Jan 2026 16:37 UTC · 22 points · 0 comments · 23 min read · LW link
Can Models be Evaluation Aware Without Explicit Verbalization?
gersonkroiz, Greg Kocher and Tim Hua · 8 Nov 2025 18:26 UTC · 24 points · 10 comments · 8 min read · LW link