gersonkroiz

Karma: 202

How to Design Environments for Understanding Model Motives

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

2 Mar 2026 7:14 UTC

46 points

0 comments, 10 min read

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

aditya singh, gersonkroiz, Senthooran Rajamanoharan and Neel Nanda

27 Feb 2026 3:20 UTC

60 points

12 comments, 78 min read

gersonkroiz’s Shortform

gersonkroiz

21 Jan 2026 3:57 UTC

2 points

9 comments, 1 min read

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

1 Jan 2026 16:37 UTC

25 points

0 comments, 23 min read

Can Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz, Greg Kocher and Tim Hua

8 Nov 2025 18:26 UTC

26 points

10 comments, 8 min read