gersonkroiz

Karma: 184

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
42 points
0 comments · 10 min read

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
51 points
1 comment · 78 min read

gersonkroiz’s Shortform

gersonkroiz · 21 Jan 2026 3:57 UTC
2 points
9 comments · 1 min read

Principled Interpretability of Reward Hacking in Closed Frontier Models

1 Jan 2026 16:37 UTC
24 points
0 comments · 23 min read

Can Models be Evaluation Aware Without Explicit Verbalization?

8 Nov 2025 18:26 UTC
26 points
10 comments · 8 min read