aditya singh

Karma: 102

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
42 points
0 comments
10 min read

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
51 points
1 comment
78 min read

Principled Interpretability of Reward Hacking in Closed Frontier Models

1 Jan 2026 16:37 UTC
24 points
0 comments
23 min read