gersonkroiz

Karma: 116

gersonkroiz’s Shortform

gersonkroiz · 21 Jan 2026 3:57 UTC
2 points
9 comments · 1 min read

Principled Interpretability of Reward Hacking in Closed Frontier Models

1 Jan 2026 16:37 UTC
22 points
0 comments · 23 min read

Can Models be Evaluation Aware Without Explicit Verbalization?

8 Nov 2025 18:26 UTC
24 points
10 comments · 8 min read