
Joe Benton

Karma: 234

Reasoning models don’t always say what they think

9 Apr 2025 19:48 UTC
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Do models say what they learn?

22 Mar 2025 15:19 UTC
116 points
12 comments · 13 min read · LW link

Won’t vs. Can’t: Sandbagging-like Behavior from Claude Models

19 Feb 2025 20:47 UTC
15 points
1 comment · 1 min read · LW link
(alignment.anthropic.com)

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
95 points
56 comments · 6 min read · LW link
(assets.anthropic.com)