Sam Bowman

Karma: 1,902

https://cims.nyu.edu/~sbowman/

Building and evaluating alignment auditing agents

24 Jul 2025 19:22 UTC
46 points
1 comment · 5 min read · LW link

Putting up Bumpers

Sam Bowman · 23 Apr 2025 16:05 UTC
54 points
14 comments · 2 min read · LW link

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
141 points
15 comments · 13 min read · LW link

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
489 points
75 comments · 10 min read · LW link

Sabotage Evaluations for Frontier Models

18 Oct 2024 22:33 UTC
95 points
56 comments · 6 min read · LW link
(assets.anthropic.com)

The Checklist: What Succeeding at AI Safety Will Involve

Sam Bowman · 3 Sep 2024 18:18 UTC
151 points
49 comments · 22 min read · LW link
(sleepinyourhat.github.io)

Simple probes can catch sleeper agents

23 Apr 2024 21:10 UTC
133 points
21 comments · 1 min read · LW link
(www.anthropic.com)

LLM Evaluators Recognize and Favor Their Own Generations

17 Apr 2024 21:09 UTC
46 points
1 comment · 3 min read · LW link
(tiny.cc)

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
89 points
14 comments · 9 min read · LW link
(arxiv.org)

Measuring and Improving the Faithfulness of Model-Generated Reasoning

18 Jul 2023 16:36 UTC
111 points
15 comments · 6 min read · LW link · 1 review