
Akbir Khan

Karma: 353

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

8 Apr 2025 17:32 UTC
146 points
20 comments · 12 min read · LW link

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
141 points
15 comments · 13 min read · LW link

Debating with More Persuasive LLMs Leads to More Truthful Answers

7 Feb 2024 21:28 UTC
89 points
14 comments · 9 min read · LW link
(arxiv.org)

Why multi-agent safety is important

Akbir Khan · 14 Jun 2022 9:23 UTC
10 points
2 comments · 10 min read · LW link

Why we need prosocial agents

Akbir Khan · 2 Nov 2021 15:19 UTC
7 points
0 comments · 2 min read · LW link