
Rohin Shah

Karma: 16,171

Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15 Jul 2025 16:23 UTC
166 points
32 comments · 1 min read · LW link
(bit.ly)

Evaluating and monitoring for AI scheming

10 Jul 2025 14:24 UTC
52 points
9 comments · 5 min read · LW link
(deepmindsafetyresearch.medium.com)

Google DeepMind: An Approach to Technical AGI Safety and Security

Rohin Shah · 5 Apr 2025 22:00 UTC
73 points
12 comments · 18 min read · LW link
(arxiv.org)

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
113 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

AGI Safety & Alignment @ Google DeepMind is hiring

Rohin Shah · 17 Feb 2025 21:11 UTC
102 points
19 comments · 10 min read · LW link

A short course on AGI safety from the GDM Alignment team

14 Feb 2025 15:43 UTC
103 points
2 comments · 1 min read · LW link
(deepmindsafetyresearch.medium.com)

MONA: Managed Myopia with Approval Feedback

23 Jan 2025 12:24 UTC
81 points
30 comments · 9 min read · LW link

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

20 Aug 2024 16:22 UTC
222 points
33 comments · 9 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

8 Jul 2024 8:59 UTC
49 points
18 comments · 7 min read · LW link
(arxiv.org)

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments · 1 min read · LW link
(arxiv.org)

AtP*: An efficient and scalable method for localizing LLM behaviour to components

18 Mar 2024 17:28 UTC
19 points
0 comments · 1 min read · LW link
(arxiv.org)

Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

23 Dec 2023 2:46 UTC
18 points
0 comments · 4 min read · LW link

Fact Finding: How to Think About Interpreting Memorisation (Post 4)

23 Dec 2023 2:46 UTC
22 points
0 comments · 9 min read · LW link

Fact Finding: Trying to Mechanistically Understand Early MLPs (Post 3)

23 Dec 2023 2:46 UTC
10 points
1 comment · 16 min read · LW link

Fact Finding: Simplifying the Circuit (Post 2)

23 Dec 2023 2:45 UTC
25 points
3 comments · 14 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
106 points
10 comments · 22 min read · LW link · 2 reviews

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

18 Dec 2023 11:58 UTC
149 points
21 comments · 10 min read · LW link

Explaining grokking through circuit efficiency

8 Sep 2023 14:39 UTC
101 points
11 comments · 3 min read · LW link
(arxiv.org)

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

20 Jul 2023 10:50 UTC
44 points
3 comments · 2 min read · LW link
(arxiv.org)

Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes

1 May 2023 16:47 UTC
96 points
10 comments · 30 min read · LW link