
Rohin Shah

Karma: 16,171

Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15 Jul 2025 16:23 UTC
166 points
32 comments · 1 min read · LW link
(bit.ly)

Evaluating and monitoring for AI scheming

10 Jul 2025 14:24 UTC
52 points
9 comments · 5 min read · LW link
(deepmindsafetyresearch.medium.com)

Google DeepMind: An Approach to Technical AGI Safety and Security

Rohin Shah · 5 Apr 2025 22:00 UTC
73 points
12 comments · 18 min read · LW link
(arxiv.org)

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
113 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

AGI Safety & Alignment @ Google DeepMind is hiring

Rohin Shah · 17 Feb 2025 21:11 UTC
102 points
19 comments · 10 min read · LW link

A short course on AGI safety from the GDM Alignment team

14 Feb 2025 15:43 UTC
103 points
2 comments · 1 min read · LW link
(deepmindsafetyresearch.medium.com)

MONA: Managed Myopia with Approval Feedback

23 Jan 2025 12:24 UTC
81 points
30 comments · 9 min read · LW link

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work

20 Aug 2024 16:22 UTC
222 points
33 comments · 9 min read · LW link

On scalable oversight with weak LLMs judging strong LLMs

8 Jul 2024 8:59 UTC
49 points
18 comments · 7 min read · LW link
(arxiv.org)

Improving Dictionary Learning with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments · 1 min read · LW link
(arxiv.org)

AtP*: An efficient and scalable method for localizing LLM behaviour to components

18 Mar 2024 17:28 UTC
19 points
0 comments · 1 min read · LW link
(arxiv.org)

Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5)

23 Dec 2023 2:46 UTC
18 points
0 comments · 4 min read · LW link

Fact Finding: How to Think About Interpreting Memorisation (Post 4)

23 Dec 2023 2:46 UTC
22 points
0 comments · 9 min read · LW link

Fact Finding: Trying to Mechanistically Understand Early MLPs (Post 3)

23 Dec 2023 2:46 UTC
10 points
1 comment · 16 min read · LW link

Fact Finding: Simplifying the Circuit (Post 2)

23 Dec 2023 2:45 UTC
25 points
3 comments · 14 min read · LW link

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

23 Dec 2023 2:44 UTC
106 points
10 comments · 22 min read · LW link · 2 reviews

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

18 Dec 2023 11:58 UTC
149 points
21 comments · 10 min read · LW link

Explaining grokking through circuit efficiency

8 Sep 2023 14:39 UTC
101 points
11 comments · 3 min read · LW link
(arxiv.org)

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

20 Jul 2023 10:50 UTC
44 points
3 comments · 2 min read · LW link
(arxiv.org)

Shah (DeepMind) and Leahy (Conjecture) Discuss Alignment Cruxes

1 May 2023 16:47 UTC
96 points
10 comments · 30 min read · LW link