Senthooran Rajamanoharan

Karma: 1,859

How well do models follow their constitutions?

12 Mar 2026 0:07 UTC
87 points
4 comments · 26 min read · LW link

Current activation oracles are hard to use

3 Mar 2026 19:33 UTC
77 points
3 comments · 16 min read · LW link

How to Design Environments for Understanding Model Motives

2 Mar 2026 7:14 UTC
39 points
0 comments · 10 min read · LW link

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

27 Feb 2026 3:20 UTC
49 points
1 comment · 78 min read · LW link

models have some pretty funny attractor states

12 Feb 2026 21:14 UTC
266 points
38 comments · 18 min read · LW link

Test your interpretability techniques by de-censoring Chinese models

15 Jan 2026 16:33 UTC
90 points
14 comments · 20 min read · LW link

Brief Explorations in LLM Value Rankings

12 Jan 2026 18:16 UTC
37 points
1 comment · 11 min read · LW link

Principled Interpretability of Reward Hacking in Closed Frontier Models

1 Jan 2026 16:37 UTC
24 points
0 comments · 23 min read · LW link

Announcing Gemma Scope 2

22 Dec 2025 21:56 UTC
95 points
1 comment · 2 min read · LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
37 points
0 comments · 9 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
66 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
131 points
39 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
55 points
0 comments · 20 min read · LW link

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
79 points
8 comments · 5 min read · LW link

Narrow Misalignment is Hard, Emergent Misalignment is Easy

14 Jul 2025 21:05 UTC
134 points
24 comments · 5 min read · LW link

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

14 Jul 2025 14:52 UTC
86 points
19 comments · 11 min read · LW link

Convergent Linear Representations of Emergent Misalignment

16 Jun 2025 15:47 UTC
76 points
1 comment · 8 min read · LW link

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
118 points
19 comments · 5 min read · LW link

Interim Research Report: Mechanisms of Awareness

2 May 2025 20:29 UTC
43 points
6 comments · 8 min read · LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
116 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)