
Senthooran Rajamanoharan

Karma: 1,243

Announcing Gemma Scope 2

22 Dec 2025 21:56 UTC
88 points
1 comment · 2 min read · LW link

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
24 points
0 comments · 9 min read · LW link

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
62 points
1 comment · 14 min read · LW link

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
134 points
36 comments · 27 min read · LW link

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
53 points
0 comments · 20 min read · LW link

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

23 Jul 2025 14:57 UTC
79 points
8 comments · 5 min read · LW link

Narrow Misalignment is Hard, Emergent Misalignment is Easy

14 Jul 2025 21:05 UTC
133 points
24 comments · 5 min read · LW link

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

14 Jul 2025 14:52 UTC
80 points
19 comments · 11 min read · LW link

Convergent Linear Representations of Emergent Misalignment

16 Jun 2025 15:47 UTC
75 points
1 comment · 8 min read · LW link

Model Organisms for Emergent Misalignment

16 Jun 2025 15:46 UTC
112 points
18 comments · 5 min read · LW link

Interim Research Report: Mechanisms of Awareness

2 May 2025 20:29 UTC
43 points
6 comments · 8 min read · LW link

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

26 Mar 2025 19:07 UTC
115 points
15 comments · 29 min read · LW link
(deepmindsafetyresearch.medium.com)

Takeaways From Our Recent Work on SAE Probing

3 Mar 2025 19:50 UTC
30 points
4 comments · 5 min read · LW link

SAE Probing: What is it good for?

1 Nov 2024 19:23 UTC
34 points
0 comments · 11 min read · LW link