RSS

Senthooran Rajamanoharan

Karma: 970

Steer­ing Out-of-Distri­bu­tion Gen­er­al­iza­tion with Con­cept Abla­tion Fine-Tuning

23 Jul 2025 14:57 UTC
78 points
3 comments5 min readLW link

Nar­row Misal­ign­ment is Hard, Emer­gent Misal­ign­ment is Easy

14 Jul 2025 21:05 UTC
129 points
23 comments5 min readLW link

Self-preser­va­tion or In­struc­tion Am­bi­guity? Ex­am­in­ing the Causes of Shut­down Resistance

14 Jul 2025 14:52 UTC
66 points
18 comments11 min readLW link

Con­ver­gent Lin­ear Rep­re­sen­ta­tions of Emer­gent Misalignment

16 Jun 2025 15:47 UTC
66 points
0 comments8 min readLW link

Model Or­ganisms for Emer­gent Misalignment

16 Jun 2025 15:46 UTC
110 points
15 comments5 min readLW link

In­terim Re­search Re­port: Mechanisms of Awareness

2 May 2025 20:29 UTC
43 points
6 comments8 min readLW link

Nega­tive Re­sults for SAEs On Down­stream Tasks and Depri­ori­tis­ing SAE Re­search (GDM Mech In­terp Team Progress Up­date #2)

26 Mar 2025 19:07 UTC
113 points
15 comments29 min readLW link
(deepmindsafetyresearch.medium.com)

Take­aways From Our Re­cent Work on SAE Probing

3 Mar 2025 19:50 UTC
30 points
4 comments5 min readLW link

SAE Prob­ing: What is it good for?

1 Nov 2024 19:23 UTC
34 points
0 comments11 min readLW link

JumpReLU SAEs + Early Ac­cess to Gemma 2 SAEs

19 Jul 2024 16:10 UTC
55 points
10 comments1 min readLW link
(storage.googleapis.com)

Im­prov­ing Dic­tionary Learn­ing with Gated Sparse Autoencoders

25 Apr 2024 18:43 UTC
63 points
38 comments1 min readLW link
(arxiv.org)

[Full Post] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
80 points
10 comments8 min readLW link

[Sum­mary] Progress Up­date #1 from the GDM Mech In­terp Team

19 Apr 2024 19:06 UTC
73 points
0 comments3 min readLW link

Case Stud­ies in Re­v­erse-Eng­ineer­ing Sparse Au­toen­coder Fea­tures by Us­ing MLP Linearization

14 Jan 2024 2:06 UTC
24 points
0 comments42 min readLW link

Fact Find­ing: Do Early Lay­ers Spe­cial­ise in Lo­cal Pro­cess­ing? (Post 5)

23 Dec 2023 2:46 UTC
18 points
0 comments4 min readLW link

Fact Find­ing: How to Think About In­ter­pret­ing Me­mori­sa­tion (Post 4)

23 Dec 2023 2:46 UTC
22 points
0 comments9 min readLW link

Fact Find­ing: Try­ing to Mechanis­ti­cally Un­der­stand­ing Early MLPs (Post 3)

23 Dec 2023 2:46 UTC
10 points
1 comment16 min readLW link

Fact Find­ing: Sim­plify­ing the Cir­cuit (Post 2)

23 Dec 2023 2:45 UTC
25 points
3 comments14 min readLW link

Fact Find­ing: At­tempt­ing to Re­v­erse-Eng­ineer Fac­tual Re­call on the Neu­ron Level (Post 1)

23 Dec 2023 2:44 UTC
106 points
10 comments22 min readLW link2 reviews