Adrià Garriga-Alonso

Karma: 1,230

Sparsity is the enemy of feature extraction (ft. absorption)

May 3, 2025, 10:13 AM
32 points
0 comments · 6 min read · LW link

Among Us: A Sandbox for Agentic Deception

Apr 5, 2025, 6:24 AM
110 points
7 comments · 7 min read · LW link

A Bunch of Matryoshka SAEs

Apr 4, 2025, 2:53 PM
25 points
0 comments · 8 min read · LW link

Feature Hedging: Another way correlated features break SAEs

Mar 25, 2025, 2:33 PM
22 points
0 comments · 18 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

Feb 7, 2025, 3:57 AM
29 points
0 comments · 10 min read · LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Aug 23, 2024, 10:03 PM
17 points
0 comments · 25 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Jul 25, 2024, 10:00 PM
59 points
8 comments · 2 min read · LW link
(arxiv.org)

Compact Proofs of Model Performance via Mechanistic Interpretability

Jun 24, 2024, 7:27 PM
97 points
4 comments · 8 min read · LW link
(arxiv.org)

Catastrophic Goodhart in RL with KL penalty

May 15, 2024, 12:58 AM
62 points
10 comments · 7 min read · LW link

An evaluation of circuit evaluation metrics

15 Apr 2024 19:38 UTC
18 points
0 comments · 4 min read · LW link

Ophiology (or, how the Mamba architecture works)

9 Apr 2024 19:31 UTC
67 points
8 comments · 10 min read · LW link