Adrià Garriga-alonso

Karma: 1,227

Sparsity is the enemy of feature extraction (ft. absorption)

May 3, 2025, 10:13 AM
31 points
0 comments, 6 min read, LW link

Among Us: A Sandbox for Agentic Deception

Apr 5, 2025, 6:24 AM
110 points
7 comments, 7 min read, LW link

A Bunch of Matryoshka SAEs

Apr 4, 2025, 2:53 PM
24 points
0 comments, 8 min read, LW link

Feature Hedging: Another way correlated features break SAEs

Mar 25, 2025, 2:33 PM
22 points
0 comments, 18 min read, LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

Feb 7, 2025, 3:57 AM
29 points
0 comments, 10 min read, LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Aug 23, 2024, 10:03 PM
17 points
0 comments, 25 min read, LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Jul 25, 2024, 10:00 PM
59 points
8 comments, 2 min read, LW link
(arxiv.org)

Compact Proofs of Model Performance via Mechanistic Interpretability

Jun 24, 2024, 7:27 PM
96 points
4 comments, 8 min read, LW link
(arxiv.org)

Catastrophic Goodhart in RL with KL penalty

May 15, 2024, 12:58 AM
62 points
10 comments, 7 min read, LW link

An evaluation of circuit evaluation metrics

Apr 15, 2024, 7:38 PM
18 points
0 comments, 4 min read, LW link

Ophiology (or, how the Mamba architecture works)

Apr 9, 2024, 7:31 PM
67 points
8 comments, 10 min read, LW link

Does literacy remove your ability to be a bard as good as Homer?

Adrià Garriga-alonso
Jan 18, 2024, 3:43 AM
51 points
19 comments, 3 min read, LW link

Thomas Kwa’s research journal

Nov 23, 2023, 5:11 AM
79 points
1 comment, 6 min read, LW link

On Frequentism and Bayesian Dogma

Oct 15, 2023, 10:23 PM
59 points
27 comments, 6 min read, LW link

A comparison of causal scrubbing, causal abstractions, and related methods

Jun 8, 2023, 11:40 PM
73 points
3 comments, 22 min read, LW link

Causal scrubbing: results on induction heads

Dec 3, 2022, 12:59 AM
34 points
1 comment, 17 min read, LW link

Causal scrubbing: results on a paren balance checker

Dec 3, 2022, 12:59 AM
34 points
2 comments, 30 min read, LW link

Causal scrubbing: Appendix

Dec 3, 2022, 12:58 AM
18 points
4 comments, 20 min read, LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Dec 3, 2022, 12:58 AM
206 points
35 comments, 20 min read, LW link, 1 review

The No Free Lunch theorems and their Razor

Adrià Garriga-alonso
May 24, 2022, 6:40 AM
56 points
3 comments, 9 min read, LW link