
Adrià Garriga-alonso

Karma: 1,751

How much superposition is there?

18 Feb 2026 13:53 UTC
25 points
0 comments · 3 min read · LW link

Circuit discovery through chain of thought using policy gradients

Adrià Garriga-alonso · 29 Nov 2025 7:56 UTC
15 points
2 comments · 4 min read · LW link

Will We Get Alignment by Default? — with Adrià Garriga-Alonso

27 Nov 2025 19:19 UTC
50 points
3 comments · 1 min read · LW link
(simonlermen.substack.com)

Spatially distributed consciousness is not an abstract thought experiment if AI is conscious

Adrià Garriga-alonso · 26 Nov 2025 4:59 UTC
17 points
1 comment · 4 min read · LW link
(open.substack.com)

Alignment will happen by default. What’s next?

Adrià Garriga-alonso · 25 Nov 2025 7:14 UTC
99 points
133 comments · 6 min read · LW link
(open.substack.com)

Infinitesimally False

21 Nov 2025 4:57 UTC
52 points
16 comments · 12 min read · LW link

A scheme to credit hack policy gradient training

Adrià Garriga-alonso · 7 Nov 2025 6:24 UTC
15 points
0 comments · 5 min read · LW link

Anthropic’s JumpReLU training method is really good

3 Oct 2025 15:23 UTC
51 points
2 comments · 2 min read · LW link

A recurrent CNN finds maze paths by filling dead-ends

Adrià Garriga-alonso · 15 Sep 2025 20:49 UTC
19 points
0 comments · 2 min read · LW link

The “Sparsity vs Reconstruction Tradeoff” Illusion

26 Aug 2025 4:39 UTC
21 points
0 comments · 4 min read · LW link

L0 is not a neutral hyperparameter

19 Jul 2025 13:51 UTC
24 points
3 comments · 5 min read · LW link

Can We Change the Goals of a Toy RL Agent?

15 Jun 2025 20:34 UTC
20 points
0 comments · 9 min read · LW link

Sparsity is the enemy of feature extraction (ft. absorption)

3 May 2025 10:13 UTC
32 points
0 comments · 6 min read · LW link

Among Us: A Sandbox for Agentic Deception

5 Apr 2025 6:24 UTC
114 points
7 comments · 7 min read · LW link

A Bunch of Matryoshka SAEs

4 Apr 2025 14:53 UTC
29 points
0 comments · 8 min read · LW link

Feature Hedging: Another way correlated features break SAEs

25 Mar 2025 14:33 UTC
23 points
0 comments · 18 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

7 Feb 2025 3:57 UTC
37 points
0 comments · 10 min read · LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

23 Aug 2024 22:03 UTC
17 points
0 comments · 25 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments · 2 min read · LW link
(arxiv.org)

Compact Proofs of Model Performance via Mechanistic Interpretability

24 Jun 2024 19:27 UTC
104 points
4 comments · 8 min read · LW link
(arxiv.org)