
Adrià Garriga-alonso

Karma: 1,751

How much superposition is there?

18 Feb 2026 13:53 UTC
25 points
0 comments · 3 min read · LW link

Circuit discovery through chain of thought using policy gradients

Adrià Garriga-alonso · 29 Nov 2025 7:56 UTC
15 points
2 comments · 4 min read · LW link

Will We Get Alignment by Default? — with Adrià Garriga-Alonso

27 Nov 2025 19:19 UTC
50 points
3 comments · 1 min read · LW link
(simonlermen.substack.com)

Spatially distributed consciousness is not an abstract thought experiment if AI is conscious

Adrià Garriga-alonso · 26 Nov 2025 4:59 UTC
17 points
1 comment · 4 min read · LW link
(open.substack.com)

Alignment will happen by default. What’s next?

Adrià Garriga-alonso · 25 Nov 2025 7:14 UTC
99 points
133 comments · 6 min read · LW link
(open.substack.com)

Infinitesimally False

21 Nov 2025 4:57 UTC
52 points
16 comments · 12 min read · LW link

A scheme to credit hack policy gradient training

Adrià Garriga-alonso · 7 Nov 2025 6:24 UTC
15 points
0 comments · 5 min read · LW link

Anthropic’s JumpReLU training method is really good

3 Oct 2025 15:23 UTC
51 points
2 comments · 2 min read · LW link

A recurrent CNN finds maze paths by filling dead-ends

Adrià Garriga-alonso · 15 Sep 2025 20:49 UTC
19 points
0 comments · 2 min read · LW link

The “Sparsity vs Reconstruction Tradeoff” Illusion

26 Aug 2025 4:39 UTC
21 points
0 comments · 4 min read · LW link

L0 is not a neutral hyperparameter

19 Jul 2025 13:51 UTC
24 points
3 comments · 5 min read · LW link

Can We Change the Goals of a Toy RL Agent?

15 Jun 2025 20:34 UTC
20 points
0 comments · 9 min read · LW link

Sparsity is the enemy of feature extraction (ft. absorption)

3 May 2025 10:13 UTC
32 points
0 comments · 6 min read · LW link

Among Us: A Sandbox for Agentic Deception

5 Apr 2025 6:24 UTC
114 points
7 comments · 7 min read · LW link

A Bunch of Matryoshka SAEs

4 Apr 2025 14:53 UTC
29 points
0 comments · 8 min read · LW link

Feature Hedging: Another way correlated features break SAEs

25 Mar 2025 14:33 UTC
23 points
0 comments · 18 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

7 Feb 2025 3:57 UTC
37 points
0 comments · 10 min read · LW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

23 Aug 2024 22:03 UTC
17 points
0 comments · 25 min read · LW link

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

25 Jul 2024 22:00 UTC
59 points
8 comments · 2 min read · LW link
(arxiv.org)

Compact Proofs of Model Performance via Mechanistic Interpretability

24 Jun 2024 19:27 UTC
104 points
4 comments · 8 min read · LW link
(arxiv.org)