Adrià Garriga-alonso

Karma: 1,230

Adrià Garriga-alonso May 5, 2025, 8:00 PM
10 points
4
in reply to: kilgoar’s comment on: Playing in the Creek
I think it’s still true that a lot of individual AI researchers are motivated by something approximating play (they call it “solving interesting problems”), even if the entity we call Anthropic is not. These researchers are in Anthropic and outside of it. This applies to all companies of course.

Sparsity is the enemy of feature extraction (ft. absorption)

7vik, chanind and Adrià Garriga-alonso

May 3, 2025, 10:13 AM

32 points

0 comments6 min readLW link

Adrià Garriga-alonso May 1, 2025, 12:59 PM
2 points
0
in reply to: Tachikoma’s comment on: Among Us: A Sandbox for Agentic Deception
Agreed that likely humans would outperform more! At the moment we don’t have a human baseline for AmongUs vs. language models yet, so we wouldn’t be able to tell if it improved, but it’s a good follow-up.

Among Us: A Sandbox for Agentic Deception

7vik and Adrià Garriga-alonso

Apr 5, 2025, 6:24 AM

110 points

7 comments7 min readLW link

A Bunch of Matryoshka SAEs

chanind, TomasD and Adrià Garriga-alonso

Apr 4, 2025, 2:53 PM

25 points

0 comments8 min readLW link

Feature Hedging: Another way correlated features break SAEs

chanind, TomasD and Adrià Garriga-alonso

Mar 25, 2025, 2:33 PM

22 points

0 comments18 min readLW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine

Feb 7, 2025, 3:57 AM

29 points

0 comments10 min readLW link

Crafting Polysemantic Transformer Benchmarks with Known Circuits

Evan Anders and Adrià Garriga-alonso

Aug 23, 2024, 10:03 PM

17 points

0 comments25 min readLW link

Adrià Garriga-alonso Aug 1, 2024, 6:12 PM
LW: 2 AF: 1
0
AF
in reply to: Nathan Helm-Burger’s comment on: Pacing Outside the Box: RNNs Learn to Plan in Sokoban
I’m curious what you mean, but I don’t entirely understand. If you give me a text representation of the level I’ll run it! :) Or you can do so yourself
Here’s the text representation for level 53
```
##########
##########
##########
#######  #
######## #
#   ###.@#
#   $ $$ #
#. #.$   #
#     . ##
##########
```

Adrià Garriga-alonso Jul 26, 2024, 9:07 PM
LW: 1 AF: 1
0
AF
in reply to: Chris_Leong’s comment on: Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Maybe in this case it’s a “confusion” shard? While it seems to be planning and produce optimizing behavior, it’s not clear that it will behave as a utility maximizer.

Adrià Garriga-alonso Jul 26, 2024, 9:06 PM
LW: 2 AF: 1
0
AF
in reply to: Lee Sharkey’s comment on: Pacing Outside the Box: RNNs Learn to Plan in Sokoban
Thank you!! I agree it’s a really good mesa-optimizer candidate, it remains to see now exactly how good. It’s a shame that I only found out about it about a year ago :)

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Adrià Garriga-alonso, taufeeque, AdamGleave and ChengCheng

Jul 25, 2024, 10:00 PM

59 points

8 comments2 min readLW link

(arxiv.org)

Adrià Garriga-alonso Jul 9, 2024, 2:57 PM
3 points
0
on: AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0
Asking for an acquaintance. If I know some graduate-level machine learning, and have read ~most of the recent mechanistic interpretability literature, and have made good progress understanding a small-ish neural network in the last few months.

Is ARENA for me, or will it teach things I mostly already know?

(I advised this person that they already have ARENA-graduate level, but I want to check in case I’m wrong.)

Compact Proofs of Model Performance via Mechanistic Interpretability

LawrenceC, rajashree, Adrià Garriga-alonso and Jason Gross

Jun 24, 2024, 7:27 PM

97 points

4 comments8 min readLW link

(arxiv.org)

Adrià Garriga-alonso May 17, 2024, 10:44 PM
4 points
0
on: Language Models Model Us
How did you feed the data into the model and get predictions? Was there a prompt and then you got the model’s answer? Then you got the logits from the API? What was the prompt?

Catastrophic Goodhart in RL with KL penalty

Thomas Kwa and Adrià Garriga-alonso

May 15, 2024, 12:58 AM

62 points

10 comments7 min readLW link

Adrià Garriga-alonso May 9, 2024, 1:49 AM
4 points
0
on: Why I’m doing PauseAI
Thank you for working on this Joseph!

Adrià Garriga-alonso 19 Apr 2024 2:19 UTC
1 point
0
in reply to: Chakshu Mira’s comment on: Ophiology (or, how the Mamba architecture works)
Thank you! Could you please provide more context? I don’t know what ‘E’ you’re referring to.

An evaluation of circuit evaluation metrics

Iván Arcuschin, Niels uit de Bos and Adrià Garriga-alonso

15 Apr 2024 19:38 UTC

18 points

0 comments4 min readLW link

Ophiology (or, how the Mamba architecture works)

Danielle Ensign, SrGonao and Adrià Garriga-alonso

9 Apr 2024 19:31 UTC

67 points

8 comments10 min readLW link