Can

Karma: 260

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

11 Dec 2024 6:30 UTC

82 points

6 comments · 2 min read · LW link

(www.neuronpedia.org)

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell

2 Aug 2024 19:50 UTC

38 points

1 comment · 9 min read · LW link

OthelloGPT learned a bag of heuristics

jylin04, JackS, Adam Karvonen and Can

2 Jul 2024 9:12 UTC

111 points

10 comments · 9 min read · LW link

Past Tense Features

Can

20 Apr 2024 14:34 UTC

12 points

0 comments · 4 min read · LW link

An adversarial example for Direct Logit Attribution: memory management in gelu-4l

Can, Yeu-Tong Lau, James Dao and Jett Janiak

30 Aug 2023 17:36 UTC

17 points

0 comments · 8 min read · LW link

(arxiv.org)

Can 13 May 2023 12:39 UTC
2 points
in reply to: TurnTrout’s comment on: Understanding mesa-optimization using toy models
Thanks for pointing this out – indeed, our phrasing is quite unclear. The original paragraph was trying to say that our “system” (a transformer trained to find shortest paths via SGD) may learn “alternative objectives” which don’t generalize (aren’t “desirable” from our perspective), but which achieve the same loss (are “rewarding”).
To be clear, the point we want to make here is that models capable of performing search are relevant for understanding mesa-optimization, as search requires iterative reasoning with subgoal evaluation.
In the context of solving mazes, we may hope to understand how mesa-optimization arises and can become “misaligned”, either through the formation of non-general reasoning steps (reliance on heuristics or overfitted goals) or through a failure to retarget.
Concretely, we can imagine the network learning to reach the <END_TOKEN> at train time, but failing to generalise at test time as it has instead learnt a goal that was an artefact of our training process. For example, it may have learnt to go to the top right corner (where the <END_TOKEN> happened to be during training).

Understanding mesa-optimization using toy models

tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can

7 May 2023 17:00 UTC

46 points

6 comments · 10 min read · LW link

Safety of Self-Assembled Neuromorphic Hardware

Can

26 Dec 2022 18:51 UTC

16 points

2 comments · 10 min read · LW link

(forum.effectivealtruism.org)