jake_mendel
Karma: 344
Interpretability Researcher at Apollo Research
Interpretability: Integrated Gradients is a decent attribution method
Lucius Bushnaq, jake_mendel, StefanHex and Kaarel
20 May 2024 17:55 UTC, 14 points, 7 comments, 6 min read
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache and Marius Hobbhahn
20 May 2024 17:53 UTC, 97 points, 2 comments, 3 min read
A starting point for making sense of task structure (in machine learning)
Kaarel, RP and jake_mendel
24 Feb 2024 1:51 UTC, 39 points, 2 comments, 12 min read
Toward A Mathematical Framework for Computation in Superposition
Dmitry Vaintrob, jake_mendel and Kaarel
18 Jan 2024 21:06 UTC, 184 points, 17 comments, 73 min read