jake_mendel
Karma: 1,260
technical AI safety program associate at OpenPhil
Research directions Open Phil wants to fund in technical AI safety
jake_mendel, maxnadeau and Peter Favaloro
8 Feb 2025 1:40 UTC
117 points, 21 comments, 58 min read
(www.openphilanthropy.org)

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas
jake_mendel, maxnadeau and Peter Favaloro
6 Feb 2025 18:58 UTC
111 points, 0 comments, 1 min read
(www.openphilanthropy.org)

Attribution-based parameter decomposition
Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel and Lee Sharkey
25 Jan 2025 13:12 UTC
108 points, 22 comments, 4 min read
(publications.apolloresearch.ai)

Circuits in Superposition: Compressing many small neural networks into one
Lucius Bushnaq and jake_mendel
14 Oct 2024 13:06 UTC
131 points, 9 comments, 13 min read

jake_mendel’s Shortform
jake_mendel
19 Sep 2024 10:37 UTC
5 points, 3 comments

[Interim research report] Activation plateaus & sensitive directions in GPT2
StefanHex and jake_mendel
5 Jul 2024 17:05 UTC
65 points, 2 comments, 5 min read

SAE feature geometry is outside the superposition hypothesis
jake_mendel
24 Jun 2024 16:07 UTC
229 points, 17 comments, 11 min read

Apollo Research 1-year update
Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke and rusheb
29 May 2024 17:44 UTC
93 points, 0 comments, 7 min read

Interpretability: Integrated Gradients is a decent attribution method
Lucius Bushnaq, jake_mendel, StefanHex and Kaarel
20 May 2024 17:55 UTC
23 points, 7 comments, 6 min read

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache and Marius Hobbhahn
20 May 2024 17:53 UTC
108 points, 4 comments, 3 min read

A starting point for making sense of task structure (in machine learning)
Kaarel, RP and jake_mendel
24 Feb 2024 1:51 UTC
45 points, 2 comments, 12 min read

Toward A Mathematical Framework for Computation in Superposition
Dmitry Vaintrob, jake_mendel and Kaarel
18 Jan 2024 21:06 UTC
206 points, 19 comments, 63 min read