StefanHex
Karma: 546
Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability.
CNN feature visualization in 50 lines of code
StefanHex · 26 May 2022 11:02 UTC · 17 points · 4 comments · 5 min read
Research Questions from Stained Glass Windows
StefanHex · 8 Jun 2022 12:38 UTC · 4 points · 0 comments · 2 min read
Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?
StefanHex and Julian_R · 25 Oct 2022 20:48 UTC · 14 points · 2 comments · 4 min read
How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!
StefanHex · 24 Jan 2023 18:45 UTC · 47 points · 5 comments · 13 min read
A circuit for Python docstrings in a 4-layer attention-only transformer
StefanHex and Jett · 20 Feb 2023 19:35 UTC · 91 points · 6 comments · 21 min read
Residual stream norms grow exponentially over the forward pass
StefanHex and TurnTrout · 7 May 2023 0:46 UTC · 72 points · 24 comments · 11 min read
Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1
StefanHex and Marius Hobbhahn · 9 May 2023 19:41 UTC · 119 points · 1 comment · 10 min read
Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2
StefanHex and Marius Hobbhahn · 25 May 2023 15:37 UTC · 71 points · 1 comment · 13 min read
How to use and interpret activation patching
StefanHex and Neel Nanda · 24 Apr 2024 8:35 UTC · 10 points · 0 comments · 18 min read