StefanHex

Karma: 546

Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability.

CNN feature visualization in 50 lines of code

StefanHex

26 May 2022 11:02 UTC

17 points

4 comments, 5 min read

Research Questions from Stained Glass Windows

StefanHex

8 Jun 2022 12:38 UTC

4 points

0 comments, 2 min read

Reinforcement Learning Goal Misgeneralization: Can we guess what kind of goals are selected by default?

StefanHex and Julian_R

25 Oct 2022 20:48 UTC

14 points

2 comments, 4 min read

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

StefanHex

24 Jan 2023 18:45 UTC

47 points

5 comments, 13 min read

A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex and Jett

20 Feb 2023 19:35 UTC

91 points

6 comments, 21 min read

Residual stream norms grow exponentially over the forward pass

StefanHex and TurnTrout

7 May 2023 0:46 UTC

72 points

24 comments, 11 min read

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

StefanHex and Marius Hobbhahn

9 May 2023 19:41 UTC

119 points

1 comment, 10 min read

Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

StefanHex and Marius Hobbhahn

25 May 2023 15:37 UTC

71 points

1 comment, 13 min read

How to use and interpret activation patching

StefanHex and Neel Nanda

24 Apr 2024 8:35 UTC

10 points

0 comments, 18 min read