
StefanHex

Karma: 1,955

Stefan Heimersheim. Interpretability researcher at FAR.AI, previously Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

Activation Plateaus: Where and How They Emerge

17 Oct 2025 5:48 UTC
24 points
0 comments · 8 min read · LW link

Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability

23 Jul 2025 14:55 UTC
31 points
0 comments · 7 min read · LW link

Trusted monitoring, but with deception probes.

23 Jul 2025 5:26 UTC
31 points
0 comments · 4 min read · LW link
(arxiv.org)

Compressed Computation is (probably) not Computation in Superposition

23 Jun 2025 19:35 UTC
56 points
9 comments · 10 min read · LW link

Try training token-level probes

StefanHex · 14 Apr 2025 11:56 UTC
47 points
6 comments · 8 min read · LW link

Proof-of-Concept Debugger for a Small LLM

17 Mar 2025 22:27 UTC
27 points
0 comments · 11 min read · LW link

Detecting Strategic Deception Using Linear Probes

6 Feb 2025 15:46 UTC
104 points
9 comments · 2 min read · LW link
(arxiv.org)

SAE regularization produces more interpretable models

28 Jan 2025 20:02 UTC
21 points
7 comments · 4 min read · LW link

Attribution-based parameter decomposition

25 Jan 2025 13:12 UTC
108 points
21 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

26 Sep 2024 13:44 UTC
43 points
4 comments · 1 min read · LW link
(arxiv.org)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
30 points
0 comments · 3 min read · LW link
(arxiv.org)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link

You can remove GPT2's LayerNorm by fine-tuning for an hour

StefanHex · 8 Aug 2024 18:33 UTC
166 points
11 comments · 8 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team

18 Jul 2024 14:15 UTC
124 points
18 comments · 18 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

5 Jul 2024 17:05 UTC
65 points
2 comments · 5 min read · LW link

StefanHex's Shortform

StefanHex · 5 Jul 2024 14:31 UTC
5 points
81 comments · 1 min read · LW link

Apollo Research 1-year update

29 May 2024 17:44 UTC
93 points
0 comments · 7 min read · LW link

Interpretability: Integrated Gradients is a decent attribution method

20 May 2024 17:55 UTC
23 points
7 comments · 6 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
108 points
4 comments · 3 min read · LW link