
StefanHex

Karma: 1,955

Stefan Heimersheim. Interpretability researcher at FAR.AI, previously Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

Activation Plateaus: Where and How They Emerge

17 Oct 2025 5:48 UTC
24 points
0 comments · 8 min read · LW link

Transformers Don't Need LayerNorm at Inference Time: Implications for Interpretability

23 Jul 2025 14:55 UTC
31 points
0 comments · 7 min read · LW link

Trusted monitoring, but with deception probes.

23 Jul 2025 5:26 UTC
31 points
0 comments · 4 min read · LW link
(arxiv.org)

Compressed Computation is (probably) not Computation in Superposition

23 Jun 2025 19:35 UTC
56 points
9 comments · 10 min read · LW link

Try training token-level probes

StefanHex · 14 Apr 2025 11:56 UTC
47 points
6 comments · 8 min read · LW link

Proof-of-Concept Debugger for a Small LLM

17 Mar 2025 22:27 UTC
27 points
0 comments · 11 min read · LW link

Detecting Strategic Deception Using Linear Probes

6 Feb 2025 15:46 UTC
104 points
9 comments · 2 min read · LW link
(arxiv.org)

SAE regularization produces more interpretable models

28 Jan 2025 20:02 UTC
21 points
7 comments · 4 min read · LW link

Attribution-based parameter decomposition

25 Jan 2025 13:12 UTC
108 points
21 comments · 4 min read · LW link
(publications.apolloresearch.ai)

Analyzing how SAE features evolve across a forward pass

7 Nov 2024 22:07 UTC
47 points
0 comments · 1 min read · LW link
(arxiv.org)

Characterizing stable regions in the residual stream of LLMs

26 Sep 2024 13:44 UTC
43 points
4 comments · 1 min read · LW link
(arxiv.org)

Evaluating Synthetic Activations composed of SAE Latents in GPT-2

25 Sep 2024 20:37 UTC
30 points
0 comments · 3 min read · LW link
(arxiv.org)

Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

6 Sep 2024 2:28 UTC
28 points
0 comments · 12 min read · LW link

You can remove GPT2's LayerNorm by fine-tuning for an hour

StefanHex · 8 Aug 2024 18:33 UTC
166 points
11 comments · 8 min read · LW link

A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team

18 Jul 2024 14:15 UTC
124 points
18 comments · 18 min read · LW link

[Interim research report] Activation plateaus & sensitive directions in GPT2

5 Jul 2024 17:05 UTC
65 points
2 comments · 5 min read · LW link

StefanHex's Shortform

StefanHex · 5 Jul 2024 14:31 UTC
5 points
81 comments · 1 min read · LW link

Apollo Research 1-year update

29 May 2024 17:44 UTC
93 points
0 comments · 7 min read · LW link

Interpretability: Integrated Gradients is a decent attribution method

20 May 2024 17:55 UTC
23 points
7 comments · 6 min read · LW link

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

20 May 2024 17:53 UTC
108 points
4 comments · 3 min read · LW link