Georg Lange

Karma: 126

Automating Interpretability with Agents

Jack Payne, ksena and Georg Lange

1 May 2026 2:59 UTC

10 points

0 comments, 10 min read, LW link

Georg Lange 4 Feb 2026 23:02 UTC
2 points
in reply to: Ziqian Zhong’s comment on: Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits
Thanks for your comment, I really appreciate it!
(a) I claim we cannot construct a faithful one-layer CLT because the real circuit contains two separate steps! What we can do (and we kinda did) is build a one-layer CLT that achieves perfect reconstruction on every input. But the point of a CLT is NOT perfect reconstruction; it is to teach us how the original model solved the task.
(b) I think that’s the point of our post! CLTs miss middle-layer features, and this is not a one-off error: the sparsity loss incentivizes solutions that miss middle-layer features. We are working on adding more case studies to make this claim more robust.

Georg Lange 4 Feb 2026 22:55 UTC
1 point
in reply to: jacob_drori’s comment on: Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits
Thanks for your thoughtful comment! I think that’s an interesting idea. If a replacement model is faithful, you would indeed expect it to behave similarly when tested on other data. This may connect to a related finding that we can’t yet explain: CLTs fail on memorized sequences, even though those should be the simplest case. Maybe the OOD test is something that we could do in real LLMs as well, not just toy models.
I don’t think I’d go as far as saying “faithfulness==OOD generalization” though. I think it is possible for two different circuits to produce the same behavior, even on OOD data. For example, let models A and B both learn to multiply arbitrary numbers perfectly. There are different ways to compute x * y: add x to itself y times, do long multiplication like we learned in school, etc. In those cases, OOD examples would still yield identical outputs even though the internal circuits differ. So I’d say that faithfulness is a stronger claim than OOD generalization.
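To make the multiplication example concrete, here is a minimal Python sketch. The two functions and the step counter are hypothetical illustrations, not anything from the post: they stand in for "model A" and "model B", agree on every input, and still take very different internal paths.

```python
def multiply_repeated_addition(x: int, y: int) -> tuple[int, int]:
    """Return (product, additions performed) by adding x to itself y times.

    Assumes non-negative integers.
    """
    total, steps = 0, 0
    for _ in range(y):
        total += x
        steps += 1
    return total, steps


def multiply_schoolbook(x: int, y: int) -> tuple[int, int]:
    """Return (product, additions performed) via base-10 long multiplication.

    Assumes non-negative integers. One partial product is added per digit of y.
    """
    total, steps, place = 0, 0, 1
    while y > 0:
        y, digit = divmod(y, 10)      # peel off the lowest digit of y
        total += x * digit * place    # add the partial product for that digit
        steps += 1
        place *= 10
    return total, steps


product_a, steps_a = multiply_repeated_addition(12, 345)
product_b, steps_b = multiply_schoolbook(12, 345)
print(product_a == product_b)  # same output
print(steps_a, steps_b)        # different amount of internal work
```

Checking outputs alone, on any input distribution, cannot tell these two apart; only looking at the intermediate computation does.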

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits

Georg Lange, RGRGRG, Kat Dearstyne and Kamal Maher

2 Feb 2026 21:32 UTC

46 points

6 comments, 18 min read, LW link

SAEs Discover Meaningful Features in the IOI Task

Alex Makelov, Georg Lange and Neel Nanda

5 Jun 2024 23:48 UTC

15 points

2 comments, 10 min read, LW link

Georg Lange 19 Mar 2024 17:53 UTC
LW: 3 AF: 2
on: Some costs of superposition
Calculating l, the maximal number of simultaneously active features, yields strange results. For example, if we have 100 features and 100 neurons, l has to be < 100/(8 * ln(100)) ≈ 2.7. But I would expect that all 100 features can be simultaneously active because we have 100 dimensions, so the features can be orthogonal and independent. Am I misunderstanding something?
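The arithmetic in this comment can be checked with a short Python sketch. The function name and the assignment of "neurons" to the numerator and "features" to the logarithm are assumptions made for illustration; the comment's example uses 100 for both, so the result does not depend on that choice.

```python
import math


def max_active_features(n_neurons: int, n_features: int) -> float:
    """Evaluate the bound quoted in the comment: n / (8 * ln(m)).

    Assumption: n is the number of neurons and m the number of features.
    """
    return n_neurons / (8 * math.log(n_features))


bound = max_active_features(100, 100)
print(round(bound, 1))
```

With both values set to 100, the bound comes out a little above 2.7, far below the 100 simultaneously active features that an orthogonal basis would allow, which is the discrepancy the comment asks about.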

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces

Georg Lange, Alex Makelov and Neel Nanda

29 Aug 2023 1:04 UTC

77 points

4 comments, 1 min read, LW link