Kola Ayonrinde

Karma: 147

The Strange Science of Interpretability: Recent Papers and a Reading List for the Philosophy of Interpretability

Kola Ayonrinde 17 Aug 2025 23:38 UTC

29 points

(arxiv.org)

Kola Ayonrinde 19 Mar 2025 13:22 UTC
2 points
0
in reply to: Seonglae Cho’s comment on: Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations
Hi Seonglae, glad you enjoyed the post!

Yes, this is correct. We also multiplied the 1999 number by 7 to represent the number of bits in a float (we assumed 8-bit floats but dropped the sign bit, since SAE feature magnitudes are always positive, which gives 7 bits).

It could be argued that in this case we might not want to think of features as scalars (i.e. float-valued) and should instead use the numbers as you describe them above. In that case, note that the value still exceeds the typical description length from the SAEs (1405 bits). This is mostly an illustrative example: it assumes features are uniformly distributed for ease of exposition. In practice we might expect the SAEs to perform even better, as we are able to exploit the fact that some features are much more common than others.
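
As a rough worked version of the arithmetic in this reply, here is a minimal Python sketch. The constants 1999, 7 and 1405 come from the text above; the names `RAW_FIGURE`, `BITS_PER_MAGNITUDE`, `SAE_DESCRIPTION_LENGTH_BITS` and `entropy_bits`, and the four-feature toy distributions, are assumptions made purely for illustration and are not taken from the post's code.

```python
import math

# Figures quoted in the reply above.
RAW_FIGURE = 1999                   # the "1999 number" before scaling
BITS_PER_MAGNITUDE = 7              # 8-bit float minus the unused sign bit
SAE_DESCRIPTION_LENGTH_BITS = 1405  # typical SAE description length

# Scaling the raw figure by the bits needed for each float magnitude.
scaled_figure = RAW_FIGURE * BITS_PER_MAGNITUDE

# Either way, the figure exceeds the SAE description length.
print(scaled_figure, scaled_figure > SAE_DESCRIPTION_LENGTH_BITS)
print(RAW_FIGURE, RAW_FIGURE > SAE_DESCRIPTION_LENGTH_BITS)


def entropy_bits(probs):
    """Shannon entropy in bits: the average code length an ideal
    entropy coder needs per symbol drawn from `probs`."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Toy illustration (made-up distributions) of why non-uniform feature
# frequencies help: a skewed distribution costs fewer bits per feature
# index than the uniform one assumed for exposition.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.5, 0.25, 0.125, 0.125]
print(entropy_bits(uniform), entropy_bits(skewed))
```

The toy comparison is only meant to show the direction of the effect: when some features fire far more often than others, their indices can be given shorter codes, so the uniform assumption gives a conservative estimate.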

Thanks for your comment!

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

11 Dec 2024 6:30 UTC

82 points

6 comments · 2 min read · LW link

(www.neuronpedia.org)

Kola Ayonrinde 7 Nov 2024 19:48 UTC
3 points
0
in reply to: micahcarroll’s comment on: Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
Ahh sorry, I think I made this comment on an early draft of this post and didn’t realise it would make it into the published version! I totally agree with you; I made the above comment in the hope that this point would be made clearer in later drafts, which I think it has been!

It looks like I can’t delete a comment which has a reply so I’ll add a note to reflect this.

Anyways, loved the paper—very cool research!

Standard SAEs Might Be Incoherent: A Choosing Problem & A “Concise” Solution

Kola Ayonrinde 30 Oct 2024 22:50 UTC

27 points

0 comments · 12 min read · LW link

Kola Ayonrinde 15 Oct 2024 22:13 UTC
3 points
2
on: Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
“can lead”
Is it that it can lead to this, or that it reliably does in your experiments?
[EDIT: This comment was intended as feedback on an early draft of this post (which is why it is dated before the post was published) and was not meant for the final version.]

Kola Ayonrinde 19 Sep 2024 20:35 UTC
2 points
0
in reply to: Jacob Dunefsky’s comment on: Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations
Yeah, we hope others take on this approach too!
“have you considered quantizing different features’ activations differently?”
Stay tuned for our upcoming work 👀
“do the rate-distortion curves of different SAEs intersect? I.e. is it the case that some SAE A achieves a lower loss than SAE B at a low bitrate, but then at a high bitrate, SAE B is better than SAE A? If so, then this might suggest a way to infer hierarchies of features from a set of SAEs: use SAE A to get low-resolution information about your input, and then use SAE B for the high-res detailed information.”
This is an interesting perspective. My initial hypothesis before reading your comment was that allowing variable bitrates for a single SAE would get around this issue, but I agree that this would be interesting to test, and it is something we should definitely check!
With the constant bit-rate version, I do expect that we would see something like this, though we haven’t rigorously tested that hypothesis.
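
One way the crossing question could be checked, as a minimal sketch: evaluate both SAEs at the same set of bitrates and look for a sign change in the loss difference. The function name `crossing_intervals` and every number below are hypothetical, made up for illustration, and are not measurements from any SAE.

```python
def crossing_intervals(bitrates, loss_a, loss_b):
    """Return the (low, high) bitrate intervals in which the ordering of
    two rate-distortion curves flips, i.e. loss_a - loss_b changes sign."""
    diffs = [a - b for a, b in zip(loss_a, loss_b)]
    crossings = []
    for i in range(len(diffs) - 1):
        # A strictly negative product means the curves swapped order
        # somewhere between these two sampled bitrates.
        if diffs[i] * diffs[i + 1] < 0:
            crossings.append((bitrates[i], bitrates[i + 1]))
    return crossings


# Made-up curves: SAE A is better at low bitrates, SAE B at high ones.
bitrates = [500, 1000, 1500, 2000]
loss_sae_a = [0.40, 0.30, 0.27, 0.26]
loss_sae_b = [0.55, 0.33, 0.24, 0.20]

print(crossing_intervals(bitrates, loss_sae_a, loss_sae_b))
```

With these toy numbers the ordering flips once, between 1000 and 1500 bits. An empty list would mean one SAE dominates the other across the sampled range.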
I know that others are keen to have a suite of SAEs at different resolutions; my (possibly controversial) instinct is that we should be looking for a single SAE which we feel appropriately captures the properties we want. Then, if we want something more coarse-grained for a different level of analysis, maybe we should use a nice hierarchical representation within a single SAE (as above)… Or maybe we should switch to Representation Engineering, or work at an even more coarse-grained level, such as heads. Perhaps SAEs don’t have to be all things to all people!
I’d be interested to hear opposing views arguing that we really do want many SAEs at different resolutions, though.*
Thanks for your questions and thoughts. We’re really interested in pushing this further and will hopefully have some follow-up work in the not-too-distant future.
EDIT: *I suspect part of the reason that people want different levels of SAEs is that they accept undesirable feature splitting as a fact of life, and so want to be able to zoom in and out of features which may not be “atomic”. I’m hoping that if we can address the feature splitting problem, then at least that reason may have somewhat less pull.

Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

Kola Ayonrinde, Michael Pearce and Lee Sharkey

23 Aug 2024 18:52 UTC

43 points

8 comments · 16 min read · LW link
