Interesting! This way of finding a desirable dictionary size and sparsity is fascinating. Also, it’s intriguing that the MDL incentivizes SAEs to generate hierarchical features rather than feature splitting.
I have some questions regarding the upper-bound DL computation:
One-hot encodings: At the sparse extreme, our dictionary could have a row for each neural activation in the dataset, so L0 = 1 and D = (vocab_size)^(seq_len). GPT-2 has a vocab size of 50,257 and the SAEs are trained on 128-token sequences. Altogether this gives DL = 13,993 bits per token.
I can easily compute the above two values following your instructions; however, I'm having trouble reproducing the 13,993-bit value, or perhaps I've missed something. My calculation, 128 · log2(50,257), gives about 1998.98 bits. Could you please clarify which part of my calculation is incorrect?
Another question is why the sequence length is considered in the extreme-sparsity example; it seems to enumerate all possible token sequences. Is this intended for a fair comparison, since the two examples above consider sequence context within relatively dense vectors?
Hi Seonglae, glad you enjoyed the post!
Yes, this is correct. We also multiplied the 1,999-bit figure by 7 to account for the number of bits in a float (we assumed 8-bit floats, but without a sign bit, since SAE feature magnitudes are always positive, which leaves 7 bits).
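For concreteness, a minimal sketch of that arithmetic (using only the numbers stated above):

```python
import math

vocab_size = 50_257  # GPT-2 vocabulary size
seq_len = 128        # tokens per training sequence
float_bits = 7       # 8-bit floats with no sign bit, since feature magnitudes are non-negative

# Bits needed to select one of the vocab_size^seq_len one-hot dictionary rows
index_bits = seq_len * math.log2(vocab_size)  # ~1998.98

# Multiply by the bits per float, as described above
dl_per_token = float_bits * index_bits        # ~13,993 bits per token

print(round(index_bits, 2), round(dl_per_token))
```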
It could be argued that in this case we might not want to think of features as scalars (i.e. float-valued) and should instead use the numbers as you describe them above. In that case, note that the value still exceeds the typical description length from the SAEs (1,405 bits). This is mostly an illustrative example, since it assumes features are uniformly distributed for the sake of exposition; in practice we might expect the SAEs to perform even better, because we can exploit the fact that some features are much more common than others.
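To illustrate that last point, here is a minimal sketch with a made-up (purely hypothetical) feature-frequency distribution, showing how a frequency-aware code can beat the uniform log2(D) cost per active feature:

```python
import math

D = 1024  # hypothetical dictionary size
uniform_bits = math.log2(D)  # per-feature cost if every feature were equally likely

# Hypothetical skew: four very common features plus a long uniform tail of rare ones
probs = [0.2] * 4 + [0.2 / (D - 4)] * (D - 4)
entropy_bits = -sum(p * math.log2(p) for p in probs)  # optimal average per-feature cost

print(f"uniform: {uniform_bits:.2f} bits, frequency-aware: {entropy_bits:.2f} bits")
```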
Thanks for your comment!
Thank you for your answer. I now understand the extreme case and the illustrative example: the 1,999-bit value comes from the bits needed to specify each token, as you mentioned, and even that already exceeds the typical DL.
More importantly, optimizing the tradeoff between sparsity and description length is like solving a convex optimization problem. It would be great to formalize this relationship and observe the trend between sparsity (x-axis) and DL (y-axis), although I have no specific approach in mind. My intuition is that the MDL might serve as a lower bound, with the overall behavior being approximated by the dominant factor’s information in each regime.
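If it helps, here is how I would imagine sketching that curve, assuming one had a family of SAEs that all reach comparable reconstruction loss. The (L0, dictionary size) operating points below are hypothetical placeholders, and the per-token cost L0 · (log2(D) + float_bits) is my own assumption about the cost model, not necessarily the post's exact definition:

```python
import math

FLOAT_BITS = 7  # magnitude bits per active feature, as in the reply above

def description_length(l0: int, dict_size: int, float_bits: int = FLOAT_BITS) -> float:
    """Assumed per-token DL: index bits plus magnitude bits for each active feature."""
    return l0 * (math.log2(dict_size) + float_bits)

# Hypothetical (L0, dictionary size) pairs, standing in for SAEs at comparable reconstruction loss
operating_points = [(4, 262_144), (16, 32_768), (64, 8_192), (128, 4_096)]

for l0, dict_size in operating_points:
    print(f"L0={l0:4d}  D={dict_size:7d}  DL ≈ {description_length(l0, dict_size):7.1f} bits/token")
```

Plotting DL against L0 for such points would give the sparsity-vs-DL curve described above.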