Hi Seonglae, glad you enjoyed the post!
Yes, this is correct. We also multiplied the 1999 number by 7 to represent the number of bits in a float (we assumed 8-bit floats, but since SAE feature magnitudes are always positive there is no need for a sign bit, which leaves 7 bits).
It could be argued that in this case we might not want to think of features as scalars (i.e. float-valued) and should instead use the numbers as you describe them above. Note that even then the value still exceeds the typical description length from the SAEs (1405 bits). This is mostly an illustrative example, as it assumes features are uniformly distributed for the sake of exposition; in practice we might expect the SAEs to perform even better, since they can exploit the fact that some features are much more common than others.
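For concreteness, here is a minimal arithmetic sketch of the comparison using only the numbers quoted in this thread (the exact accounting in the post may differ):

```python
# Description-length comparison from the thread's numbers (illustrative only).
FLOAT_BITS = 7               # 8-bit floats minus the sign bit (magnitudes are non-negative)
BINARY_DECISION_BITS = 1999  # one binary decision per token under the uniform assumption
SAE_DL = 1405                # typical SAE description length quoted above, in bits

# Scalar-feature view: each decision also carries a 7-bit magnitude.
scalar_dl = BINARY_DECISION_BITS * FLOAT_BITS  # 13993 bits

# Binary-feature view: keep only the decisions, drop the magnitudes.
binary_dl = BINARY_DECISION_BITS               # 1999 bits

# Either way, the uniform baseline exceeds the SAE description length.
assert scalar_dl > SAE_DL and binary_dl > SAE_DL
print(scalar_dl, binary_dl, SAE_DL)
```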
Thanks for your comment!
Thank you for your answer. I now understand the extreme case and the illustrative example: the 1999-bit value was derived from binary decisions for each token, as you mentioned, and it exceeds the typical DL.
More importantly, optimizing the tradeoff between sparsity and description length looks like solving a convex optimization problem. It would be great to formalize this relationship and plot the trend between sparsity (x-axis) and DL (y-axis), although I have no specific approach in mind. My intuition is that the MDL might serve as a lower bound, with the overall behavior in each regime approximated by the information cost of the dominant factor.
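As a purely illustrative sketch of the kind of curve this would trace out, one could plot a toy DL(k) with an invented convex proxy: index bits per token that shrink as the code gets denser, plus magnitude bits that grow linearly with the number of active features. Every functional form below is a made-up assumption, not anything from the post:

```python
# Toy sparsity-vs-DL curve; all functional forms here are invented
# purely to visualize the convexity intuition, not taken from the post.
import math

FLOAT_BITS = 7  # magnitude bits, as in the example above

def toy_dl(k: int, index_budget: float = 512.0) -> float:
    """Hypothetical bits per token at sparsity k (number of active features)."""
    index_bits = index_budget / k    # made-up: denser codes get cheaper indexing
    magnitude_bits = k * FLOAT_BITS  # one 7-bit magnitude per active feature
    return index_bits + magnitude_bits

for k in [1, 2, 4, 8, 16, 32]:
    print(f"k={k:2d}  DL ~ {toy_dl(k):6.1f} bits")

# DL(k) = a/k + b*k is convex with its minimum at k = sqrt(a/b) ~ 8.6 here,
# matching the MDL-as-lower-bound picture described above.
print("argmin k ~", math.sqrt(512.0 / FLOAT_BITS))
```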