SAE Feature Matchmaking (Layer-to-Layer)

Last week I read Mechanistic Permutability: Match Features Across Layers, an interesting paper on matching features detected with Sparse Autoencoders (SAEs) across multiple layers of a Transformer neural network.

In this paper, the authors study the problem of aligning SAE-extracted features across multiple layers of a neural network without having input data to trace. They work from the assumption that features in different layers are similar but permuted, so they look for a permutation matrix that aligns the features of one layer A with those of another layer B.
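If I write the decoder weight matrices of the SAEs at the two layers as W_dec^(A) and W_dec^(B), the objective looks roughly like this (the notation here is my own shorthand, not necessarily the paper's exact formulation):

```latex
% Find the permutation matrix P, over the set of permutation
% matrices \Pi, that best aligns layer B's decoder directions
% with layer A's (my notation, a sketch of the idea).
\min_{P \in \Pi} \left\lVert W^{(A)}_{\mathrm{dec}} - P\, W^{(B)}_{\mathrm{dec}} \right\rVert_F^2
```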

I had a high-level understanding of polysemanticity and feature superposition. Mechanistic interpretability researchers face a huge challenge when figuring out which feature a particular neuron represents, because a single neuron may activate in the presence of two or more unrelated features that don't even occur in the same input example.

One famous example of polysemanticity is a neuron in a model that activated both when an input image contained a car and when it contained a cat.

To get around this issue, the researchers in the paper used Sparse Autoencoders to extract features from individual hidden layers. Here they faced another issue: tracking the same features across multiple layers. They argue that solving this problem is important for fully understanding how features evolve throughout the model.
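For anyone who hasn't seen one, here is a minimal sketch of what an SAE looks like. This is my own simplified version, not the paper's exact architecture (the paper works with JumpReLU SAEs; I'm using a plain ReLU for brevity):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: expand a d_model-dim activation into
    n_features (sparse) feature activations, then reconstruct."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Linear(d_model, n_features)
        self.W_dec = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.W_enc(x))   # sparse feature activations
        x_hat = self.W_dec(f)           # reconstruction of the input
        return x_hat, f
```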

The researchers created SAE Match, a data-free method for matching SAE features across multiple layers. In other words, without any input, they can track the evolution of features throughout the layers of a neural network.

They also use a technique called parameter folding, which folds the activation threshold into the encoder and decoder weights to account for differences in feature scales. This improves feature matching quality significantly.
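As I understand it (this is my own sketch with made-up names, so treat it as a guess at how such a folding could work for a JumpReLU-style SAE), dividing each feature's encoder row by its threshold and multiplying its decoder row by the same amount leaves the reconstruction unchanged while normalizing every threshold to one:

```python
import torch

def fold_threshold(W_enc, b_enc, W_dec, theta):
    """Fold per-feature thresholds into the weights (my sketch).

    W_enc: (n_features, d_model), b_enc: (n_features,)
    W_dec: (n_features, d_model), theta: (n_features,) positive thresholds.

    Scaling a feature's pre-activation by 1/theta and its decoder
    row by theta preserves the reconstruction, but all folded
    thresholds become 1, so decoder rows sit on comparable scales
    before matching.
    """
    scale = theta.unsqueeze(1)            # (n_features, 1)
    W_enc_folded = W_enc / scale
    b_enc_folded = b_enc / theta
    W_dec_folded = W_dec * scale
    theta_folded = torch.ones_like(theta)
    return W_enc_folded, b_enc_folded, W_dec_folded, theta_folded
```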

To do this, they look for a permutation matrix that minimizes the mean squared error (MSE) between the decoder weights of the two layers, treating the matching problem as a Linear Assignment Problem. This addresses the problem statement by enabling us to understand cross-layer hidden states (I'm still confused about how to understand this concept) and also how features move across layers.
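Concretely, I think of it like this (a minimal sketch using the folded decoder matrices from above; the cost construction is my guess at the simplest version, not necessarily the paper's exact one):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_features(W_dec_a: np.ndarray, W_dec_b: np.ndarray) -> np.ndarray:
    """Match layer A's features to layer B's via a Linear Assignment Problem.

    W_dec_a, W_dec_b: (n_features, d_model) decoder weight matrices.
    Returns perm such that feature i in layer A matches feature perm[i] in B.
    """
    # Pairwise squared distances between decoder rows:
    # cost[i, j] = ||W_dec_a[i] - W_dec_b[j]||^2, expanded to avoid loops.
    cost = (
        (W_dec_a ** 2).sum(axis=1, keepdims=True)
        - 2.0 * W_dec_a @ W_dec_b.T
        + (W_dec_b ** 2).sum(axis=1)
    )
    row_ind, perm = linear_sum_assignment(cost)  # Hungarian-style solver
    return perm
```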

Some promising future directions might include researching non-bijective matching. The paper currently uses one-to-one matching between feature neurons, but it would be worth exploring methods that can handle features without a clean one-to-one counterpart in the other layer.

We could also conduct a more precise analysis of how the different weights within the module behave, since the current scheme uses all of the weights.

Here is a link to my (rough) code implementation using the nnsight library from NDIF.
