Research Report: Sparse Autoencoders find only 9/180 board state features in OthelloGPT

Link post

[3/7 Edit: I have rephrased the bolded claims in the abstract per this comment from Joseph Bloom, hopefully improving the heat-to-light ratio.

Commenters have also suggested training on earlier layers and using untied weights, and in my experiments this increases the number of classifiers found, so the headline number should be 33/180 features, up from 9/180. See this comment for updated results.]

Abstract

A sparse autoencoder is a neural network architecture that has recently gained popularity as a technique for finding interpretable features in language models (Cunningham et al, Anthropic’s Bricken et al). We train a sparse autoencoder on OthelloGPT, a language model trained on transcripts of the board game Othello, which has been shown to contain a linear representation of the board state, findable by supervised probes. The sparse autoencoder finds 9 features which serve as high-accuracy classifiers of the board state, out of 180 findable with supervised probes (and 192 possible piece/position combinations) [edit: 33/180 features, see this comment]. Across random seeds, the autoencoder repeatedly finds “simpler” features concentrated on the center of the board and the corners. This suggests that even if a language model can be interpreted with a human-understandable ontology of interesting, interpretable linear features, a sparse autoencoder might not find a significant fraction of those features.

Introduction

There has been a recent flurry of research activity around Sparse Autoencoders for Dictionary Learning, a new approach to finding interpretable features in language models and potentially “solving superposition” (Sharkey et al, Anthropic’s Bricken et al, Cunningham et al.). But while this technique can find features which are interpretable, it is not yet clear if sparse autoencoders can find particular features of interest (e.g., features relevant to reducing AI risk).

This research report seeks to answer the question of whether sparse autoencoders can find a set of a-priori existing, interesting, and interpretable features in the OthelloGPT language model. OthelloGPT, as the name suggests, is a language model trained on transcripts of the board game Othello to predict legal moves, but was found to also linearly encode the current board state (Nanda, Hazineh et al). That is, for each of the 64 board positions, there are “board-state features” (linear mappings from the residual stream to ℝ^3) that classify the state of that position as [is empty] vs [has active-player’s piece] vs [has enemy’s piece], and these board-state features can be found by supervised training of a linear probe. These board-state features are an exciting testbed for sparse autoencoders because they represent a set of “called-shot” features we hope to find, which are extremely interpretable and correspond to natural human thinking[1]. If the sparse autoencoder can find these features, that is some evidence that sparse autoencoders will find relevant and important features in language models. Conversely, if sparse autoencoders can’t find these features, that indicates a limitation of the method, and provides a test case where we can adjust our training methods until we can find them.

Overview

Here we:

  1. Train an OthelloGPT model from scratch.

  2. Train a linear probe to classify the board states (replicating Hazineh et al) from an intermediate layer of OthelloGPT.

  3. Train a sparse autoencoder on the same layer of OthelloGPT.

  4. Assess whether the features found by the sparse autoencoder include the linear encoding of the current board state that the linear probe is able to find.

  5. Retrain the sparse autoencoder with different random seeds, and analyze which features are found.

An overview of the approach

Methods

Training OthelloGPT

We first trained an OthelloGPT model from scratch, following the approach of Li et al. Our model is a 25M-parameter, 8-layer, decoder-only transformer with residual stream dimension d_model=512 (identical to Li et al’s model). It is trained to do next-token prediction on random transcripts of Othello games, with each possible move encoded as a separate token, resulting in a vocabulary size of 66 (64 from the positions on the board, plus 2 special tokens). The model was trained on a corpus of 640K games for 2 epochs, using the Adam optimizer with learning rate 1e-3.
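For concreteness, the hyperparameters above can be gathered into a single config object. This is a sketch, not the original codebase; the class and field names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class OthelloGPTConfig:
    """Hypothetical config collecting the hyperparameters described in the text."""
    n_layers: int = 8          # decoder-only transformer layers
    d_model: int = 512         # residual stream dimension
    vocab_size: int = 66       # 64 board positions + 2 special tokens
    n_games: int = 640_000     # training corpus size
    epochs: int = 2
    lr: float = 1e-3           # Adam learning rate

cfg = OthelloGPTConfig()
print(cfg)
```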

The trained model had a 5% error rate in predicting next legal moves. This is far higher than Li et al’s 0.01%, which I believe is due to my shorter training run on a smaller dataset[2]. Despite this relatively high error rate, the model has been trained to the point where it exhibits the linearly-encoded board state described by Hazineh et al, as we will show in the next section.

Training Linear Probes

We next train linear probes on the model’s residual stream to classify the contents of individual board positions. This serves two purposes: first, to confirm that our OthelloGPT model linearly encodes the board state; second, to provide a baseline for the classification accuracy we can expect from any sparse autoencoder features.

As in Nanda and Hazineh et al, we found that we could train higher-accuracy probes if we classify positions as “empty/own/enemy” rather than “empty/black/white”. Following Nanda’s recommendation, we trained our probes on the residual stream of the model just after the MLP sublayer of layer 6. Each probe is a linear classifier from the residual stream (ℝ^512) to scores for the three classes (ℝ^3), trained to minimize cross-entropy between the true labels of the board state and the classifier’s predictions. We train one probe for each of the 64 board positions, resulting in 3*64 directions in activation space[3]. As in Nanda’s work, we found that our classifiers had greater accuracy if we restricted them to the “middle” turns of each Othello game, in our case turns [4, 56). The probes were trained on 100K labelled games, for 1 epoch, using the Adam optimizer with learning rate 1e-3.

The resulting probes predict board positions with an error rate of 10%. While this is much larger than Hazineh et al’s 0.5% error rate, it is far better than chance, and indicates that there is linear structure to find. We also measure classification accuracy with AUROC, since this allows us to compare probe and feature directions as classifiers. In particular, for each position, for classes A/B/C with scores a/b/c, we use the “rectified” score a-0.5(b+c) as a score for class A vs (B or C). We find that all of the 192 rectified probe directions have an AUROC greater than .9, with the exception of the 12 directions corresponding to the central 4 tiles (which begin the game filled, and therefore might be handled differently by the language model). We will therefore use .9 as the (semi-arbitrary) threshold for a “high-accuracy” classifier.
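The rectified scoring and AUROC computation above can be sketched as follows. This is a minimal illustration on toy data, not the original evaluation code; `auroc` is a small rank-based (Mann-Whitney) implementation.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC: P(positive score > negative score), ties counting half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def rectified_scores(logits):
    """For class scores (a, b, c), score class A vs (B or C) as a - 0.5*(b + c)."""
    a, b, c = logits[:, 0], logits[:, 1], logits[:, 2]
    return a - 0.5 * (b + c)

# Toy probe logits for one position over 6 board states;
# label 1 means the position really is class A (e.g. empty), 0 otherwise.
logits = np.array([[2.0, 0.1, 0.2],
                   [1.5, 0.3, 0.1],
                   [0.1, 1.8, 0.4],
                   [0.2, 0.3, 1.9],
                   [1.7, 0.2, 0.2],
                   [0.1, 0.4, 1.6]])
labels = np.array([1, 1, 0, 0, 1, 0])
print(auroc(rectified_scores(logits), labels))  # 1.0: toy scores separate perfectly
```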

A histogram of the AUROC classification accuracies of the 192 probe directions. Almost all probe directions serve as a classifier with AUROC at least .9. The small bar at 0.5 is an artifact of measuring AUROC when there is only one class, corresponding to the 4 central tiles being classified as empty/​(own+enemy). The other bar below .9 corresponds to the 8 probes for classifying those central tiles as own/​(empty+enemy) or enemy/​(own+empty).
A density plot of the rectified probe activations for position 26 across the test set. The rectified probe scores cleanly classify Empty positions, and make some progress at classifying Enemy and Own positions.

Training The Sparse Autoencoder

Our sparse autoencoder architecture is based on that in Cunningham et al, consisting of a single hidden layer with ReLU activations, tied encoder-decoder weights, and a bias on the encoder but not the decoder. As with the probes, we trained on layer 6 of the GPT model, and on turns [4, 56). We used a feature ratio R=2 (1024 features for a 512-dimensional model), and a sparsity coefficient α=7.7e-2. This sparsity coefficient was chosen after a hyperparameter sweep to minimize the sum of the unexplained variance and the average number of features active. The autoencoder was trained on a corpus of 100K games for 4 epochs, using the Adam optimizer with learning rate 1e-3.
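A minimal numpy sketch of this architecture (forward pass and loss only; the training loop is omitted and the initialization scale is an assumption, not taken from the original code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 512, 1024   # feature ratio R = 2
alpha = 7.7e-2                    # sparsity coefficient

# Tied weights: the decoder is the transpose of the encoder.
W = rng.normal(0, 0.02, size=(n_features, d_model))
b_enc = np.zeros(n_features)      # bias on the encoder only, none on the decoder

def sae_forward(x):
    """x: (batch, d_model) residual-stream activations."""
    f = np.maximum(0.0, x @ W.T + b_enc)   # ReLU feature activations
    x_hat = f @ W                          # tied decoder reconstruction
    return f, x_hat

def sae_loss(x):
    f, x_hat = sae_forward(x)
    recon = ((x - x_hat) ** 2).mean()        # reconstruction (MSE) term
    sparsity = np.abs(f).sum(axis=1).mean()  # L1 penalty on feature activations
    return recon + alpha * sparsity

x = rng.normal(size=(8, d_model))  # stand-in for a batch of residual-stream vectors
print(sae_loss(x))
```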

The resulting autoencoder had an average of 12% of features active, 17% unexplained variance, and 0.2% dead features on the test set.

Results

SAE Features as Current-Board Classifiers

For each of the 1024 sparse autoencoder features, we can measure whether it correctly classifies the current state of a board position as empty/own/enemy. We find that there are several features which serve as highly accurate classifiers of whether a tile is empty.

Feature 395 has a high activation when Position 43 is empty, and activates less (or not at all) when there is a piece there. Data is from a test set of 2000 games. Feature scores of 0 have been separated out into the bar on the left. Depicted AUROC is .9999.

Visual inspection of the boards confirms that Feature 395 correctly classifies if position 43 is empty or filled:

(Top) The five Othello boards that most activated Feature 395. (Bottom) Five random boards. Feature activations are shown, and the feature with a cutoff of 30 correctly predicts if Position 43 (D6, marked with a red circle) is filled.

The sparse autoencoder found 9 features which act as classifiers with AUROC>.9, all for assessing when the tile is empty vs (own+enemy). The best non-empty classifier is Feature 525, classifying Position 7 with an AUROC of .86:

Feature 525 activates more when position 7 has an enemy piece, and less when it is empty or has your own piece. AUROC shown is .86.

Here are the top- and random-activating examples for this feature:

(Top) The five Othello boards that most activated Feature 525. (Bottom) Five random boards. Feature activations are shown, and the feature often, though not always, predicts if Position 7 (H1, marked with a red circle) has an enemy piece (denoted as white).

It should be noted that both of these classification tasks are computationally simpler than the other classification tasks: checking if a tile is empty is just querying the context window for the corresponding token, and since corners cannot be flipped, checking if a corner is an enemy piece is just querying the context window for that token an odd number of turns ago. (Though that’s not what the classifiers are doing, at least directly, since they only have access to the residual stream and not the tokens themselves.)

The feature best at classifying a non-corner, non-empty tile is Feature 688, which has an AUROC of .828:

Feature 688 activates more when position 15 has a piece than when it is empty, but just barely differentiates enemy pieces from its own pieces. AUROC shown is 0.828.

Overall, the sparse autoencoder has found some classifier features, but the vast majority of features do not classify the current board state, and the majority of board states are not well-classified by features. The features that are good classifiers correspond to “easier” classification tasks, ones that do not engage with the complexities of pieces flipping.

Which Features are Learned?

Knowing that only some classifiers are found by the sparse autoencoder, we should ask:

  1. Which classifiers?

  2. Are these directions “easier to find”, or would the autoencoder find other ones if retrained?

To test this, I trained the autoencoder 10 times, with different random seeds, and investigated which features were found. The only differences between these autoencoder training runs were: the initialization of the autoencoder weights, and the ordering of the training data (though the overall training set was the same).

I then checked if each autoencoder had a feature which acts as a classifier for a position with AUROC>.9. This is the result:

The sparse autoencoder is likely to find features which classify the inner ring of tiles, and the corners. Sparse autoencoders almost never find good classifiers for other positions. Except for the corner tiles, all classifiers with AUROC>.9 were for whether the position was empty vs filled.

This indicates that the inner-ring features are in some way easier for the autoencoder to learn, either due to the dataset used or the way OthelloGPT represents them. These are likely the most prominent features to learn, since these moves are playable from the beginning of the game and have important effects on whether other moves are playable. The lack of classifiers for the central tiles is explained by the difficulty of classifying those tiles even with linear probes (recall that the probes there had AUROC<.9). The corner classifiers also seem to be easier to learn, and are the only features with AUROC>.9 which classify enemy vs own pieces.

Overall, we can conclude that the autoencoder has a preference for learning some features over others. These features might be more “prominent” in the residual stream, or in the dataset, or in some other way, and I have not tested these hypotheses yet.

Since the model is trained to predict legal moves, one might expect it to learn features for whether a move is legal. And unlike in the autoencoders-on-text case, there are fewer tokens than autoencoder features, so it would be easy to allocate 60/1024 features to predicting tokens, if that were useful to the sparse autoencoder.

We find the autoencoder often finds features that classify whether a move is legal. However, this is confounded by the overlap of “move is legal” and “tile is empty” (the former is the latter plus some extra conditions). There were several features that are decent legal-move classifiers, but when you look at their activation distributions it’s clear they are actually empty-tile classifiers that score well on legal-move classification because P(legal | empty) was high:

Feature 722 has a high AUROC as a legal-move classifier for position 26, but is an even better classifier of whether position 26 is empty. Its high accuracy at classifying legal moves can be explained by classifying if the square is empty, plus applying the incorrect logic “empty → legal”.
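This confound is easy to reproduce in a small simulation: a classifier that only sees whether a tile is empty still scores well as a legal-move classifier when P(legal | empty) is high. All numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
empty = rng.random(n) < 0.5                # whether the tile is empty
legal = empty & (rng.random(n) < 0.8)      # legal ⊆ empty; P(legal | empty) = 0.8 (made up)

def auroc(scores, labels):
    """Rank-based AUROC: P(positive score > negative score), ties counting half."""
    pos, neg = scores[labels], scores[~labels]
    cmp = (pos[:, None] > neg[None, :]) + 0.5 * (pos[:, None] == neg[None, :])
    return cmp.mean()

score = empty.astype(float)                # a pure empty-tile classifier
print(auroc(score, legal))                 # well above 0.5, despite ignoring legality
```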

Some density plots looked like Feature 722/Position 26, showing clear confounding, and others look like this, where the distributions are nearly identical:

Finally, we can compare the AUROCs of the probes to the SAE features (both as content predictors and legal move predictors):

Histogram of AUROCs for: (Left) probe directions doing content prediction (identical to a previous figure). (Center) For each position/​content pair, the best AUROC of a sparse autoencoder feature at classifying that pair. (Right) For each position/​legality pair, the best AUROC of a sparse autoencoder feature at classifying that pair. The probe directions are concentrated near AUROC=1, while both assessments of the SAE features are more levelly distributed around AUROC=.7, with a tail to the right and a small spike near AUROC=1.

Cosine Similarities

Finally, we can directly compare the directions found by probes and the autoencoders. In particular, for each of the 192 rectified probe directions, we computed its maximum cosine similarity across the 1024 autoencoder directions. This is the result:

(Blue, left axis) Histogram of maximum cosine similarities of probe directions with autoencoder directions. (Orange, right axis) Maximum cosine similarities of probe directions with random vectors, acting as a control group. The probe directions have much higher MCS than would be expected from chance, but never exceed a cosine similarity of .7.

We can conclude that the autoencoder directions are relatively close to the probe directions in activation space, but do not perfectly match them. This lack of perfect matching is not concerning, since a cosine similarity of .6 is enough, both in theory and in practice, for an autoencoder feature to be a ~perfect classifier.
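The maximum-cosine-similarity computation is straightforward to sketch. The direction matrices below are random placeholders standing in for the trained probe and autoencoder directions.

```python
import numpy as np

rng = np.random.default_rng(0)
probe_dirs = rng.normal(size=(192, 512))   # placeholder: rectified probe directions
sae_dirs = rng.normal(size=(1024, 512))    # placeholder: autoencoder dictionary directions

def max_cosine_sims(A, B):
    """For each row of A, the maximum cosine similarity over all rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return (A @ B.T).max(axis=1)           # (192,) one MCS per probe direction

mcs = max_cosine_sims(probe_dirs, sae_dirs)
print(mcs.shape, mcs.mean())
```

With random directions this also gives the control distribution shown in orange in the figure above.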

One Really Cool Case Study

As I was investigating high-correlation features that were bad classifiers (by AUROC), I found several features like this one, which shows clear bimodality that isn’t aligned with empty/​own/​enemy pieces:

Feature 348’s activations are bimodal, with a large peak near 0 and a smaller peak near 50. The peak near 0 is a mixture of empty/​own/​enemy pieces, but the peak at 50 is only for enemy pieces.

For this feature, when I looked at the top-activating boards, I found them to be highly interpretable. See if you can spot the pattern:

Top and random activating boards for feature 348.

It looks like this feature activates when positions F2-F6 are all white! And what’s incredible is the “partial activation” in the bottom row: the feature activates at 12 when positions E2-E6 are all white! That seems like an extremely understandable “near-miss” for the feature, which is astonishing to me.

We should here acknowledge that Othello and OthelloGPT can be harder to interpret than English text. Whereas humans will find patterns and themes in text for fun, I found my brain was very much not wired for analyzing Othello boards, and therefore in most cases I could only test feature interpretability by programmatically testing individual hypotheses. Therefore, I have not been able to analyze the vast majority of OthelloGPT features, and they may have interpretable meanings like the above that simply do not show up on my metrics. If anyone wants to do a lot of case studies of individual features, I’m happy to share the tools I have.

Conclusion

We have shown that out of a set of 180 a-priori interesting and interpretable features in OthelloGPT, sparse autoencoders find only 9 of them. While this confirms the promise of sparse autoencoders in Cunningham et al and Bricken et al, that they find interpretable features, it also underlines the limitations of the approach: this is the first work demonstrating that sparse autoencoders can fail to find a concrete set of interesting, interpretable features, and suggests that currently-existing sparse autoencoders cannot “fully” interpret a language model. We hope that these results will inspire more work to improve the architecture or training methods of sparse autoencoders to address this shortcoming. Finally, we hope we have shown that OthelloGPT, with its linear world state, is useful for measuring if unsupervised techniques find important interpretable directions, and can be a fruitful place to test interpretability techniques.

Future work:

  1. Redo this analysis on the fully-trained OthelloGPT created by Li et al.

  2. Adjust the autoencoder architecture until it is able to find more of the features we hope to see. Possible architectural changes include:

    1. Untied encoder/​decoder weights as in Anthropic’s Bricken et al.

    2. Update the architecture using the tricks described in Anthropic’s updates.

    3. Update the loss function to include an orthogonality penalization term, as described by Till.

  3. (Low-priority) Redo this analysis on the MLP layer of the transformer (as Bricken et al do) instead of the residual stream. (The MLP layers may not linearly represent the board state, so first we’d want to verify this with a new set of probes.)

  4. (Low-priority) Continue investigating individual autoencoder features.

  1. ^

    Though, notably, this was not the first way people expected OthelloGPT to encode the board state. Since humans conceptualize Othello as a game between two players, the original authors tried to find linear features representing whether a square was empty/​black/​white. The resulting classifiers were mediocre, with an error rate of 20% or more. However, Neel Nanda found that since the language model “plays both sides”, it linearly encodes the board as empty/​own/​enemy pieces (i.e., grouping “black on even turns” with “white on odd turns” instead of “black on odd turns”), and Hazineh et al find probes trained to do this classification can have an error rate as low as 0.5%.

  2. ^

    I plotted OthelloGPT’s error rate across its training, and it followed a straightforward scaling law that would have reached 0.01% error rate with a few more OOMs of training data. I opted not to train to that level, but I plan to redo my analyses on Li et al’s OthelloGPT when I can get my hands on it.

  3. ^

    Though we should note that the relative differences between directions are more important than the directions themselves. Since the predictions go through a softmax, the classifier with directions (A,B,C) produces the same results as one with directions (A+X, B+X, C+X) for any X. The invariant properties learned by the classifier are the differences B-A and C-A, or “rectified directions” like A-0.5(B+C).
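    This shift-invariance is easy to verify numerically (toy scores below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtracting the max also gives numerical stability
    return e / e.sum()

scores = np.array([1.0, -0.5, 2.0])   # toy class scores (a, b, c) for one activation
shift = 3.7                            # any constant X added to all three scores
assert np.allclose(softmax(scores), softmax(scores + shift))
print(softmax(scores))
```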