A small update to the Sparse Coding interim research report

This is a linkpost for a set of slides containing an update to the project described in a previous post ([Interim research report] Taking features out of superposition with sparse autoencoders).

The update is very small and scrappy. We haven’t had much time to devote to this project since posting the Interim Research Report.

TL;DR for the slides:

  • We trained a minuscule language model (LM) with a residual stream of size 16 and 6 layers, then trained sparse autoencoders on the MLP activations (dimension 64) of its third layer (a minimal sketch of the autoencoder setup follows this list).

  • When we compared the ‘ground truth feature recovery’ plots for the toy data and the LM data, they were much more similar to each other than the corresponding plots in the Interim Research Report (the second sketch after this list shows roughly the kind of comparison these plots are based on).

  • Very, very tentatively, we found the layer had somewhere between 512 and 1024 features. By labelling a subset of these features, we estimate there are roughly 600 easily labellable (monosemantic) features. For instance, we found a feature that activates on the period immediately after ‘Mr’, ‘Mrs’, or ‘Dr’.

  • We suspect that the toy data and LM data plots had previously looked different because our sparse autoencoders were severely undertrained.
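
For concreteness, here is a minimal sketch of the kind of sparse autoencoder we train on the MLP activations. It assumes a standard linear encoder/decoder with a ReLU and an L1 penalty on the code; the dictionary size, learning rate, L1 coefficient, and the random `acts` tensor (a stand-in for real layer-3 MLP activations) are all illustrative, not the exact settings from the slides.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sparse autoencoder over MLP activations (here d_mlp = 64)."""

    def __init__(self, d_mlp: int = 64, n_dict: int = 1024):
        super().__init__()
        self.encoder = nn.Linear(d_mlp, n_dict)
        self.decoder = nn.Linear(n_dict, d_mlp)

    def forward(self, x: torch.Tensor):
        code = torch.relu(self.encoder(x))   # non-negative sparse code over dictionary features
        x_hat = self.decoder(code)           # reconstruction of the activation vector
        return x_hat, code

def loss_fn(x, x_hat, code, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the code.
    recon = (x - x_hat).pow(2).mean()
    sparsity = code.abs().mean()
    return recon + l1_coeff * sparsity

# Illustrative training loop over a buffer of MLP activations.
sae = SparseAutoencoder(d_mlp=64, n_dict=1024)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
acts = torch.randn(4096, 64)  # stand-in for real layer-3 MLP activations
for step in range(10_000):
    batch = acts[torch.randint(0, acts.shape[0], (256,))]
    x_hat, code = sae(batch)
    loss = loss_fn(batch, x_hat, code)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Many setups also renormalise decoder columns to unit norm here,
    # so that dictionary elements stay comparable.
```

And here is a rough sketch of the mean max cosine similarity (MMCS) comparison that the ‘ground truth feature recovery’ plots are based on, as we use it: for toy data the reference dictionary is the known ground-truth feature set, while for LM data (where there is no ground truth) one option is to compare learned dictionaries of different sizes or seeds against each other. Shapes and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def mean_max_cosine_similarity(learned: torch.Tensor, reference: torch.Tensor) -> float:
    """For each reference feature, take the max cosine similarity with any
    learned dictionary element, then average over reference features.

    learned:   (n_learned, d) dictionary elements
    reference: (n_reference, d) ground-truth features (toy data) or another learned dictionary (LM data)
    """
    learned = F.normalize(learned, dim=-1)
    reference = F.normalize(reference, dim=-1)
    cos = reference @ learned.T                     # (n_reference, n_learned)
    return cos.max(dim=-1).values.mean().item()
```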
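An MMCS near 1 means almost every reference feature has a close match in the learned dictionary; plotting this value across dictionary sizes is what lets us compare the toy-data and LM-data settings on the same footing.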

We’re hopeful that, with more time to devote to this project, we can confirm the results and apply the method to larger LMs. If it works, it would let us tell mechanistic stories about what goes on inside large LMs in terms of monosemantic features.