Cool work!
Have you tried generating autointerp for the SAE features? I’d be quite excited about a loop that does the following (sketched in code after the steps):
Take an SAE feature and get its max-activating examples.
Use a multimodal model, maybe Claude, to do autointerp from images of each of the chess positions (might be hard, but with the right prompt it seems doable).
Using a codebase that implements chess logic and can be abstracted away (e.g. one with functions that take a board state and return whether statements like “is the king in check?” are true), get a model to implement a function that matches its interpretation of the feature.
Use this to generate a labelled dataset on which you then train a linear probe.
Compare the probe activations to the feature activations. In particular, see whether you can generate a better automatic interpretation of the feature by prompting with examples where the feature and the probe disagree.
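To make that concrete, here’s a minimal sketch of the labelling/probe/comparison steps, under assumptions: `boards` is a list of python-chess positions from the dataset, `acts` is the (n, d_model) array of the model’s activations on them, and `feature_acts` is the length-n vector of the SAE feature’s activations on the same positions (all three are hypothetical stand-ins for whatever your pipeline produces), and the interpretation function is just one example hypothesis.

```python
import chess
import numpy as np
from sklearn.linear_model import LogisticRegression

def interpretation_fn(board: chess.Board) -> bool:
    """The function the model writes to match its interpretation of the
    feature; here, the example hypothesis 'the side to move is in check'.
    python-chess makes most predicates like this one-liners."""
    return board.is_check()

def train_probe(boards, acts):
    """Programmatically label every position with the interpretation
    function, then fit a linear probe on the model activations."""
    y = np.array([interpretation_fn(b) for b in boards])
    return LogisticRegression(max_iter=1000).fit(acts, y)

def disagreement_examples(boards, acts, feature_acts, probe, threshold=0.0):
    """Positions where the probe and the SAE feature disagree: prompt
    material for the next round of autointerp (rendered back to images,
    e.g. with chess.svg.board)."""
    probe_on = probe.predict(acts).astype(bool)
    feature_on = feature_acts > threshold
    return [b for b, p, f in zip(boards, probe_on, feature_on) if p != f]
```

The nice property is that once the model has written `interpretation_fn`, the labels are free and exact rather than LM-generated.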
I suspect this is nicer than language modelling in that you can programmatically generate your data labels from explanations rather than relying on LMs. Of course, you could just decide a priori what probes to train, but the loop between autointerp and the next probe seems cool to me. I predict that current SAE training methods will result in long description lengths or low recall, and that the tradeoff will be poor.
Cool work! I’d be excited to see whether latents found via this method are higher-quality linear classifiers when they appear to track concepts (e.g. first letters), and also whether they enable us to train better classifiers over model internals than other SAE architectures or linear probes do (https://transformer-circuits.pub/2024/features-as-classifiers/index.html).
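For the first question, a rough sketch of what the comparison could look like: treat the latent’s activation as a one-dimensional classifier (picking its threshold on a train split) and compare held-out F1 against a logistic probe trained on the same activations. Here `latent_acts`, `resid`, and `labels` are hypothetical stand-ins for the latent’s activations, the residual stream, and the concept’s ground truth.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def latent_vs_probe_f1(latent_acts, resid, labels, seed=0):
    """Compare a thresholded SAE latent against a trained linear probe
    as classifiers for the same concept, on a held-out split."""
    la_tr, la_te, X_tr, X_te, y_tr, y_te = train_test_split(
        latent_acts, resid, labels, test_size=0.5, random_state=seed)
    # Pick the latent's decision threshold on the train split only.
    candidates = np.quantile(la_tr, np.linspace(0.5, 0.99, 50))
    best_t = max(candidates, key=lambda t: f1_score(y_tr, la_tr > t))
    latent_f1 = f1_score(y_te, la_te > best_t)
    # Full linear probe over the same model internals.
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    probe_f1 = f1_score(y_te, probe.predict(X_te))
    return latent_f1, probe_f1
```

If a latent really tracks the concept, its F1 should approach the probe’s despite being a single fixed direction, which is roughly the test the features-as-classifiers post gestures at.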