Investigating task-specific prompts and sparse autoencoders for activation monitoring

Link post

We’d like to share the most recent paper from the OpenAI Interpretability Team. The question of how sparse-autoencoder-based activation probes compare with baseline activation probes has gotten a lot of attention recently. With this work, we took a step back and asked: how do both of these activation probing methods compare with the common baseline of just asking a language model in natural language? And how do they combine?

We find that a combination of “just asking the model” and activation probing, similar to the approach introduced in the representation engineering paper, improves on either prompting or activation probing alone. This method is highly data efficient and generalizes well, but requires additional inference-time compute for the task-specific monitoring prompt. We also find that SAE-based probing methods outperform raw activation probing on most of the monitoring tasks we studied.
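To make the combined approach concrete, here is a minimal sketch of the probing step: a linear probe trained on activations that would be collected at the final token of a task-specific monitoring prompt. All names, shapes, and the synthetic data are assumptions for illustration; in practice the activations come from a forward pass of the language model, not from random noise.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 32   # activation dimensionality (assumed for illustration)
n = 200  # number of labeled monitoring examples

# Stand-in for activations at the last token of "input + monitoring question".
# We plant a synthetic "signal direction" so the probe has something to find.
direction = rng.normal(size=d)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + np.outer(labels, direction)

# Train a logistic-regression probe with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    z = acts @ w + b
    p = 1.0 / (1.0 + np.exp(-z))   # probe's predicted probability
    grad = p - labels              # gradient of the logistic loss
    w -= 0.1 * (acts.T @ grad) / n
    b -= 0.1 * grad.mean()

preds = (acts @ w + b) > 0
accuracy = (preds == labels).mean()
```

The SAE-based variant would be structurally identical, with the probe trained on sparse autoencoder latent activations in place of the raw residual-stream activations.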

Contributors: Henk Tillman, Dan Mossing