Announcing Gemma Scope 2
TLDR
The Google DeepMind mech interp team is releasing Gemma Scope 2: a suite of SAEs & transcoders trained on the Gemma 3 model family
Key features relative to the previous Gemma Scope release:
More advanced model family (V3 rather than V2), which should enable analysis of more complex forms of behaviour
More comprehensive release (SAEs on every layer, for all models up to size 27b, plus multi-layer models like crosscoders and CLTs)
More focus on chat models (every SAE trained on a PT model has a corresponding version finetuned for IT models)
Although we’ve deprioritized fundamental research on tools like SAEs (see reasoning here), we still hope these will serve as a useful tool for the community
Some example latents
Here are some example latents taken from the residual stream SAEs for Gemma V3 27B IT.
What the release contains
This release contains SAEs trained on 3 different sites (residual stream, MLP output and attention output) as well as MLP transcoders (both with and without affine skip connections), for every layer of each of the 10 models in the Gemma 3 family (i.e. sizes 270m, 1b, 4b, 12b and 27b, both the PT and IT versions of each). For every layer, we provide 4 models (widths 16k and 262k, and two different target L0 values). Rather than giving the exact L0s, we label them “small” (10-20), “medium” (30-60) and “big” (90-150).
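To make the width and L0 terminology concrete, here is a minimal sketch of what one of these single-layer SAEs computes. All of the models use the JumpReLU architecture described below; the parameter names here are placeholders rather than the release's exact conventions.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, theta, W_dec, b_dec):
    """Sketch of a JumpReLU SAE forward pass (parameter names are placeholders).

    x:     activations at the hooked site, shape (n_tokens, d_model)
    W_enc: (d_model, width) -- "width" is the number of latents, e.g. 16k or 262k
    theta: per-latent JumpReLU thresholds, shape (width,)
    W_dec: (width, d_model)
    """
    pre_acts = x @ W_enc + b_enc
    # JumpReLU: keep a pre-activation only if it exceeds its learned threshold
    acts = pre_acts * (pre_acts > theta)
    recon = acts @ W_dec + b_dec
    return acts, recon

# The "L0" of an SAE is the average number of latents active per token:
# acts, recon = sae_forward(x, W_enc, b_enc, theta, W_dec, b_dec)
# l0 = (acts > 0).sum(axis=-1).mean()
```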
Additionally, for 4 layers in each model (at depths 25%, 50%, 65%, 85%) we provide a larger hyperparameter sweep over widths and L0 values for each of these single-layer model types, including residual stream SAEs with widths up to 1m for every model.
Lastly, we’ve also included several multi-layer models: CLTs on 270m & 1b, and weakly causal crosscoders trained on the concatenation of 4 layers (the same 4 depths mentioned above) for every base model size & type.
All models are JumpReLU, trained using a quadratic L0 penalty along with an additional frequency penalty which prevented the formation of high-frequency features. We also used a version of Matryoshka loss during training, which has been documented to help reduce the incidence of feature absorption.
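The exact form of the training objective isn't written out here, but as a rough, non-authoritative sketch of its shape (the precise quadratic penalty, the frequency penalty term, and the Matryoshka weighting are not specified, so treat the following as illustrative only):

$$
\mathcal{L}(\mathbf{x}) \;=\; \underbrace{\big\lVert \mathbf{x} - \hat{\mathbf{x}}(\mathbf{x}) \big\rVert_2^2}_{\text{reconstruction}} \;+\; \underbrace{\lambda \,\big\lVert \mathbf{f}(\mathbf{x}) \big\rVert_0^{2}}_{\text{quadratic } L_0 \text{ penalty}} \;+\; \underbrace{\mathcal{L}_{\text{freq}}(\mathbf{x})}_{\text{frequency penalty}}
$$

where $\mathbf{f}(\mathbf{x})$ denotes the JumpReLU latent activations and $\hat{\mathbf{x}}(\mathbf{x})$ the reconstruction; as in the original JumpReLU work, gradients through the discontinuous $L_0$ and threshold terms would be handled with straight-through estimators.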
Which ones should you use?
If you’re interested in finding features connected to certain behavioural traits (to perform steering, to better attribute certain model behaviours, or to analyze directions you’ve found inside the model using supervised methods, etc.), we recommend using the residual stream models trained on a subset of the model layers (e.g. here). The 262k-width models with medium L0 values (in the 30-60 range) should prove suitable for most people, although the 16k and 65k widths may also be useful. All the examples in the screenshots above were from 262k-width medium-L0 SAEs finetuned on Gemma V3 270m IT.
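As an illustration of the steering use case, here's a minimal sketch, assuming you've already extracted the residual-stream activations at the SAE's layer and the SAE's decoder matrix (variable names are hypothetical):

```python
import numpy as np

def steer_with_latent(resid, W_dec, latent_idx, strength=5.0):
    """Add one SAE latent's decoder direction to residual-stream activations.

    resid:      (n_tokens, d_model) residual-stream activations at the SAE's layer
    W_dec:      (width, d_model) SAE decoder matrix
    latent_idx: index of the latent whose behaviour you want to amplify
    strength:   steering coefficient, typically tuned by sweeping a few values
    """
    direction = W_dec[latent_idx]
    direction = direction / np.linalg.norm(direction)  # unit-normalise the direction
    return resid + strength * direction
```

In practice you would apply something like this inside a forward hook at the chosen layer (e.g. via TransformerLens or plain PyTorch hooks), so the modified residual stream is passed on to the rest of the model.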
If you’re interested in doing circuit-style analysis, e.g. with attribution graphs, we recommend using the suite of transcoders we’ve trained on all layers of the model, e.g. here. Affine skip connections were strictly beneficial, so we recommend using these. Models with larger width lead to richer analysis, but the computational cost of circuit-style work can grow very large, especially for bigger base models, so you may wish to use the 16k width rather than 262k. Neuronpedia will shortly be hosting an interactive page which allows you to generate and explore your own attribution graphs using these transcoders.
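For concreteness, an MLP transcoder with an affine skip connection can be sketched roughly as follows; this is a sketch under the same placeholder naming as above, not the released implementation:

```python
import numpy as np

def skip_transcoder(mlp_in, W_enc, b_enc, theta, W_dec, b_dec, W_skip, b_skip):
    """Sketch of an MLP transcoder with an affine skip connection.

    Unlike an SAE, the reconstruction target is the MLP *output* given the
    MLP *input*, so the transcoder learns to imitate the MLP layer itself.
    """
    pre_acts = mlp_in @ W_enc + b_enc
    acts = pre_acts * (pre_acts > theta)       # JumpReLU latents, as above
    sparse_out = acts @ W_dec + b_dec          # sparse reconstruction of the MLP output
    skip_out = mlp_in @ W_skip + b_skip        # learned affine skip path from input to output
    return sparse_out + skip_out
```

The rough idea behind the skip path is that it gives the transcoder a dense linear approximation of the MLP to fall back on, so the sparse latents only need to account for what that linear map misses.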
Some useful links
Here are all the relevant links to go along with this release:
Neuronpedia demo (inference and attribution graphs coming soon!)
HuggingFace repo, to access all model weights
^ The ARENA material will also be updated to use this new suite of models, in place of the models from the 2024 Gemma Scope release.