Announcing Gemma Scope 2
TLDR
The Google DeepMind mech interp team is releasing Gemma Scope 2: a suite of SAEs & transcoders trained on the Gemma 3 model family
Key features relative to the previous Gemma Scope release:
More advanced model family (V3 rather than V2), which should enable analysis of more complex forms of behaviour
More comprehensive release (SAEs on every layer, for all models up to size 27b, plus multi-layer models like crosscoders and CLTs)
More focus on chat models (every SAE trained on a PT model has a corresponding version finetuned for IT models)
Although we’ve deprioritized fundamental research on tools like SAEs (see reasoning here), we still hope these will serve as a useful tool for the community
Some example latents
Here are some example latents taken from the residual stream SAEs for Gemma V3 27B IT.
What the release contains
This release contains SAEs trained on 3 different sites (residual stream, MLP output and attention output) as well as MLP transcoders (both with and without affine skip connections), for every layer of each of the 10 models in the Gemma 3 family (i.e. sizes 270m, 1b, 4b, 12b and 27b, both the PT and IT versions of each). For every layer, we provide 4 models (widths 16k and 262k, and two different target L0 values). Rather than giving the exact L0s, we label them “small” (10-20), “medium” (30-60) and “big” (90-150).
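To make the width and L0 terminology concrete, here is a minimal sketch of what one of these single-layer SAEs computes. All of the models use the JumpReLU architecture described below; the parameter names here are placeholders rather than the release's exact conventions.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, theta, W_dec, b_dec):
    """Sketch of a JumpReLU SAE forward pass (parameter names are placeholders).

    x:     activations at the hooked site, shape (n_tokens, d_model)
    W_enc: (d_model, width) -- "width" is the number of latents, e.g. 16k or 262k
    theta: per-latent JumpReLU thresholds, shape (width,)
    W_dec: (width, d_model)
    """
    pre_acts = x @ W_enc + b_enc
    # JumpReLU: keep a pre-activation only if it exceeds its learned threshold
    acts = pre_acts * (pre_acts > theta)
    recon = acts @ W_dec + b_dec
    return acts, recon

# The "L0" of an SAE is the average number of latents active per token:
# acts, recon = sae_forward(x, W_enc, b_enc, theta, W_dec, b_dec)
# l0 = (acts > 0).sum(axis=-1).mean()
```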
Additionally, for 4 layers in each model (at depths 25%, 50%, 65%, 85%) we provide a larger hyperparameter sweep over widths and L0 values for each of these single-layer model types, including residual stream SAEs with widths up to 1m for every model.
Lastly, we’ve also included several multi-layer models: CLTs on 270m & 1b, and weakly causal crosscoders trained on the concatenation of 4 layers (the same 4 depths mentioned above) for every base model size & type.
All models are JumpReLU, trained using a quadratic L0 penalty along with an additional frequency penalty which prevented the formation of high-frequency features. We also used a version of Matryoshka loss during training, which has been documented to help reduce the incidence of feature absorption.
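The exact form of the training objective isn't written out here, but as a rough, non-authoritative sketch of its shape (the precise quadratic penalty, the frequency penalty term, and the Matryoshka weighting are not specified, so treat the following as illustrative only):

$$
\mathcal{L}(\mathbf{x}) \;=\; \underbrace{\big\lVert \mathbf{x} - \hat{\mathbf{x}}(\mathbf{x}) \big\rVert_2^2}_{\text{reconstruction}} \;+\; \underbrace{\lambda \,\big\lVert \mathbf{f}(\mathbf{x}) \big\rVert_0^{2}}_{\text{quadratic } L_0 \text{ penalty}} \;+\; \underbrace{\mathcal{L}_{\text{freq}}(\mathbf{x})}_{\text{frequency penalty}}
$$

where $\mathbf{f}(\mathbf{x})$ denotes the JumpReLU latent activations and $\hat{\mathbf{x}}(\mathbf{x})$ the reconstruction; as in the original JumpReLU work, gradients through the discontinuous $L_0$ and threshold terms would be handled with straight-through estimators.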
Which ones should you use?
If you’re interested in finding features connected to certain behavioural traits (to perform steering, to better attribute certain model behaviours, or to analyze directions you’ve found inside the model using supervised methods, etc.), we recommend using the residual stream models trained on a subset of the model layers (e.g. here). The 262k-width models with medium L0 values (in the 30-60 range) should prove suitable for most people, although the 16k and 65k widths may also be useful. All the examples in the screenshots above were from 262k-width medium-L0 SAEs finetuned on Gemma V3 270m IT.
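As an illustration of the steering use case, here's a minimal sketch, assuming you've already extracted the residual-stream activations at the SAE's layer and the SAE's decoder matrix (variable names are hypothetical):

```python
import numpy as np

def steer_with_latent(resid, W_dec, latent_idx, strength=5.0):
    """Add one SAE latent's decoder direction to residual-stream activations.

    resid:      (n_tokens, d_model) residual-stream activations at the SAE's layer
    W_dec:      (width, d_model) SAE decoder matrix
    latent_idx: index of the latent whose behaviour you want to amplify
    strength:   steering coefficient, typically tuned by sweeping a few values
    """
    direction = W_dec[latent_idx]
    direction = direction / np.linalg.norm(direction)  # unit-normalise the direction
    return resid + strength * direction
```

In practice you would apply something like this inside a forward hook at the chosen layer (e.g. via TransformerLens or plain PyTorch hooks), so the modified residual stream is passed on to the rest of the model.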
If you’re interested in doing circuit-style analysis, e.g. with attribution graphs, we recommend using the suite of transcoders we’ve trained on all layers of the model, e.g. here. Affine skip connections were strictly beneficial, so we recommend using these. Models with larger width lead to richer analysis, but the computational cost of circuit-style work can grow very large, especially for bigger base models, so you may wish to use the 16k width rather than 262k. Neuronpedia will shortly be hosting an interactive page which allows you to generate and explore your own attribution graphs using these transcoders.
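For concreteness, an MLP transcoder with an affine skip connection can be sketched roughly as follows; this is a sketch under the same placeholder naming as above, not the released implementation:

```python
import numpy as np

def skip_transcoder(mlp_in, W_enc, b_enc, theta, W_dec, b_dec, W_skip, b_skip):
    """Sketch of an MLP transcoder with an affine skip connection.

    Unlike an SAE, the reconstruction target is the MLP *output* given the
    MLP *input*, so the transcoder learns to imitate the MLP layer itself.
    """
    pre_acts = mlp_in @ W_enc + b_enc
    acts = pre_acts * (pre_acts > theta)       # JumpReLU latents, as above
    sparse_out = acts @ W_dec + b_dec          # sparse reconstruction of the MLP output
    skip_out = mlp_in @ W_skip + b_skip        # learned affine skip path from input to output
    return sparse_out + skip_out
```

The rough idea behind the skip path is that it gives the transcoder a dense linear approximation of the MLP to fall back on, so the sparse latents only need to account for what that linear map misses.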
Some useful links
Here are all the relevant links to go along with this release:
Neuronpedia demo (inference and attribution graphs coming soon!)
HuggingFace repo, to access all model weights
^ The ARENA material will also be updated to use this new suite of models, in place of the models from the 2024 Gemma Scope release.