A $1 training run would be training 6 SAEs across 6 sparsities at 16K width on Gemma-2-2B for 200M tokens. This cost includes generating the activations; it would be cheaper if the activations were precomputed. In practice this seems like a large enough scale to validate ideas such as the Matryoshka SAE or the BatchTopK SAE.
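For reference, here is a rough sketch of what the BatchTopK idea looks like in code, assuming the usual ReLU-encoder / linear-decoder setup; the width matches the 16K mentioned above, but d_model, k, and the initialization are illustrative assumptions rather than details from this post:

```python
import torch
import torch.nn as nn

class BatchTopKSAE(nn.Module):
    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(self.W_enc.data.T.clone())
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # encoder pre-activations
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # keep the top (k * batch_size) activations across the whole batch,
        # rather than the top k per example as in a vanilla TopK SAE
        flat = acts.flatten()
        n_keep = self.k * x.shape[0]
        topk = torch.topk(flat, n_keep)
        mask = torch.zeros_like(flat)
        mask.scatter_(0, topk.indices, 1.0)
        sparse_acts = (flat * mask).view_as(acts)
        recon = sparse_acts @ self.W_dec + self.b_dec
        return recon, sparse_acts

# usage sketch: 16K-wide SAE on Gemma-2-2B's 2304-dim residual stream
# sae = BatchTopKSAE(d_model=2304, d_sae=16_384, k=64)
```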
Yeah, if you’re doing this, you should definitely precompute and save activations.
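A minimal sketch of precomputing and caching activations with HuggingFace transformers, under some stated assumptions: the layer index, the text batch, and the save path are placeholders, and in practice you'd stream many shards rather than a single batch:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "google/gemma-2-2b"
LAYER = 12                     # hypothetical: whichever residual-stream layer the SAE targets
SAVE_PATH = "acts_shard_0.pt"  # hypothetical path for one shard of cached activations

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

texts = ["Example text to tokenize."]  # in practice, a stream of pretraining-style text
batch = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# hidden_states[LAYER] is the residual stream after block LAYER; flatten the
# (batch, seq, d_model) tensor and save it so later SAE training runs skip the
# forward pass through Gemma-2-2B entirely.
acts = out.hidden_states[LAYER]
torch.save(acts.reshape(-1, acts.shape[-1]).cpu(), SAVE_PATH)
```

The payoff is that all six sparsity settings can then train against the same cached shards instead of each re-running the base model over 200M tokens.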