Super nice! Will be curious to see the LLM results. A couple thoughts/questions:
The F1 scores are impressive but the MCC is still substantially below 1. Is it odd that each feature in the activations has an associated SAE latent that fires exactly when it ought to and yet its decoder directions are still pretty misaligned? Is this hedging? Do your SAEs have exactly 16k latents?
Before applying this new SAE to language models, you could see how hyperparameter-sensitive it is by creating multiple variants of SynthSAEBench with different correlation structure, Zipfian exponents over feature firing probabilities, etc. and see how well the optimal SAE hyperparams transfer from one to another.
The F1 scores are impressive but the MCC is still substantially below 1. Is it odd that each feature in the activations has an associated SAE latent that fires exactly when it ought to and yet its decoder directions are still pretty misaligned? Is this hedging? Do your SAEs have exactly 16k latents?
The SAEs have 4096 latents, so they are intentionally narrower than the synthetic model. The idea was that since we’re almost certainly never training SAEs that have the full number of features of an LLM, the SAEs here should also be intentionally too narrow.
I was also surprised that this doesn’t mess up the F1 probing of the SAE more. I assumed that hedging due to the SAE being too narrow would make it impossible for the encoder to act as that accurate a probe, but that’s seemingly not the case!
I also tried training a 4096-width decoder on the ground-truth activations to get a sense of the MCC ceiling with a perfect encoder at this SAE width. It gets an MCC of around 0.87, so there’s definitely more room for improvement on that metric. With only 4096 latents, though, I’m not sure there’s a way to get above 0.87 without some novel reconstruction loss or something similar.
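To make the F1-versus-MCC gap discussed above concrete, here is a minimal pure-Python sketch. It assumes F1 is computed by treating a latent's firing pattern as a binary classifier for one ground-truth feature, and that MCC is an average of absolute cosine similarities between already-matched decoder and ground-truth directions. The actual SynthSAEBench definitions may differ, and the function names and toy numbers are made up for illustration.

```python
import math

def f1_score(fired, active):
    """F1 of a latent's binary firing pattern against a feature's true activity."""
    tp = sum(1 for f, a in zip(fired, active) if f and a)
    fp = sum(1 for f, a in zip(fired, active) if f and not a)
    fn = sum(1 for f, a in zip(fired, active) if not f and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def mean_matched_cosine(decoder_dirs, true_dirs):
    """Average |cosine|, assuming latent i has already been matched to feature i."""
    sims = [abs(cosine(d, t)) for d, t in zip(decoder_dirs, true_dirs)]
    return sum(sims) / len(sims)

# Toy data (hypothetical): the latent fires exactly when the feature is active...
active = [1, 0, 1, 1, 0, 0]
fired = [1, 0, 1, 1, 0, 0]

# ...but its decoder direction mixes in part of a second direction.
true_dirs = [[1.0, 0.0, 0.0]]
decoder_dirs = [[1.0, 0.5, 0.0]]

probe_f1 = f1_score(fired, active)
alignment = mean_matched_cosine(decoder_dirs, true_dirs)
print(f"F1 = {probe_f1:.2f}, matched cosine = {alignment:.2f}")
```

In this toy case the firing pattern is a perfect probe while the decoder direction is still visibly off the true feature direction, which is the kind of pattern the question about hedging is pointing at.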
Before applying this new SAE to language models, you could see how hyperparameter-sensitive it is by creating multiple variants of SynthSAEBench with different correlation structure, Zipfian exponents over feature firing probabilities, etc. and see how well the optimal SAE hyperparams transfer from one to another.
This is what I plan to do next! I suspect a lot of the high scores here are just Claude over-optimizing for this specific synthetic model, so making a suite of models with different properties should hopefully make for a more robust test-bed.
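A suite like this could be sketched as a simple grid over benchmark settings. The snippet below is only an illustration: `zipf_firing_probs`, `benchmark_variants`, `correlation_scale`, and `max_prob` are hypothetical names, not part of SynthSAEBench, and the power-law form of the firing probabilities is an assumption about what "Zipfian exponent" means here.

```python
import itertools

def zipf_firing_probs(n_features, exponent, max_prob=0.5):
    """Assumed Zipf-like law: the feature at rank r fires with max_prob * r**(-exponent)."""
    return [max_prob * (rank ** -exponent) for rank in range(1, n_features + 1)]

def benchmark_variants(exponents, correlation_scales, n_features=16):
    """Build one hypothetical config per (exponent, correlation scale) pair."""
    variants = []
    for exponent, corr in itertools.product(exponents, correlation_scales):
        variants.append({
            "zipf_exponent": exponent,
            "correlation_scale": corr,  # placeholder knob for correlation structure
            "firing_probs": zipf_firing_probs(n_features, exponent),
        })
    return variants

# Small toy grid; a real suite would use far more features per model.
variants = benchmark_variants(
    exponents=[0.5, 1.0, 1.5],
    correlation_scales=[0.0, 0.3],
)
for v in variants:
    print(v["zipf_exponent"], v["correlation_scale"], round(v["firing_probs"][-1], 4))
```

The transfer check would then be to tune SAE hyperparameters on one variant and evaluate them unchanged on the others.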
If you want to go full autonomous research mode you could even have another Claude find adversarial parameters of the SynthSAEBench dataset (within some reasonable constraints) to see where the methods break or would perform worse than baselines.
I imagine you could find some nice robust improvements this way.
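As a rough sketch of what that adversarial search might look like in its simplest form, here is a random search over bounded benchmark parameters. Everything in it is hypothetical: `gap_fn` stands in for "train the new SAE and a baseline on a benchmark built from these parameters, then return new score minus baseline score", and `toy_gap` is a made-up function used only so the example runs.

```python
import random

def adversarial_search(gap_fn, bounds, n_trials=200, seed=0):
    """Random search for the parameters that minimize (method score - baseline score)."""
    rng = random.Random(seed)
    best_params, best_gap = None, float("inf")
    for _ in range(n_trials):
        # Sample each parameter uniformly inside its allowed range.
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        gap = gap_fn(params)
        if gap < best_gap:
            best_params, best_gap = params, gap
    return best_params, best_gap

def toy_gap(params):
    """Stand-in for a real evaluation: the advantage shrinks as both knobs grow."""
    return 0.2 - 0.1 * params["zipf_exponent"] - 0.3 * params["correlation_scale"]

# The "reasonable constraints" are expressed as per-parameter bounds.
bounds = {"zipf_exponent": (0.5, 2.0), "correlation_scale": (0.0, 0.5)}
worst_params, worst_gap = adversarial_search(toy_gap, bounds)
print(worst_params, worst_gap)
```

A negative gap would mark a region of parameter space where the new method does worse than the baseline. An LLM-driven search would replace the uniform sampling with proposed parameters, but the bounds and the gap objective could stay the same.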